Some algorithms for relational operators are especially appropriate for parallel execution,
either because they minimize data flow, or because they better tolerate data and execution skew.
Improved algorithms have been found for most of the relational operators. The evolution of join
operator algorithms is sketched here as an example of these improved algorithms.
Recall that the join operator combines two relations, A and B, to produce a third relation
containing all tuple pairs from A and B with matching attribute values. The conventional way of
computing the join is to sort both A and B into new relations ordered by the join attribute. These
two intermediate relations are then compared in sorted order, and matching tuples are inserted in
the output stream. This algorithm is called sort-merge join.
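The sort-merge procedure just described can be sketched in a few lines. This is an illustrative single-node version (relations as lists of dicts, with a `key` argument naming the join attribute), not the parallel implementation discussed here:

```python
def sort_merge_join(a, b, key):
    # Sort both relations on the join attribute: the n log(n) step.
    a = sorted(a, key=lambda t: t[key])
    b = sorted(b, key=lambda t: t[key])
    out, i, j = [], 0, 0
    # Merge the two sorted streams, emitting matching tuple pairs.
    while i < len(a) and j < len(b):
        if a[i][key] < b[j][key]:
            i += 1
        elif a[i][key] > b[j][key]:
            j += 1
        else:
            # Equal keys: emit the cross product of both runs of duplicates.
            v = a[i][key]
            i_end = i
            while i_end < len(a) and a[i_end][key] == v:
                i_end += 1
            j_end = j
            while j_end < len(b) and b[j_end][key] == v:
                j_end += 1
            for ta in a[i:i_end]:
                for tb in b[j:j_end]:
                    out.append((ta, tb))
            i, j = i_end, j_end
    return out
```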
Many optimizations of sort-merge join are possible, but since sort has execution cost
n log(n), sort-merge join has an n log(n) execution cost. Sort-merge join works well in a parallel
dataflow environment unless there is data skew. In case of data skew, some sort partitions may
be much larger than others. This in turn creates execution skew and limits speedup and scaleup.
These skew problems do not appear in centralized sort-merge joins.
Hash-join is an alternative to sort-merge join. It has linear execution cost rather than
n log(n) execution cost, and it is more resistant to data skew. It is superior to sort-merge join
unless the input streams are already in sorted order. Hash join works as follows. Each of the
relations A and B is first hash-partitioned on the join attribute. A partition of relation A is
loaded into an in-memory hash table. The corresponding partition of relation B is scanned, and each tuple
is compared against the main-memory hash table for the A partition. If there is a match, the pair
of tuples are sent to the output stream. Each pair of hash partitions is compared in this way.
The hash join algorithm breaks a big join into many little joins. If the hash function is
good and if the data skew is not too bad, then there will be little variance in the hash bucket size.
In these cases hash-join is a linear-time join algorithm with linear speedup and scaleup. Many
optimizations of the parallel hash-join algorithm have been discovered over the last decade. In
pathological skew cases, when many or all tuples have the same attribute value, one bucket may
contain all the tuples. In these cases no known algorithm achieves speedup or scaleup.
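The two phases described above, hash-partitioning both inputs and then building and probing an in-memory hash table per partition pair, can be sketched as follows. This is a single-process illustration; in a parallel system each iteration of the per-partition loop would run as an independent little join:

```python
from collections import defaultdict

def hash_join(a, b, key, n_partitions=4):
    # Phase 1: hash-partition both relations on the join attribute.
    pa, pb = defaultdict(list), defaultdict(list)
    for t in a:
        pa[hash(t[key]) % n_partitions].append(t)
    for t in b:
        pb[hash(t[key]) % n_partitions].append(t)
    out = []
    # Phase 2: join each pair of corresponding partitions independently.
    for p in range(n_partitions):
        table = defaultdict(list)      # in-memory hash table for A's partition
        for ta in pa[p]:
            table[ta[key]].append(ta)
        for tb in pb[p]:               # scan B's partition and probe the table
            for ta in table.get(tb[key], []):
                out.append((ta, tb))
    return out
```

Each tuple is touched a constant number of times, which is the source of the linear execution cost mentioned above.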
Full-text search is a common operation in document and content-centric XML applications. DB2’s existing
text search capabilities have been extended to work with the new XML column type. Full-text indexes
with awareness of XML document structures can be defined on any native XML column. The documents in an
XML column can be fully indexed or partially indexed, e.g. if it is known in advance that only a
certain part of each document will be subject to full-text search, such as a “description” or “comment”
element. Correspondingly, text search expressions can be applied to specific paths in a document.
The following statement defines a text index which fully indexes the documents in the XML column deptdoc
in our table dept in the database personneldb:
create index myIndex for text on dept (deptdoc) format xml connect to personneldb
The following query exploits this index but restricts the search to a specific element. The query
retrieves all documents where the element ‘/dept/comment’ contains the word “Brazil”:
select deptdoc from dept where
contains(deptdoc, 'sections("/dept/comment") "Brazil"') = 1
Text search in specific parts of the documents is a critical feature for many applications. Standard
text search features are also available, such as scoring and ranking of search results as well as
thesaurus-based synonym search.
For best performance of XML insert, update, and delete operations, the text index is maintained
asynchronously, i.e. not within the context of a DML transaction. However, an "update index" command
is available to force synchronization of the text index.
The Knowledge Base is a dynamic part of the system that can be supplemented and refreshed through the
Intelligent KB Editor. Note that there are two potential sources of knowledge to be discovered for
the proposed system: the analysis of the theoretical background behind the feature extraction and
classification methods, and field experiments.
In the first case, knowledge is formulated by an expert in the area of the specific feature extraction methods
and classification schemes, and then represented as a set of rules by a knowledge engineer in terms of a
knowledge representation language that is supported by the system. We argue that it is possible and
reasonable to categorise the facts and rules that are present in the Knowledge Base. Categorisation can be
done according to the way the knowledge has been obtained: was it derived from the analysis of
experimental results or from the domain theory, and was it entered automatically by the Intelligent KB Editor or by a
knowledge engineer (who could be a data miner as well)? Another categorisation criterion is the level of
confidence of a rule. The expert can be certain of one fact but may merely hypothesize about another.
Similarly, a rule that has just been generated from the analysis of experiments on
artificially generated data sets but has never been verified on real-world data sets should definitely
not have the same level of trust as a rule that has been verified on a number of real-world problems.
In addition to this "trust" categorisation of the rules, it is possible to adapt the system to a
concrete researcher's needs and preferences by giving higher weights to the rules that are of most
interest to that researcher.
In the second case, a data miner can discover knowledge during the analysis of results obtained from
the experiments, in the form of separate facts, trends, and dependencies. In the same manner, the discovered knowledge is
represented as a set of rules by a knowledge engineer using the knowledge representation language.
Alternatively, the knowledge acquisition process can be automatic, i.e. the knowledge discovery process
is accomplished without any involvement of a human expert. This can be done by deriving new rules and
updating the old ones based on the analysis of results obtained during self-run experiments.
In both of the latter cases we face the problem of learning how the Intelligent KB Editor should build a
classification or regression model on the meta-data resulting from experiments. In this context the input
parameters for a classification model are the characteristics of a specific data set, together with the
model's output measures such as accuracy, sensitivity, specificity, time complexity, etc. The combination of a feature extraction
method's and a classification model's names with their parameter values represents the class label. When
building a regression model, the meta-data-set attributes are the data set characteristics and the feature
extraction method's and classification model's names, and one of the model output characteristics is the
(continuous) attribute whose value has to be predicted.
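As a rough illustration of the classification formulation above, the sketch below uses a 1-nearest-neighbour meta-model over data set characteristics. The meta-data rows, the characteristic triples, and the scheme names are hypothetical placeholders, not values from the actual experiments:

```python
import math

# Hypothetical meta-data: each row maps data set characteristics
# (n_attributes, n_instances, n_classes) to the best-performing
# (feature extraction method + classifier) combination, i.e. the class label.
meta_data = [
    ((10, 500, 2),  "PCA+NaiveBayes"),
    ((80, 3000, 8), "PCA+kNN"),
    ((5, 150, 3),   "NoFE+DecisionTree"),
]

def predict_scheme(characteristics):
    """1-nearest-neighbour meta-model: recommend the scheme whose
    experimental data set is closest in characteristic space."""
    def dist(c1, c2):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(c1, c2)))
    return min(meta_data, key=lambda row: dist(row[0], characteristics))[1]
```

A real meta-learner would of course normalise the characteristics and be trained on a much larger meta-data set; this only shows the shape of the mapping from characteristics to a recommended scheme.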
The results obtained at the present stage of the research show a high level of complexity in the
dependencies between the data set characteristics and the best-suited scheme for the data mining process.
In order to further develop our understanding it is necessary to proceed with the research through the following steps:
- Generation of artificial data sets with known characteristics (simple, statistical and information-theoretic);
- Design of experiments on the generated artificial data sets;
- Derivation of dependencies and definition of the criteria from the obtained results;
- Development of a knowledge base defining a set of rules on the set of obtained criteria;
- Proof of the constructed theory with a set of experiments on real-world data sets.
Thus, three basic research methods are used in the research: the theoretical approach, the constructive
approach, and the experimental approach. These approaches are closely related and are applied in parallel.
The theoretical backgrounds are exploited during the constructive work and the constructions are used for
experimentation. The results of constructive and experimental work are used to refine the theory.
An example of such a procedure can be presented as:
- Generation of artificial data sets with the number of attributes from 2 to 100, with the number of
instances from 150 to 5000, with the number of classes from 2 to 10, with the average correlation
between the attributes from 10% to 90%, with the average noisiness of attributes from 10% to 50%,
with the percent of irrelevant attributes from the total number of attributes from 10% to 50%.
- Design of the experiments on generated artificial data sets and analysing accuracy and efficiency of
the classification model built on different learning algorithms and using different feature extraction
methods. Tuning of the input parameters for each combination is required.
- Analysis of the dependencies and trends between output accuracies and efficiencies, feature
extraction methods and classifiers, their input parameters, and pre-defined data set characteristics.
- Definition of a set of rules that reflect found dependencies and trends.
- Execution of a number of experiments on UCI data sets using DSS for the best-suited feature
extraction method and classifier selection.
- Addition of the invented rules that were successfully validated during the tests on the benchmark
data sets to the knowledge base.
High-end graphics workstations (such as the Silicon Graphics RealityEngine) are typically optimized
for the display of textured surfaces, while low-end workstations (such as the Silicon Graphics Indy)
are typically optimized for the display of untextured surfaces. Given these capabilities, the most
obvious way to partition rendering between a high-end server and a low-end client is to omit surface
texture on the client. To demonstrate this, we consider a room composed of flat surfaces that exhibit
smooth shading and texture (see figure 2). The model contains 1131 polygons with a fixed color at each
vertex. This color was calculated using a hierarchical radiosity algorithm that approximates the
diffuse interreflection among textured surfaces. The high-quality rendering (figure 2a) employs
antialiasing, Gouraud-interpolated shading, and texturing. The low-quality rendering (figure 2b)
employs antialiasing and Gouraud-interpolated shading but no texturing. The difference between the two
renderings is shown in figure 2c.
Figures 2d through 2g show image-based compression of the high-quality rendering using varying JPEG
quality factors. Figures 2h through 2k show polygon-assisted compression using quality factors selected
to match as closely as possible the code sizes in figures 2d through 2g, assuming that the geometric model
resides on both machines. The quality factors, code sizes, and compression rates are given below each
image. Figures 2l and 2m give one more pair, enlarged so that details may be seen.
Table I: Image-based compression versus polygon-assisted compression, compared for the scenes pictured in
figures 2 and 3. In D, we select a quality factor that gives 20 frames per second while requiring 2 Mb/s or
less of network bandwidth. In E, we select a quality factor that matches as closely as possible the code
size obtained in D, assuming that the low-quality geometric model resides on both client and server. In F,
we estimate the number of bytes required to generate D assuming that a losslessly compressed
low-quality model is transmitted from server to client.
In every case, polygon-assisted compression is superior to image-based compression. There are two distinct
reasons for this:
1. The polygon-assisted rendering contains undegraded edges and smoothly shaded areas – precisely those
features that fare poorly in JPEG compression.
2. The difference image contains less information than the high-quality rendering, so it can be compressed
using a higher JPEG quality factor without increasing code size – even higher than is required to compensate
for the division by 2 in the difference image representation. Thus, texture features, which are present only
in the difference image, fare better using our method.
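The division by 2 mentioned above maps the signed per-pixel difference (range [-255, 255]) into the unsigned 8-bit range so the difference image can be JPEG-compressed like any other image. A minimal sketch of this representation, assuming single-channel pixel lists, might look like:

```python
def encode_difference(high, low):
    # Signed difference halved and offset by 128 to fit in [0, 255].
    return [max(0, min(255, (h - l) // 2 + 128)) for h, l in zip(high, low)]

def decode_difference(low, diff):
    # Undo the offset and halving, then add back the low-quality pixel.
    return [max(0, min(255, l + (d - 128) * 2)) for l, d in zip(low, diff)]
```

Halving discards the least significant bit of the difference, so odd differences are reconstructed only to within one intensity level; this is the loss the higher JPEG quality factor must compensate for.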
As an alternative to comparing images at matching code sizes, we can compare the code sizes of images of
equal quality. Unfortunately, such comparisons are difficult because the degradations of the two methods
are different – polygon-assisted compression always produces perfect edges and smooth shading, while JPEG
never does. If one allows that figure 2j generated using polygon-assisted compression is comparable in
quality to figure 2d generated using image-based compression, then our method gives an additional 3x
compression for this scene.
Table I also estimates the number of bytes required to generate figure 2m if the model is transmitted from
server to client using the lossless compression method proposed. This size (13529 bytes) lies between the
code sizes of figures 2e and 2f. Even in this case, polygon-assisted compression is superior in image quality
to image-based compression, both in terms of its edges and smooth shading and in terms of the JPEG quality
factor used to transmit the texture information.
A fundamental limit of indirect statistics-based feedback is that it classifies link conditions as
either “good” or “bad.” This binary information provides some notion about the direction in which
to adapt the rate setting, but does not suffice to select the appropriate rate at once. This leads
to a slow step-by-step accommodation to large changes in conditions, and introduces the risk of
oscillation in stable conditions. A better approach is to use direct measurements of the link
conditions. The SNR is directly related to the bit error rate (BER) of the link, and hence to the
FER. Consequently, the SNR is linked to packet delay, jitter, and throughput, and has the
potential to provide rich feedback for automatic rate control. Knowing the current SNR and the
throughput vs SNR curves for each rate setting (e.g., Fig. 2) would solve the rate selection problem.
Figure 2. Throughput vs. SNR for some 802.11 modulation schemes
Despite the advantages, SNR-based rate control has not been applied in practice yet because of the
following three problems:
1) In reality, for certain link conditions the relation between the optimal rate and the SNR is highly
variable. This is due to the imperfection of the models describing the radio channel.
2) It is not trivial to obtain a reliable estimate of the SNR of a link. Many radio interfaces only
provide an uncalibrated signal strength indication (SSI).
3) The rate controller, which resides on the sending side, in fact needs the SNR observed at the receiver.
Most work on using SNR information for automatic rate control is based on simulation and does not
consider the practical difficulties of obtaining good SNR estimates; it concentrates on the way in
which the noisy and drifting SNR can be used to determine the correct rate setting. Some work does
address the issue of how to communicate SNR values back to the sender, but its rate selection
algorithm still relies on a straight SNR threshold technique. Another approach assumes that the
channel is symmetric, meaning that the SNR observed at either station is very similar at any given
point in time. This assumption makes it possible to use the SNR of the last ACK frame as an indicator
of the SNR at the other side, and to select the rate of the next data frame accordingly.
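To illustrate how throughput-vs-SNR curves like those in Fig. 2 could drive rate selection directly, consider the toy sketch below. The threshold values and the step-shaped throughput model are invented for illustration and are not measured 802.11 curves:

```python
def throughput(rate_mbps, snr_db):
    # Toy curve: each rate needs a minimum SNR (dB) to deliver frames;
    # below it throughput collapses, above it throughput saturates.
    # These thresholds are hypothetical, not measured values.
    thresholds = {1: 2, 2: 5, 5.5: 8, 11: 12}
    return rate_mbps if snr_db >= thresholds[rate_mbps] else 0.0

def select_rate(snr_db, rates=(1, 2, 5.5, 11)):
    """Pick the rate whose curve gives the highest throughput at this SNR."""
    return max(rates, key=lambda r: throughput(r, snr_db))
```

Given a reliable SNR estimate, such a lookup selects the appropriate rate in one step, avoiding the slow step-by-step accommodation and oscillation of binary good/bad feedback.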