Posts Tagged data flow
Some algorithms for relational operators are especially appropriate for parallel execution,
either because they minimize data flow, or because they better tolerate data and execution skew.
Improved algorithms have been found for most of the relational operators. The evolution of join
operator algorithms is sketched here as an example of these improved algorithms.
Recall that the join operator combines two relations, A and B, to produce a third relation
containing all tuple pairs from A and B with matching attribute values. The conventional way of
computing the join is to sort both A and B into new relations ordered by the join attribute. These
two intermediate relations are then compared in sorted order, and matching tuples are inserted in
the output stream. This algorithm is called sort-merge join.
Many optimizations of sort-merger join are possible, but since sort has execution cost
nlog(n), sort-merge join has an nlog(n) execution cost. Sort-merge join works well in a parallel
dataflow environment unless there is data skew. In case of data skew, some sort partitions may
be much larger than others. This in turn creates execution skew and limits speedup and scaleup.
These skew problems do not appear in centralized sort-merge joins.
Hash-join is an alternative to sort-merge join. It has linear execution cost rather than
nlog(n) execution cost, and it is more resistant to data skew. It is superior to sort-merge join
unless the input streams are already in sorted order. Hash join works as follows. Each of the
relations A and B are first hash partitioned on the join attribute. A hash partition of relation A is
hashed into memory. The corresponding partition of table relation B is scanned, and each tuple
is compared against the main-memory hash table for the A partition. If there is a match, the pair
of tuples are sent to the output stream. Each pair of hash partitions is compared in this way.
The hash join algorithm breaks a big join into many little joins. If the hash function is
good and if the data skew is not too bad, then there will be little variance in the hash bucket size.
In these cases hash-join is a linear-time join algorithm with linear speedup and scaleup. Many
optimizations of the parallel hash-join algorithm have been discovered over the last decade. In
pathological skew cases, when many or all tuples have the same attribute value, one bucket may
contain all the tuples. In these cases no algorithm is known to speedup or scaleup.