Some algorithms for relational operators are especially appropriate for parallel execution,
either because they minimize data flow, or because they better tolerate data and execution skew.
Improved algorithms have been found for most of the relational operators. The evolution of join
operator algorithms is sketched here as an example of these improved algorithms.
Recall that the join operator combines two relations, A and B, to produce a third relation
containing all tuple pairs from A and B with matching attribute values. The conventional way of
computing the join is to sort both A and B into new relations ordered by the join attribute. These
two intermediate relations are then compared in sorted order, and matching tuples are inserted in
the output stream. This algorithm is called sort-merge join.
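The sort-merge procedure just described can be sketched in Python. This is a minimal illustration under assumed conventions (tuples as Python tuples, a caller-supplied key function for the join attribute); the name `sort_merge_join` and the data layout are ours, not from the original.

```python
def sort_merge_join(a, b, key):
    """Join relations a and b on the attribute extracted by key.

    Both inputs are sorted first (the n log(n) step), then merged in a
    single pass, emitting every matching pair of tuples.
    """
    a = sorted(a, key=key)
    b = sorted(b, key=key)
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        ka, kb = key(a[i]), key(b[j])
        if ka < kb:
            i += 1
        elif ka > kb:
            j += 1
        else:
            # Find the run of equal keys in b, then pair it with the
            # run of equal keys in a (handles duplicate join values).
            j_end = j
            while j_end < len(b) and key(b[j_end]) == ka:
                j_end += 1
            while i < len(a) and key(a[i]) == ka:
                for jj in range(j, j_end):
                    out.append((a[i], b[jj]))
                i += 1
            j = j_end
    return out
```

For example, joining `[(1, 'a'), (2, 'b'), (2, 'c')]` with `[(2, 'x'), (3, 'y')]` on the first field yields the two pairs with key 2.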
Many optimizations of sort-merge join are possible, but since sorting has execution cost
n log(n), sort-merge join also has n log(n) execution cost. Sort-merge join works well in a parallel
dataflow environment unless there is data skew. In case of data skew, some sort partitions may
be much larger than others. This in turn creates execution skew and limits speedup and scaleup.
These skew problems do not appear in centralized sort-merge joins.
Hash-join is an alternative to sort-merge join. It has linear execution cost rather than
n log(n) execution cost, and it is more resistant to data skew. It is superior to sort-merge join
unless the input streams are already in sorted order. Hash join works as follows. Each of the
relations A and B is first hash-partitioned on the join attribute. A hash partition of relation A is
hashed into a main-memory table. The corresponding partition of relation B is then scanned, and each
tuple is compared against the main-memory hash table for the A partition. If there is a match, the pair
of tuples is sent to the output stream. Each pair of hash partitions is compared in this way.
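The partition-then-probe procedure above can be sketched as follows. This is an illustrative single-process sketch, not a parallel implementation; the function names and the use of Python's built-in `hash` are assumptions of the example.

```python
def hash_partition(rel, key, n):
    """Split a relation into n buckets by hashing the join attribute."""
    parts = [[] for _ in range(n)]
    for t in rel:
        parts[hash(key(t)) % n].append(t)
    return parts

def hash_join(a, b, key, n_parts=4):
    """Partition both relations, then join corresponding partitions.

    For each partition pair: build an in-memory hash table on the A
    partition, probe it with every tuple of the B partition. Total work
    is linear in the input size if the buckets are balanced.
    """
    pa = hash_partition(a, key, n_parts)
    pb = hash_partition(b, key, n_parts)
    out = []
    for part_a, part_b in zip(pa, pb):
        table = {}
        for ta in part_a:                     # build phase
            table.setdefault(key(ta), []).append(ta)
        for tb in part_b:                     # probe phase
            for ta in table.get(key(tb), []):
                out.append((ta, tb))
    return out
```

In a parallel system each partition pair would be joined by a separate processor; here the loop over partitions simply runs sequentially.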
The hash join algorithm breaks a big join into many little joins. If the hash function is
good and if the data skew is not too bad, then there will be little variance in the hash bucket size.
In these cases hash-join is a linear-time join algorithm with linear speedup and scaleup. Many
optimizations of the parallel hash-join algorithm have been discovered over the last decade. In
pathological skew cases, when many or all tuples have the same attribute value, one bucket may
contain all the tuples. In such cases, no known algorithm achieves speedup or scaleup.
For graph-based authentication, the main challenge is to design a Directed Acyclic Graph (DAG)
with the lowest overhead, the highest verification probability, and the lowest sender and receiver
delays. There are, however, tradeoffs among these performance criteria, which are summarized below.
- Computation complexity: The number of hash operations and signature operations required at the
sender and receiver. Note that computing a signature is much more complex than computing a hash.
- Overhead size: The extra bytes introduced by stream authentication, including the hashes and
signatures appended to the packets. The overhead size is determined by the number of edges in
the authentication graph. Note that a signature is much bigger in size than a hash.
- Verification percentage (or verification probability): The percentage of verifiable packets
among all received packets. Intuitively, the more redundant paths a packet has to the
signature packet, the higher its probability of being verified.
- Sender delay: The delay at the sender (in number of packets) from the time when the packet
is produced by the encoder to the time that all authentication data appended to this packet
is ready. Real-time communication scenarios require a low sender delay. For non-real-time
scenarios, e.g., pre-encoded content for VOD applications, sender delay is not important because
the sender has a priori knowledge of all packets.
- Receiver delay: The delay at the receiver (in number of packets) from the time a packet is
received to the time that it can be verified. For authenticated video, each packet must be
received and verified before its playout deadline.
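The criteria above can be made concrete with the simplest possible authentication graph: a hash chain, a degenerate DAG with one edge per packet. This is an illustrative sketch only; the signature is simulated here with a keyed hash (a stand-in for a real digital signature), and all names are ours. Note how the single path to the signature packet gives minimal overhead (one hash per packet) but maximal sender delay and fragile verification: one damaged packet makes every later packet unverifiable.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()   # 32-byte digest

def build_chain(packets):
    """Packet i carries the hash of (augmented) packet i+1; only packet 0
    is signed. The sender must see the whole stream before sending
    packet 0, i.e., sender delay equals the stream length."""
    augmented = [None] * len(packets)
    augmented[-1] = packets[-1]            # last packet carries no hash
    for i in range(len(packets) - 2, -1, -1):
        augmented[i] = packets[i] + h(augmented[i + 1])
    # Stand-in for a digital signature over the first augmented packet.
    signature = h(b"signing-key" + augmented[0])
    return augmented, signature

def verify_chain(augmented, signature):
    """A packet is verifiable only if every packet on its path to the
    signature packet arrived intact."""
    if h(b"signing-key" + augmented[0]) != signature:
        return [False] * len(augmented)
    ok = [True]
    for i in range(1, len(augmented)):
        expected = augmented[i - 1][-32:]  # hash carried by previous packet
        ok.append(ok[-1] and h(augmented[i]) == expected)
    return ok
```

Richer DAGs trade overhead for robustness: appending each hash to several packets adds edges (larger overhead size) but creates the redundant paths that raise the verification percentage under loss.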