Combining Computation and Communication Costs

We have so far been concerned with the communication costs incurred by repartitioning and have considered the cost of computing (i.e. disk and cpu costs other than communication) an operator to be independent of the partitioning attribute. We now extend our model to account for the interaction of these costs and show how the basic ideas of the ColorSplit algorithm carry over to the extended model to yield an efficient optimization algorithm.  Several alternate strategies, each with a different cost, may be available for an operator. The following example shows the interaction between between computation and communication costs using the standard scenario of having several strategies for computing operators such as joins and grouping.

Example 1 Given the schema Emp ( emp# , salary, dep# , city) and Dep(dep#, city), the following query finds the average salaries of employees grouped by city for those employees who live in the same city as the the location of their department.

Select e.city, avg( e.salary)
From Emp e, Dep d
Where e.dep# = d.dep# and e.city = d.city
Group by e.city;

Suppose Emp is partitioned by city and each partition is stored in sorted order by city. Suppose Dep is partitioned by dep# and each partition has an index on dep#. Figure 10 shows two query trees. The computation of Avg is assumed to be combined with GroupBy. The first query tree uses the join predicate on dep# and repartitions the Emp table. Due to the availability of an index on Dep, a
nested-loops strategy may be the cheapest for joining each partition of Emp (outer) with its corresponding partition of Dep (inner). The grouping operator is implemented by a hash-grouping strategy.

The second query tree uses the join predicate on city and repartitions the Dep table. Since each partition of Emp is presorted, it may be cheapest o use a sort-mergejo in for joining corresponding partitions. Since the output of merge join is pre-sorted in addition to being pre-partitioned on the city, the grouping operator uses a sort-grouping strategy.  The example illustrates several points. Firstly, while partitioning impacts communication costs, other physical properties (sort-order and indexes) impact computation costs. We will generalize the notion of a color to capture all physical properties.

Secondly, a strategy expects its inputs to have certain physical properties and guarantees its output to have some other properties. We will specify such input-output constraints using color patterns.

Thirdly, the overall cost is reduced when an input to a strategy happens to have the expected physical property. We will therefore break the cost of computing an operator into the intrinsic cost of the strategy itself and the cost of getting the inputs into the right form. The latter will be modeled as a recoloring cost that may or may not be incurred.