Christos H Papadimitriou Mihalis Yannakakis Multiobjective Query Optimization

Multiobjective Query Optimization

Источник:http://www.sigmod.org/pods/proc01/

Christos H. Papadimitriou Division of Computer Science U. C. Berkeley Berkeley, CA 94720 christos@cs.berkeley.edu

Mihalis Yannakakis Bell Laboratories Lucent Technologies Murray Hill, NJ 07974 mihalis@research.bell-labs.com

Abstract

The optimization of queries in distributed database sys- tems is known to be subject to delicate trade-os. For example, the Mariposa database system allows users to specify a desired delay-cost tradeo(that is, to supply a decreasing function u(d), specifying how much the user is willing to pay in order to receive the query results within time d); Mariposa divides a query graph into horizontal /strides," analyzes each stride, and uses a greedy heuristic to nd the /best" plan for all strides. We show that Mariposa's greedy heuristic can be ar- bitrarily far from the desired optimum. Applying a re- cent approach inmultiobjective optimization algorithms to this problem, we show that the optimum cost-delay trade-o (Pareto) curve in Mariposa's framework can be approximated fast within any desired accuracy. We also present a polynomial algorithm for the general mul- tiobjective query optimization problem, which approx- imates arbirarily well the optimum cost-delay tradeo (without the restriction of Mariposa's heuristic stride subdivision).

1 Introduction

In Computer Science we have always been interested in trade-os between various resources for the solution of computational problems; however, it was only very re- cently that trade-os have begun to be considered as computational problems in their own right |probably the result of the advent of the Internet and the en- suing increasingly complex socio-economic context of computation.1

Consider the query optimization problem, for exam- ple, arguably the most important and complex problem in databases. Trade-os between parallelism and com- munication in query optimization have been studied in [CHM, HM], and elsewhere. The Mariposa wide-area database system [SAP+] was architected to make such trade-os explicit in an advantageous way. Mariposa as- sumes that a subquery can be executed in many diverse database sites, and each site submits a /bid" for the query, specifying a delay for delivering the result, and an associated cost. The query optimizer then compares these bids to a user-supplied cost-delay trade-o (a non- increasing function u(d), specifying for each value d of the delay the amount of money the user is willing to pay in order to receive the query's results within time d), and attempts to determine the combination of subquery executions that maximizes the savings from this user- submitted curve. Mariposa's algorithm for this involves subdividing a query graph (obtained from a single-site optimizer) into horizontal /strides" that contain inde- pendent subqueries, compiling a cost-delay trade-o for each stride from the bids for these subqueries, and then combining the stride trade-os by a greedy heuristic.

Expecting a user to submit a desired trade-o curve seems to us a little unrealistic. Amuch better user inter- face would be, instead, presenting to the user the costs and delays of several query plans, for his/her choice. Obviously, we only need to present query plans that are undominated, in that each has the property that no other plan is better in terms of both cost and delay. The set of all undominated solutions is called in the eld of multiobjective (or multicriterion) optimization a Pareto curve |it captures the informal concept of a trade-o. Obviously, presenting to the user such a set of solutions to choose from is far superior to requiring a user-supplied trade-o and returning a solution that is in some rather arbitrary sense /most advantageous"with respect to the trade-o supplied, as Mariposa does.

By a fortunate coincidence, there has been recently much progress in the problem of computing Pareto curves. Even though there are often exponentially many solu- tions on a Pareto curve, and even the problem of nding the solution that optimizes one objective subject to the second being below a limit is NP-hard in very simple settings (e.g., shortest paths in a graph in which each edge has both a length and a delay), it is often possible to obtain in polynomial time an approximate trade-o, which is arbitrarily close to the true Pareto curve [PY].

In this paper we apply this approach in multiob- jective optimization to the query optimization problem. First, we characterize the set of plans produced by Mari- posa's greedy heuristic in terms of the Pareto curve, and show that for convex budget functions (a variant of) the heuristic nds the optimal solution (Theorem 1). How- ever, we observe that in general the heuristic can lead us arbitrarily far from the best trade-o (Theorem 4). In contrast, we present an algorithm which, given the trade-os from the strides as computed by Mariposa, constructs, for any , a set of query plans such that any other plan is approximately dominated (up to a factor ) by a plan in the constructed set (Theorem 5). This approximate solution is called an -Pareto curve.

But these results concern the optimization problem as specied by the Mariposa optimizer, after a heuris- tic step in which the query has been partitioned into horizontal strides. It would be much more interesting if we could compute the approximate optimum trade- o of the given query graph ab ovo, essentially opti- mizing also the partition into strides (of course, nd- ing the best query graph is a well-known hard prob- lem, which is orthogonal to our interests here). We develop a polynomial approximation algorithm for this problem, based on dynamic programming (Theorem 7). Our algorithm also works in the more complicated case in which the computation can carried out in many sites, and delays and costs due to communication must be taken into account. Of course these algorithms are ap- proximate; however, the approximation can become ar- bitrarily accurate at a reasonable computational cost; and the approximation is necessary, because the exact Pareto curves are both huge in size, and contain points that are hard to compute individually.

In Section 2 we describe the main relevant aspects of Mariposa; we also review the basic concepts and certain results from multi-objective optimization. In Section 3 we dene the query multiobjective optimization prob- lem in the context of Mariposa's query optimizer. We analyze the Mariposa heuristic in terms of the Pareto curve, and describe a polynomial algorithm for obtain- ing the -Pareto curve for this problem. Finally, in Sec- tion 4 we give the general algorithm for nding the - Pareto curve for a given query tree.

2 Background

2.1 Query Optimization in Mariposa

In Mariposa, a single-site optimizer rst transforms an SQL query into a query graph. Then a fragmenter fur- ther subdivides the nodes by taking into account the way in which (fragments of) the base relations are stored in the various sites; the fragmenter also groups indepen- dent operations in the fragmented query graph into par- allel strips called strides. Subqueries in a stride can be executed in parallel, but all subqueries of a stride must complete before the next stride can start (relaxing this to allow some pipelining is stated as a future problem in [SAP+]).

The atomic operations of the fragmented query plan are then passed to a broker, who invites various sites (assumed to belong to dierent corporate entities) to submit bids for each operation. A bid is a point in the cost-delay plane (plus an expiration date, a detail that we omit for simplicity). The broker combines the bids and produces a Pareto curve for the stride (that is, several plans, each with its own total cost and delay).

Last, a coordinator takes the Pareto curves for each stride, and attempts to combine them optimally to ob- tain the /optimum" execution plan. Among the possible execution plans, each with its cost c and delay d, Mari- posa seeks the one that has the following property: If u(d) is the amount the user would be willing (according to his/her trade-o curve submitted with the query) to pay for delay d, then the chosen query plan maximizes the dierence u(d)??c. In other words, Mariposa's goal is to choose the point that maximizes the /vertical" dis- tance from the user-supplied trade-o (this seems rea- sonable, yet rather arbitrary). This is the most intrigu- ing and crucial algorithmic step in Mariposa's query op- timizer.

Mariposa attempts to accomplish this optimization by a greedy algorithm: The coordinator rst determines the plan of minimum delay: It is the combination of all plans of minimum delay for each stride; its delay d0 is the sum of all minimum delays, and its cost c0 is the sum of all associated (maximum) costs. Thereafter, the algorithm moves iteratively from one global plan to an- other by substituting one component of the current plan by another bid in the same stride, as follows. For each stride, and for each bid for the stride, the algorithm computes the cost gradient for the bid, i.e., the ratio of the cost savings over the additional delay incurred if we use this bid for the stride in place of the bid in the current global plan. The algorithm nds the the bid b among all bids of all strides that has the maximum cost gradient, and considers the new global plan obtained from the current plan by using bid b for the correspond- ing stride instead of the one in the current plan. If the new global plan is not better than the original (i.e. does not yield a better prot u(d) ?? c) then the algo- rithm does not make the substitution and it terminates with the current plan. Otherwise, the algorithm makes the substitution, recomputes the cost gradients for the stride that contains the bid b that was substituted, and continues in the same manner with the new plan.

2.2 Multiobjective Optimization

Usually, an optimization problem is a way for associat- ing with each instance x a set of feasible solutions F(x), and, for each s 2 F(x), a cost c(s). More formally, an optimization problem is a pair of polynomial algorithms f; c, such that, for any instance x (say, an SQL query) and candidate solution s (say, a query plan), f(x; s) is a Boolean determining whether s 2 F(x) (i.e., it deter- mines whether s is indeed a valid query plan for x), and c(s) computes the cost of s (assumed to be a positive integer, e.g., an estimate of the number of operations needed to execute s).

A k-objective optimization problem is then a (k+1)- tuple of polynomial functions f; c1; : : : ; ck, where for each feasible solution s we have k dierent ways ci of evaluating its cost. That is, each solution s has an as- sociated k-vector of costs c(s) =< c1(s); : : : ; ck(s) > (a point in k-dimensional space). We say that a k- vector c =< c1; : : : ; ck > dominates another k-vector c0 =< c0 1; : : : ; c0 k > if ci c0 i for all i = 1; : : : ; k (with the inequality strict for some i). Similarly, we say that a solution s dominates another solution s0 if c(s) dom- inates c(s0). A feasible solution s 2 F(x), and its cost vector c(s), is called undominated if there is no other solution s0 that dominates it. Given an instance x of a multiobjective optimization problem, the Pareto curve of x, P(x), is the set of all undominated cost vectors of solutions in F(x). It is customary to use the term Pareto curve (in a slight abuse of notation) to refer also to a set of solutions that realize this set of undominated cost vectors. Intuitively, the points in the Pareto curve capture the notion of a /trade-o" between the various resources ci.

For a typical multiobjective problem (even the short- est path problem with two objectives), the Pareto curve has an exponential number of points, and furthermore it is NP-hard to compute a particular desired point in it (say, the path with smallest rst cost, and second cost bounded by B). Thus, multiobjective optimiza- tion seems computationally bleak. However, recently it was observed [PY] that all multiobjective optimiza- tion problems have a polynomially small -approximate Pareto curve: a set P(x) of solutions that approxi- mately dominate all other solutions up to a relative error of ; that is, formally, for every solution s 2 F(x) there is a solution s0 2 P(x) such that ci(s0) (1 + )ci(s) for all i = 1; : : : ; k. Thus, multiobjective optimiza- tion problems can be divided into those for which the -approximate Pareto curve can be computed in poly- nomial time for some reasonably small , and those for which it cannot. It is reasonable to consider the former to be /tractable."

Our results state that query optimization, starting from a specied query but with multiple choices for each base relation and operation varying with respect to cost and delay, is a tractable problem in this sense; the result holds both in the context of Mariposa's coordinator, and in general.

3 The Mariposa Coordinator Prob lem

Suppose that we have a query subdivided into n hor- izontal strides, and for each stride i we have a list of pairs Pi = ((ci1; di1); : : : ; (cim; dimi )) with cij ci;j+1 and dij di;j+1 for all i and j 6= mi; the pairs stand for the cost and delay of mi possible query plans for each stride i. A feasible (global) query plan (j1; j2; : : : ; jn) is a choice of a plan ji for each stride; its total cost is Pi ci;ji and its delay is Pi di;ji . This is a problem with two objectives, cost and delay. We call this the Mariposa coordinator problem.

We will rst analyse Mariposa's greedy algorithm and relate it to the Pareto curve. Recall that the al- gorithm constructs a sequence of solutions s0; s1; : : :, with associated set of points in the delay-cost plane p0 = (d(s0); c(s0)); p1 = (d(s1); c(s1)); : : :. The algo- rithm makes sure that there is improvement in each step in the prot (i.e. that he quantity u(di)??ci is increasing in each step ) and terminates when this is not the case. Let us ignore for now the user budget function and let the algorithm compute the complete sequence (this will be the case for example if u is a constant function.) If there are no ties among the bids in any steps, then this sequence is uniquely dened. If there are ties, then the set of solutions produced may depend on how the ties are broken. Let us assume for concreteness that when several bids from the same stride are tied, i.e. have the steepest descent, then the algorithm picks the one with the minimum cost; and ties among bids from dierent strides are broken arbitrarily.

For a multiobjective optimization problem, dene its convex Pareto curve CP(x) to be the Pareto curve of the convex hull of the set of cost vectors of the solutions, i.e. the lower left boundary of the convex hull. We identify CP(x) with its set of vertices and the associated set of solutions. Thus, CP(x) is a subset of the points (cost vectors) such that every other cost vector is dominated by a convex combination of some members of CP(x).

In our case, the convex Pareto curve for each of the strides, as well as for the global solution set, is a polyg- onal line in the delay-cost plane. Consider the following algorithm.

Algorithm M'

1. For each stride i, compute the convex Pareto curve CPi of the point set Pi. We consider the ver- tices as ordered in increasing delay (and decreas- ing cost), CPi = (pi0; : : : piri ). Let s0 be the global solution that consists of plan pi0 for each stride i; i.e., s0 is a minimum delay solution.

2. Let st be the current global solution. If there is a stride i for which the corresponding component pij of st is not the last vertex of CPi (equivalently, if t < Pi ri), then choose that stride for which the next edge of CPi has the steepest slope and replace pij by the next vertex pi;j+1 of CPi to form the new global solution st+1.

In case of ties in Step 2 between the strides (i.e. next edges with the same steepest slope), assume that Algorithm M' breaks them the same way as Mariposa.

Lemma 1 Algorithm M' produces the same sequence of solutions as the Mariposa algorithm.

Sketch of proof. By induction. The rst solution is clearly the same. The induction step follows from the fact that for any stride i and vertex pij of CPi , the plan of stride i that yields the steepest descent from pij is the next vertex of CPi. One advantage of Algorithm M' is that it yields a more ecient implementation than the straightforward algorithm, namely the complexity is O(n log n) where n is the size of the input. Suppose that there are l strides, each with m plans (thus the input has size lm). Then the convex Pareto curve CPi for each stride can be com- puted in time O(mlogm). The sequence of solutions can then be generated in time O(log l) per iteration, us- ing a priority queue to maintain the slopes of the next edges. Thus, the total time is O(lm(log l+logm)). The straightforward greedy algorithm would need an addi- tional O(m) time per iteration to recompute the gra- dients of the plans for the stride whose solution was modied; the total time would be O(lm2 + lm log l).

Theorem 1 The sequence of solutions computed by the Mariposa algorithm (and algorithm M') is the convex Pareto curve of the global solution set.

Sketch of proof. Every vertex v of the convex Pareto curve CP is the (unique) optimal solution for some lin- ear function of the cost and the delay a:c + b:d, with a; b 0. Vertex v has the property that the slope ??b=a of the objective function lies between the slopes of the two edges of CP that are incident to v (or v is the rst or last vertex of CP).

For linear objectives, the global optimization prob- lem is trivial because we can optimize separately in each stride. That is, v consists of a solution vi for each stride that optimizes a:c+b:d. Hence, vi is a vertex of CPi, all edges before vi have slope that is steeper than the slope of the objective function, while the next edge does not. Since Algorithm M' chooses edges in order of slope, it follows that it will generate at some point the solution v = (v1; : : : ; vl); that is, it will never proceed in a stride i beyond the vertex vi before all the other strides j reach the appropriate vertex vj .

It follows that every vertex of the convex Pareto curve CP is produced by the algorithm. In case of ties in some steps, Algorithm M' (and Mariposa) may pro- duce some additional intermediate points that lie in the middle of some edges of CP. In fact, the dierence between the solution sequences generated by dierent ways of breaking the ties is in such intermediate points that get produced.

One consequence is that, although the global solu- tion set is exponential (if there are l strides with m plans each, then there are ml solutions), the convex Pareto curve has a relatively small number of vertices - no more than the number of plans in the strides. (We note however that there may be an exponential number of solutions that correspond to intermediate points on the edges of the convex Pareto curve).

Corollary 2 The convex Pareto curve has a linear num- ber of vertices.

A second consequence concerns the optimality of Mariposa's greedy algorithm for certain user budget func- tions. Recall that Mariposa assumes that the user pro- vides a budget function u(d) (a non-increasing func- tion), and the algorithm aims to nd a solution that maximizes the prot u(d) ?? c.

Corollary 3 If the user function u(d) is linear, then Mariposa computes the optimal solution. If u(d) is con- vex, then the optimal solution occurs at a vertex of the convex Pareto curve, although the Mariposa algorithm may terminate prematurely before reaching the vertex.

Sketch of proof. In the linear case, the claim follows from the proof of the Theorem above. (Note however that in this case optimization is trivial.) Suppose u(d) is convex, i.e. u(d1+(1??)(d2)) u(d1)+(1??)u(d2) for any d1; d2 and 0 < < 1. A point p = (d; c) that is not on the convex Pareto curve CP is dominated by a convex combination p0 = v1 + (1 ?? )(v2) of two vertices v1; v2 of CP. Then the prot of p is no greater than the prot of p0 which in turn is no greater than the maximum of the prots of the vertices v1, v2. However, since the Mariposa algorithm stops when a step fails to improve the prot, it may terminate prematurely before reaching the optimal vertex. For example, suppose that u(d) = 8?? 2d if d 2, else 4, and the Pareto curve has three points (plans) with delay-cost (1,4), (2,3), (5,1). Then the algorithm will stop at the rst plan for a prot of 2 (since the second point is worse), although the op- timal plan is the third one for a prot of 4.

Thus, for nonlinear budget functions, it is better not to terminate the algorithm if there is no immediate im- provement, since a better point may be found later on (and the total number of generated global plans is in any case linear). For general (nonconvex) budget functions however even generating all the vertices of the convex Pareto curve may fail to produce an optimal solution. In fact the best vertex may be arbitrarily far from the optimum, or there may even not be a protable vertex, although there are other plans with a positive prot.

Theorem 4 For every number M > 0 there is a non- increasing function u(d) and a set of Pareto curves for strides such that there is a way of combining strides with total cost C and delay D such that u(D) ?? C > M [u(di)??ci] for all the solutions (di; ci) generated by the Mariposa algorithm .

Sketch of proof. Suppose that there are three global plans with delay-cost (1; 2M); (M + 1;M + 1); (2M + 1; 1), and the user budget function is u(d) = 2M + 1 if d M +1, else 4M +3??2d. Then the optimal plan is the second one with prot M, while the other two plans have prot 1. However, (M +1;M +1) does not belong to the convex Pareto curve as it is dominated by the average of the other two points.

Thus, in general one needs to consider points of the Pareto curve that are not on the convex curve. Further- more, a user interface that returns to the user the whole Pareto curve would be a major improvement over one that requires the user to supply a predened trade-o curve. How hard is it to compute the Pareto curve in this setting? It is not hard to see that this problem is in- tractable. First, it is easy to construct cases in which all exponentially many feasible solutions appear at the Pareto curve. Second, nding a single desired solution in the Pareto curve, say, the one that has cost less than B and is as fast as possible, given that constraint, is an NP-hard problem (easy reduction from the knapsack problem). In view of these negative results, it is interesting to compute the -Pareto curve of this biobjective opti- mization problem |that is to say, the overall cost-delay tradeo, approximated to within relative error.

Theorem 5 There is an algorithm that computes the -Pareto curve of the Mariposa coordinator problem in time polynomial in the size of the input and 1=.

Sketch of proof. One way of proving this is to real- ize that this problem can be reduced to the bicriterion shortest path problem (the problem facing a Web-based map service computing the distance/travel time trade- o of a trip). The reduction simulates each stride i by a set of mi parallel edges, with costs and delays re ecting those of the options available at each stride. As the lat- ter problem has a polynomial algorithm for nding its -Pareto curve [Han], the proof would be complete.

We also give here a dynamic programming algorithm, which will be generalized in the next section for gen- eral query trees. To start, we give a (pseudopolyno- mial) algorithm for nding the exact Pareto curve in time O(nL), where L is the smaller of the maximum cost and the maximum delay of any point in the trade- o curve. Assume without loss of generality that the maximum delay is the smaller of the two. The dy- namic programming algorithm computes in an array A[1 : : :L], the minimum cost of a plan for each de- lay d L. The computation is done one stride at a time. Having computed the information for the rst i ?? 1 strides we combine with the set of plans Pi of the ith stride: for d = 1 : : :L, A(d) = min(A(d ?? 1) do minfA(d ?? j) + cj(j; c) 2 Pi; j < dg).

We can compute an -Pareto curve as follows. Par- tition the range of delays from the minimum delay up to L geometrically with ratio 1 + = p1 + , i.e. we dene O(log1+ L) bounds bj with bj+1 = bj(1+). Let r = dl=e, where l is the number of strides. For each bj , do the following. For each plan of each stride, transform its delay from d to bdr=bjc. Compute the minimum cost global plan with transformed delay at most r; this can be done by the dynamic programming algorithm in time nr = O(nl=) (no need to compute array elements be- yond r). Doing this for all bj produces a set of global plans. Remove dominated plans from the set. The re- maining ones form an -Pareto set.

An approximate Pareto curve can be computed along similar lines also in the case where there is a set of sites, each subplan comes with an associated site that bids for it, and there are communication delays between the sites. The problem in this case can be still modeled as a bi-objective shortest path problem for a more compli- cated graph.

4 The Whole Hog

The query optimization problem solved in the previous section is only a subproblem of the Mariposa optimizer: It takes as given the partition into strides, for example. It would be more interesting if we could approximate a query's cost-delay trade-o in a more general setting. Suppose that we are given a query tree T (the base relations could be fragments of the relations in the orig- inal SQL query, as in Mariposa), and, for each node i (operation or input relations) we are given a list of pairs Pi = ((ci1; di1); : : : ; (cimi; dimi )) standing for the cost and delay of various options for implementing this op- eration, given its input data. In the general distributed case, we assume that there is a set S of sites and every pair (cij; dij) of Pi (i.e. plan for execution of node i) has an associated site sij which is bidding to execute the node, and for each node i and pair of sites j; k there is a cost c(i; j; k) and delay d(i; j; k) for shipping the result of node i from site j to site k. It is assumed that a site can execute plans locally in parallel. We wish to choose an option for each node so as to minimize both total cost and total delay of the resulting query plan. We call this the bicriterion query plan problem. It is a generalization of the Mariposa coordinator problem, since that problem is the special case in which the query graph is a straight line (with the strides standing for meta-operations).

We observed in the last section that for some func- tions of cost and delay (eg. linear and convex), one can perform an exact global optimization, and in fact (a variant of) the Mariposa algorithm accomplishes this. This is no more the case, even for simple extensions (and without the generalization to many sites and com- munication costs and delays.)

Theorem 6 The problem of nding a global solution that optimizes a given linear function of cost and delay for a given query tree is NP-hard.

Sketch of proof. The query tree T consists of a path (as in the previous section) and an additional child of the root. The plans of the path simulate an instance of the knapsack problem, and the other child of the root has a single available plan. The objective linear function seeks to rst minimize delay and then break ties by cost.

The problem is only weakly NP-hard; optimization in general can be solved in pseudopolynomial time as we will see below. As it turns out, a dynamic programming technique generalizing the one in the proof of Theorem 5 establishes that the more general bicriterion query plan problem can be eciently approximated:

Theorem 7 There is an algorithm that computes the - Pareto curve of the bicriterion query plan problem in time polynomial in the size of the input and 1=.

Sketch of proof. We give rst a pseudopolynomial algorithm. Let L be the delay of the minimum cost plan (this can be found in polynomial time). For each node i, site s and delay d, let A(i; s; d) be the minimum cost of any plan that completes the computation of node i at site s within time d. These quantities can be computed bottom up by dynamic programming. For the leaves i use the given plans of Pi and the communication costs and delays. For an internal node i with set of children C(i), for d = 1: : :L, compute A(i; s; d) as the minimum of the following quantities: 1. (A(i; s; d ?? 1), 2. the minimum over all other sites j of A(i; j; d ?? d(i; j; s)) + c(i; j; s), 3. the minimum over all plans (cij; dij ) of Pi with site s of the quantities cij+Sigmav2C(i)fA(v; s; d?? dij )g. To compute the -Pareto curve, we partition the range of delays geometrically as in Theorem 5, and then we scale and round the delays, and proceed in a similar manner as before.

5 Conclusion

We showed that the query optimization problem can be usefully studied within the scope of multiobjective op- timization. It is interesting to recall here the work of [EHJ+]: We are given n web sources (horizontal frag- ments of a relation) each with its own delay, cost, and quality (roughly, the probability that a tuple will appear there). They study the tradeo between cost (total), de- lay (maximum) and quality of the union of subsets of the fragments. It is shown in [PY] that this problem can be approximated in polynomial time. This problem can be viewed as the special case of a simple query, but with the additional twist that alternative sites (bids) are not equivalent but oer a range along a third dimension of quality. The more general query optimization problem could be studied also in a similar framework.