The History of Histograms (abridged)
Yannis Ioannidis
Department of Informatics and Telecommunications, University of Athens
Panepistimioupolis, Informatics Buildings
157-84, Athens, Hellas (Greece)
yannis@di.uoa.gr
Abstract
The history of histograms is long and rich, full of detailed information in every step. It includes the course of histograms in different scientific fields, the successes and failures of histograms in approximating and compressing information, their adoption by industry, and solutions that have been given on a great variety of histogram-related problems. In this paper and in the same spirit of the histogram techniques themselves, we compress their entire history (including their "future history" as currently anticipated) in the given/fixed space budget, mostly recording details for the periods, events, and results with the highest (personally-biased) interest. In a limited set of experiments, the semantic distance between the compressed and the full form of the history was found relatively small!
1 Prehistory

The word 'histogram' is of Greek origin, as it is a composite of the words 'isto-s' (ἱστός) (= 'mast', also meaning 'web', but this is not relevant to this discussion) and 'gram-ma' (γράμμα) (= 'something written'). Hence, it should be interpreted as a form of writing consisting of 'masts', i.e., long shapes standing vertically, or something similar. It is not, however, a word that was originally used in the Greek language1. The term 'histogram' was coined by the famous statistician Karl Pearson2 to refer to a "common form of graphical representation". The Oxford English Dictionary quotes from the "Philosophical Transactions of the Royal Society of London", Series A, Vol. CLXXXVI (1895), p. 399, that "[The word 'histogram' was] introduced by the writer in his lectures on statistics as a term for a common form of graphical representation, i.e., by columns marking as areas the frequency corresponding to the range of their base." Stigler identifies the lectures as the 1892 lectures on the geometry of statistics [69].

The above quote suggests that histograms were used long before they received their name, but their birth date is unclear. Bar charts (i.e., histograms with an individual 'base' element associated with each column) most likely predate histograms, and this helps us put a lower bound on the timing of their first appearance. The oldest known bar chart appeared in a book by the Scottish political economist William Playfair3 titled "The Commercial and Political Atlas" (London, 1786) and shows the imports and exports of Scotland to and from seventeen countries in 1781 [74]. Although Playfair was skeptical of the usefulness of his invention, it was adopted by many in the following years, including, for example, Florence Nightingale, who used bar charts in 1859 to compare mortality in the peacetime army to that of civilians and thereby convinced the government to improve army hygiene.

From all the above, it is clear that histograms were first conceived as a visual aid to statistical approximations. Even today this point is still emphasized in the common conception of histograms: Webster's defines a histogram as "a bar graph of a frequency distribution in which the widths of the bars are proportional to the classes into which the variable has been divided and the heights of the bars are proportional to the class frequencies". Histograms, however, are extremely useful even when disassociated from their canonical visual representation and treated as purely mathematical objects capturing data distribution approximations. This is precisely how we approach them in this paper.

1 To the contrary, the word 'history' is indeed part of the Greek language ('istoria' - ἱστορία) and has been in use since ancient times. Despite its similarity to 'histogram', however, it appears to have a different etymology, one that is related to the original meaning of the word, which was 'knowledge'.
2 His claims to fame include, among others, the chi-square test for statistical significance and the term 'standard deviation'.
3 In addition to the bar chart, Playfair is probably the father of the pie chart and other extremely intuitive and useful visualizations that we use today.
In the past few decades, histograms have been used in several fields of informatics. Besides databases, histograms have played a very important role primarily in image processing and computer vision. Given an image (or a video) and a visual pixel parameter, a histogram captures for each possible value of the parameter (Webster's "classes") the number of pixels that have this value (Webster's "frequencies"). Such a histogram is a summary that is characteristic of the image and can be very useful in several tasks: identifying similar images, compressing the image, and others. Color histograms are the most common in the literature, e.g., in the QBIC system [21], but several other parameters have been proposed as well, e.g., edge density, texturedness, intensity gradient, etc. [61]. In general, histograms used in image processing and computer vision are accurate. For example, a color histogram contains a separate and precise count of pixels for each possible distinct color in the image. The only element of approximation might be in the number of bits used to represent different colors: fewer bits imply that several actual colors are represented by one, which will be associated with the number of pixels that have any of the colors that are grouped together. Even this kind of approximation is not common, however.

In databases, histograms are used as a mechanism for full-fledged compression and approximation of data distributions. They first appeared in the literature and in systems in the 1980's and have been studied extensively since then at a continuously increasing rate. In this paper, we concentrate on the database notion of histograms, discuss the most important developments on the topic so far, and outline several problems that we believe are interesting and whose solution may further expand their applicability and usefulness.
2 Histogram Definitions

2.1 Data Distributions
Consider a relation R with n numeric attributes Xi (i = 1, ..., n). The value set Vi of attribute Xi is the set of values of Xi that are present in R. Let Vi = {vi(k): 1 ≤ k ≤ Di}, where vi(k) < vi(j) when k < j. The spread si(k) of vi(k) is defined as si(k) = vi(k+1) - vi(k), for 1 ≤ k < Di. (We take si(Di) = 1.) The frequency fi(k) of vi(k) is the number of tuples in R with Xi = vi(k). The area ai(k) of vi(k) is defined as ai(k) = fi(k) × si(k).
The data distribution of Xi is the set of pairs Ti = {(vi(1), fi(1)), (vi(2), fi(2)), ..., (vi(Di), fi(Di))}.
The joint frequency f(k1, ..., kn) of the value combination ⟨v1(k1), ..., vn(kn)⟩ is the number of tuples in R that contain vi(ki) in attribute Xi, for all i. The joint data distribution T1,...,n of X1, ..., Xn is the entire set of (value combination, joint frequency) pairs.
In the sequel, for 1-dimensional cases, we use the above symbols without the subscript i.
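To make these definitions concrete, the following minimal Python sketch (our own illustration on made-up data; the function name and the sample column are not from the paper) computes the value set, spreads, frequencies, and areas of a single attribute.

```python
from collections import Counter

def data_distribution(column):
    """Value set, spreads, frequencies, and areas of one attribute (1-d case)."""
    freq = Counter(column)                       # f(k): number of tuples with X = v(k)
    values = sorted(freq)                        # value set V = {v(1) < ... < v(D)}
    spreads = [values[k + 1] - values[k]         # s(k) = v(k+1) - v(k)
               for k in range(len(values) - 1)] + [1]    # s(D) = 1 by convention
    freqs = [freq[v] for v in values]
    areas = [f * s for f, s in zip(freqs, spreads)]       # a(k) = f(k) * s(k)
    return values, spreads, freqs, areas

V, S, F, A = data_distribution([10, 10, 12, 15, 15, 15, 20])
print(V, S, F, A)
# [10, 12, 15, 20] [2, 3, 5, 1] [2, 1, 3, 1] [4, 3, 15, 1]
```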
2.2     Motivation for Histograms
Data distributions are very useful in database systems but are usually too large to be stored accurately, so histograms come into play as an approximation mechanism. The two most important applications of histogram techniques in databases have been selectivity estimation and approximate query answering, the former within query optimization and both in pre-execution user-level query feedback. Our discussion below focuses on exactly these two, especially range-query selectivity estimation, as this is the most popular issue in the literature. It should not be forgotten, however, that histograms have proved to be useful in the context of several other database problems as well, e.g., load-balancing in parallel join query execution [65], partition-based temporal join execution [68], and others.
2.3    Histograms
A histogram on an attribute X is constructed by partitioning the data distribution of X into β (≥ 1) mutually disjoint subsets called buckets and approximating the frequencies and values in each bucket in some common fashion. This definition leaves several degrees of freedom in designing specific histogram classes, as there are several possible choices for each of the following (mostly orthogonal) aspects of histograms [67]:
Partition Rule: This is further analyzed into the following characteristics:
•  Partition Class: This indicates if there are any restrictions on the buckets. Of great importance is the serial class, which requires that buckets are non-overlapping with respect to some parameter (the next characteristic), and its subclass end-biased, which requires at most one non-singleton bucket.
•  Sort Parameter: This is a parameter whose value for each element in the data distribution is derived from the corresponding attribute value and frequencies. All serial histograms require that the sort parameter values in each bucket form a contiguous range. Attribute value (V), frequency (F), and area (A) are examples of sort parameters that have been discussed in the literature.
•  Source Parameter: This captures the property of the data distribution that is the most critical in an estimation problem and is used in conjunction with the next characteristic in identifying a unique partitioning. Spread (S), frequency (F), and area (A) are the most commonly used source parameters.
•  Partition Constraint: This is a mathematical constraint on the source parameter that uniquely identifies a single histogram within its partition class. Several partition constraints have been proposed so far, e.g., equi-sum, v-optimal, maxdiff, and compressed, which are defined further below as they are introduced. Many of the more successful ones try to avoid grouping vastly different source parameter values into a bucket.

Following [67], we use p(s,u) to denote a serial histogram class with partition constraint p, sort parameter s, and source parameter u.
Construction Algorithm: Given a particular partition rule, this is the algorithm that constructs histograms that satisfy the rule. It is often the case that, for the same histogram class, there are several construction algorithms with different efficiency.
Value Approximation: This captures how attribute values are approximated within a bucket, which is independent of the partition rule of a histogram. The most common alternatives are the continuous value assumption and the uniform spread assumption; both assume values uniformly placed in the range covered by the bucket, with the former ignoring the number of these values and the latter recording that number inside the bucket. A short sketch after the list of characteristics below illustrates the uniform spread assumption together with the frequency approximation described next.
Frequency Approximation: This captures how frequencies are approximated within a bucket. The dominant approach is making the uniform distribution assumption, where the frequencies of all elements in the bucket are assumed to be the same and equal to the average of the actual frequencies.
Error Guarantees: These are upper bounds on the errors of the estimates a histogram generates, which are provided based on information that the his­togram maintains.
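To make the value and frequency approximations concrete, here is a minimal sketch (our own illustration, not taken from the paper or any system) of how a single bucket answers a range selectivity query under the uniform spread and uniform distribution assumptions.

```python
def bucket_range_estimate(lo, hi, num_values, total_freq, query_lo, query_hi):
    """Estimate how many tuples of one bucket fall in [query_lo, query_hi].

    Uniform spread assumption: num_values values are placed at equal
    distances between lo and hi.  Uniform distribution assumption: each
    of them carries total_freq / num_values tuples.
    """
    if num_values == 1:
        approx_values = [lo]
    else:
        step = (hi - lo) / (num_values - 1)
        approx_values = [lo + k * step for k in range(num_values)]
    avg_freq = total_freq / num_values
    covered = sum(1 for v in approx_values if query_lo <= v <= query_hi)
    return covered * avg_freq

# Bucket covering [10, 50], recorded as holding 5 distinct values and 200 tuples
print(bucket_range_estimate(10, 50, 5, 200, 15, 35))
# 2 of the approximate values {10, 20, 30, 40, 50} are covered -> 80.0
```

Under the continuous value assumption, the same bucket would instead estimate the covered fraction directly from the overlap of the query range with [lo, hi], since the number of distinct values is not recorded.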
A multi-dimensional histogram on a set of attributes is constructed by partitioning the joint data distribution of the attributes. Such histograms have the exact same characteristics as 1-dimensional histograms, except that the partition rule needs to be more intricate and cannot always be clearly analyzed into the four other characteristics as before; e.g., there is no real sort parameter in this case, as there can be no ordering in multiple dimensions [66].
3 The Past of Histograms

First Appearance
To the best of our knowledge, the first proposal to use histograms to approximate data distributions within a database system was in Kooi's PhD thesis [47]. His
proposal was an immediate loan from statistics of the simplest form of histogram, with the value set being divided into ranges of equal length, i.e., the so-called equi-width histograms. Hence, in terms of the taxonomy of Section 2.3, the entry point for histograms into the world of databases was the serial class of equi-sum(V,S), where the equi-sum partition constraint requires that the sums of the source-parameter values (spreads in this case) in each bucket are equal. Within each bucket, values and frequencies were approximated based on the continuous value assumption and the uniform distribution assumption, respectively.

Equi-width histograms represented a dramatic improvement over the uniform distribution assumption for the entire value set (i.e., essentially a single-bucket histogram), which was the state of the practice at the time. Hence, they were quickly adopted by the Ingres DBMS in its commercial version, and later on by other DBMSs as well.
First Alternative
A few years after Kooi's thesis, the first alternative histogram was proposed, changing only the source parameter [62]. Instead of having buckets of equal-size ranges, the new proposal called for buckets with (roughly) the same number of tuples in each one, i.e., the so-called equi-depth or equi-height histograms. In terms of the taxonomy, these are the equi-sum(V,F) histograms. There was ample evidence that equi-depth histograms were considerably more effective than equi-width histograms; hence, many commercial vendors switched to them in the years following their introduction. Equi-depth histograms were later presented in their multi-dimensional form as well [58].
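The contrast between the two constraints can be seen in a small sketch (our own, with illustrative data): equi-width cuts the value domain into β ranges of equal length, whereas equi-depth places boundaries at quantiles so that each bucket covers roughly the same number of tuples.

```python
def equi_width(values, beta):
    """Equi-sum(V,S): beta bucket ranges of equal length over [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / beta
    return [(lo + i * width, lo + (i + 1) * width) for i in range(beta)]

def equi_depth(column, beta):
    """Equi-sum(V,F): boundaries near the (i/beta)-quantiles, so each bucket
    covers roughly the same number of tuples."""
    data = sorted(column)
    n = len(data)
    cuts = [data[min(round(i * n / beta), n - 1)] for i in range(beta + 1)]
    return list(zip(cuts, cuts[1:]))

col = [1, 2, 3, 4, 5, 6, 7, 50, 90, 100]
print(equi_width(col, 3))   # [(1.0, 34.0), (34.0, 67.0), (67.0, 100.0)]
print(equi_depth(col, 3))   # [(1, 4), (4, 50), (50, 100)]
```

On skewed data such as the column above, equi-width leaves most tuples crowded into one range, which is exactly the weakness that equi-depth addresses.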
Optimal Sort Parameter
After several years of inactivity on the topic of histograms, interest in it was renewed in the context of studying how initial errors in statistics maintained by the database propagate in estimates of the size of complex query results [36]. In particular, it was shown that, under some rather general conditions, in the worst case, errors propagate exponentially in the query size (i.e., in the number of joins), removing any hope for high-quality estimates for large multi-join queries. The first results that led towards new types of histograms were derived in an effort to obtain statistics that would be optimal in minimizing/containing the propagation of errors in the size of join results [37]. The basic mathematical tools used were borrowed from majorization theory [55]. The focus was on a rather restricted class of equality join queries, i.e., single-join queries or multi-join queries with only one attribute participating in joins per relation (more generally, with a 1-1 functional dependency between each pair of join attributes of each relation). For this query class, and under the assumption that the value set is known accurately, it was formally proved that the optimal histogram was serial and had frequency as the sort parameter4.

4 These were called simply serial histograms at the time, but the term was later generalized to imply non-overlapping ranges of any sort parameter, not just frequency, which is how we use the term in this paper as well.
Ten years ago
The above result might not have had the impact it did if it had remained true only for the restricted query class for which it was first proved. Soon afterwards, however, in VLDB'93, it was generalized to arbitrary equality join queries, giving a strong indication that the most effective histograms may be very different from those that were used until that point [34].

To the best of our knowledge, histograms with frequency as the sort parameter represented the first departure from value-based grouping of buckets, not only within the area of databases, but overall within mathematics and statistics as well. Furthermore, their introduction essentially generalized some common practices that were already in use in commercial systems (e.g., in DB2), where the highest-frequency values were maintained individually and accurately due to their significant contribution to selectivity estimates. Such a practice is an instance of a special case of a histogram in the end-biased partition class, with frequency as the sort parameter: the highest sort-parameter values are maintained in singleton buckets. Although less accurate than general serial histograms, in several cases, end-biased histograms proved quite effective.
New Partition Constraints
The results on the optimality of frequency as the sort parameter left open two important questions. First, which partition constraints are the most effective, i.e., which ones among all possible frequency-based bucketizations? Second, which histograms are optimal when the value set is not accurately maintained but is approximated in some fashion?

The answer to the first question came in the form of the v-optimal histograms, which partition the data distribution so that (roughly) the variance of source-parameter values within each bucket is minimized [38].
Unfortunately, the second question had no analytical answer, but extensive experimentation led to the formation of the space of histogram characteristics that we use as the basic framework for our discussion in this paper (Section 2.3) [67]. In addition to the equi-sum and v-optimal partition constraints, it introduced several possible new ones as well, which, similarly to v-optimal, had as a goal to avoid grouping vastly different source-parameter values together in the same bucket. Among them, we distinguish maxdiff, which places bucket boundaries between adjacent source-parameter values (in sort-parameter order) whose differences are among the largest, and compressed, which puts the highest source values in singleton buckets and partitions the rest in equi-sum fashion. Overall, the new partition constraints (i.e., v-optimal, maxdiff, compressed) were shown to be the most effective in curbing query-result-size estimation errors.
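Of these, maxdiff has a particularly simple construction. The sketch below (our own, with illustrative names and data) assumes the source-parameter values are already listed in sort-parameter order and places the β-1 bucket boundaries at the largest adjacent differences.

```python
def maxdiff_buckets(source_values, beta):
    """Place bucket boundaries after the beta-1 largest adjacent differences.

    source_values must already be in sort-parameter order (e.g., frequencies
    listed in attribute-value order for a maxdiff(V,F) histogram).
    """
    diffs = [(abs(source_values[i + 1] - source_values[i]), i)
             for i in range(len(source_values) - 1)]
    cut_after = sorted(i for _, i in sorted(diffs, reverse=True)[:beta - 1])
    buckets, start = [], 0
    for i in cut_after:
        buckets.append(source_values[start:i + 1])
        start = i + 1
    buckets.append(source_values[start:])
    return buckets

# frequencies in attribute-value order; beta = 3 buckets
print(maxdiff_buckets([2, 3, 3, 40, 41, 5, 6], 3))
# [[2, 3, 3], [40, 41], [5, 6]]  -- cuts at the two largest jumps (3->40 and 41->5)
```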
The same effort pointed towards several possibilities for the sort and source parameters, i.e., value, spread, frequency, area, cumulative frequency, etc., with frequency and area being the best source parameters. Interestingly, the best sort parameter proved to be the value and not the frequency, as the original optimality results would suggest, indicating that, if values are not known accurately, having buckets with overlapping value ranges does not pay off for range queries.

The most effective of these histograms have actually been adopted by industrial products (see Section 4). Furthermore, in addition to selectivity estimation for various relational and non-relational queries, these histograms have proved to be very effective in approximate query answering as well [39].
Since the specification of the above space of histograms, there have been several efforts that have studied one or more of its characteristics and have proposed alternative, improved approaches. For each characteristic, we outline some of the most notable pieces of work on it in a separate subsection below. Unless explicitly mentioned otherwise, the discussion is about 1-dimensional histograms.
Alternative Partition Constraints
In addition to the partition constraints that were introduced as part of the original histogram framework [67], a few more have been proposed that attempt to approach the effectiveness of v-optimal, usually having a more efficient construction cost. Among them, we note one that uses a simplified form of the optimal knot placement problem [18] to identify the bucket boundaries, which are where the 'knots' are placed [46]. The simplification consists of using only linear splines that are also allowed to be discontinuous across bucket boundaries. This is combined with interesting alternatives on the value and frequency approximation within each bucket.
Multi-Dimensional Partition Rules
The first introduction of multi-dimensional histograms was by Muralikrishna and DeWitt [58], who essentially described 2-dimensional equi-depth histograms. Space was divided in the same way it is done in a Grid-file, i.e., recursively cutting the entire space into half-spaces by using a value of one of the dimensions as a boundary each time, the dimension and the value being chosen in a way prespecified at the beginning of the process [58]. Buckets were non-overlapping (the multi-dimensional version of the serial partition class) on the space of the multi-dimensional values (the multi-dimensional version of value as the sort parameter), with the boundaries chosen with equi-sum as the partition constraint and frequency as the source parameter.
It was not until several years later that any new partition rules were proposed [66], this time taking advantage of the generality of the histogram taxonomy [67]. The most effective family of such rules was MHIST-2, which starts from the entire joint data distribution placed in a single bucket and, at each step, splits the space captured by one of the buckets it has formed into two subspaces, until it has exhausted its budget of buckets. The split is made in the bucket and along the dimension that is characterized as most "critical", i.e., whose marginal distribution is the most in need of partitioning, based on the (1-dimensional) partition constraint and source parameter used. In combination with the most effective partition constraints and source parameters (i.e., v-optimal or maxdiff with frequency or area), MHIST-2 represented a dramatic improvement over the original multi-dimensional equi-depth histograms.
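The following is a much-simplified sketch in the spirit of MHIST-2, assuming the joint data distribution fits in memory as a dictionary and using the largest adjacent difference (maxdiff) in a bucket's marginal frequencies as the measure of how "critical" it is; the actual algorithm in [66] is considerably more refined.

```python
def marginal(bucket, dim):
    """Marginal frequency distribution of one bucket along one dimension."""
    m = {}
    for point, f in bucket:
        m[point[dim]] = m.get(point[dim], 0) + f
    return sorted(m.items())

def best_split(bucket):
    """(gap, dimension, split value) with the largest adjacent marginal difference."""
    best = None
    for d in range(len(bucket[0][0])):
        marg = marginal(bucket, d)
        for (v1, f1), (v2, f2) in zip(marg, marg[1:]):
            if best is None or abs(f2 - f1) > best[0]:
                best = (abs(f2 - f1), d, v2)      # split into coords < v2 and >= v2
    return best

def mhist2_sketch(joint, beta):
    """Split the most 'critical' bucket in two until beta buckets exist."""
    buckets = [list(joint.items())]               # joint: {value tuple: joint frequency}
    while len(buckets) < beta:
        candidates = [(best_split(b), i) for i, b in enumerate(buckets) if len(b) > 1]
        if not candidates:
            break
        (gap, d, v), i = max(candidates)
        left = [it for it in buckets[i] if it[0][d] < v]
        right = [it for it in buckets[i] if it[0][d] >= v]
        buckets[i:i + 1] = [left, right]
    return buckets

joint = {(1, 1): 5, (1, 2): 6, (2, 1): 5, (8, 9): 90, (9, 9): 95}
for b in mhist2_sketch(joint, 3):
    print(b)   # the two heavy points end up separated from the three light ones
```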
Since MHIST, several other interesting partition rules have been proposed. One of them is GENHIST [31], which was originally proposed in the context of multi-dimensional real-valued data, but its applicability is broader. The main characteristic of GENHIST is that it allows buckets to overlap in the space of multi-dimensional values: the algorithm starts from a uniform grid partitioning of the space and then iteratively enlarges the buckets that contain high numbers of data elements. This has two effects: first, the density of data in each bucket decreases, thus making the overall density smoother; second, the buckets end up overlapping, thus creating many more distinct areas than there are buckets per se. The data distribution approximation within each area is a combination of what all the overlapping buckets that form the area indicate. This results in a small number of buckets producing approximations with low errors.

Another alternative is the STHoles histogram [11], which takes, in some sense, a dual approach to GENHIST: instead of the region covered by a bucket increasing in size and overlapping with other buckets, in STHoles this region may decrease in size due to the removal of a piece of it (i.e., opening a hole) that forms a separate, child bucket. This creates buckets that are not solid rectangles and is therefore capable of capturing quite irregular data distributions.
Identifying effective multi-dimensional partition rules is by no means a closed problem, with different approaches being proposed continuously [23].
Value Approximation Within Each Bucket
Given a specific amount of space for a histogram, one of the main tradeoffs is the number of buckets ver­sus the amount of information kept in each bucket.
A small amount of information within each bucket im­plies gross local approximations but also more buckets. Finding the right balance in this tradeoff to optimize the overall approximation of the data distribution is a key question.
With respect to approximating the set of values that fall in a 1-dimensional bucket, there have been essen­tially two approaches. Under the traditional continu­ous value assumption, one maintains the least amount of information (just the min and max value), but noth­ing that would give some indication of how many val­ues there are or where they might be. Under the more recent uniform spread assumption [67], one also main­tains the number of values within each bucket and approximates the actual value set by the set that is formed by (virtually) placing the same number of val­ues at equal distances between the min and max value. A different version of that has also been proposed that does not record the actual average spread within a bucket but one that reduces the overall approximation error in range queries by taking into account the pop­ularity of particular ranges within each bucket [46]. There have been several studies that show each gen­eral technique superior to the other, an indication that there may be no universal winner.
The two main approaches mentioned above have been extended to multi-dimensional buckets as well, maintaining the min and max value of each dimension in the bucket. Under the continuous value assumption nothing more is required, but under the uniform spread assumption, the problem arises of which distinct (multi-dimensional) values are assumed to exist in the bucket. If di is the number of distinct values of attribute Xi that are present in a bucket and v'i(k) is the k-th approximate value in dimension i (obtained by applying the uniform spread assumption along that dimension), then a reasonable approach is to assume that all possible combinations ⟨v'1(k1), ..., v'n(kn)⟩, 1 ≤ ki ≤ di, exist in the bucket [66].
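A minimal sketch of that approach (our own illustration): the uniform spread assumption is applied independently in each dimension, and the bucket's approximate value combinations are the cross product of the per-dimension value sets.

```python
from itertools import product

def approximate_md_values(mins, maxs, counts):
    """Approximate the distinct value combinations inside one multi-dimensional
    bucket: uniform spread per dimension, then the cross product of the
    resulting 1-dimensional value sets."""
    per_dim = []
    for lo, hi, d in zip(mins, maxs, counts):
        if d == 1:
            per_dim.append([lo])
        else:
            step = (hi - lo) / (d - 1)
            per_dim.append([lo + k * step for k in range(d)])
    return list(product(*per_dim))

# a 2-d bucket spanning [0, 10] x [100, 300] with 3 and 2 recorded distinct values
print(approximate_md_values((0, 100), (10, 300), (3, 2)))
# [(0.0, 100.0), (0.0, 300.0), (5.0, 100.0), (5.0, 300.0), (10.0, 100.0), (10.0, 300.0)]
```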
There has also been an interesting effort that in­troduces the use of kernel estimation into the 1-dimensional histogram world [10] to deal specifically with real-valued data. Roughly, it suggests choosing the points of considerable change in the probability density function as the bucket boundaries (in a spirit similar to the maxdiff partition constraint) and then applying the traditional kernel estimation method for approximating the values within each bucket. This has also been generalized for the multi-dimensional case [31].
Frequency Approximation Within Each Bucket
With respect to approximating the set of frequencies that fall in a bucket, almost all efforts deal with the traditional uniform distribution assumption. Among the few exceptions is one that is combined with the linear spline partition constraint mentioned above and
uses a linear spline-based approximation for frequen­cies as well [46]. It records one additional data item per bucket to capture linearly growing or shrinking frequencies at the expense of fewer buckets for a fixed space budget. Likewise, another exception uses equally small additional space within each bucket to store cumulative frequencies in a 4-level tree index [13]. Contrary to the previous effort, however, it is combined with some of the established partition con­straints, i.e., v-optimal and maxdiff.
Efficient and Dynamic Constructions
Although estimation effectiveness is probably the most important property of histograms (or any other com­pression/estimation method for that matter), con­struction cost is also a concern. With respect to this aspect, histograms may be divided into two categories: static histograms and dynamic/adaptive histograms.
Static histograms are those that are traditionally used in database systems: after they are constructed (from the stored data or a sample of it), they remain unchanged even if the original data gets updated. Depending on the details of the updates, a static histogram eventually drifts away from what it is supposed to approximate, and the estimations it produces may suffer from increasingly larger errors. When this happens, the administrators ask for a recalculation, at which point the old histogram is discarded and a new one is calculated afresh. An important consideration for static histograms is the cost of each calculation itself, which is mostly affected by the partition constraint. Most such constraints (e.g., equi-sum, maxdiff, compressed) have straightforward calculations that are efficient. This is not the case, however, for what has been shown to be the most effective constraint, i.e., v-optimal, whose straightforward calculation is in general exponential in the number of source-parameter values. A key contribution in this direction has been the proposal of a dynamic-programming based algorithm that identifies the v-optimal histogram (for any sort and source parameter) in time that is quadratic in the number of source-parameter values and linear in the number of buckets, thus making these histograms practical as well [42]. Subsequently, several (mostly theoretical) efforts have introduced algorithms that have reduced the required running time for calculating these optimal histograms, eventually bringing it down to linear overall and achieving similar improvements for the required space as well [30]. Dynamic-programming algorithms have also been proposed for constructing the optimal histograms for (hierarchical) range queries in OLAP data [44]. For the multi-dimensional case, optimal histogram identification is NP-hard, so several approximate techniques have been proposed [59].
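As an illustration of the dynamic-programming idea (a sketch in the spirit of [42], not the exact published algorithm), the following computes a v-optimal partition of a sequence of source-parameter values into β buckets by minimizing the total within-bucket sum of squared deviations from the bucket average; prefix sums make the cost of any candidate bucket available in constant time, giving the quadratic-times-linear behavior mentioned above.

```python
def v_optimal_buckets(values, beta):
    """O(len(values)^2 * beta) dynamic program for a v-optimal partition."""
    n = len(values)
    ps = [0.0] * (n + 1)    # prefix sums of values
    ps2 = [0.0] * (n + 1)   # prefix sums of squared values
    for i, v in enumerate(values):
        ps[i + 1] = ps[i] + v
        ps2[i + 1] = ps2[i] + v * v

    def sse(i, j):          # sum of squared deviations of values[i:j] from their mean
        s, s2, m = ps[j] - ps[i], ps2[j] - ps2[i], j - i
        return s2 - s * s / m

    INF = float("inf")
    cost = [[INF] * (beta + 1) for _ in range(n + 1)]
    cut = [[0] * (beta + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for k in range(1, beta + 1):
        for j in range(1, n + 1):
            for i in range(k - 1, j):
                c = cost[i][k - 1] + sse(i, j)
                if c < cost[j][k]:
                    cost[j][k], cut[j][k] = c, i
    bounds, j = [], n       # recover bucket boundaries by walking back through cuts
    for k in range(beta, 0, -1):
        bounds.append((cut[j][k], j))
        j = cut[j][k]
    return list(reversed(bounds)), cost[n][beta]

print(v_optimal_buckets([1, 1, 2, 10, 11, 30], 3))
# buckets (0, 3), (3, 5), (5, 6): groups {1, 1, 2}, {10, 11}, {30}
```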
Another interesting development has been the proposal of algorithms to identify optimal sets of histograms (as opposed to individual histograms), based on an expected workload [40]. This effort focuses on v-optimal histograms, but is equally applicable to other partition constraints as well.
Even with the existence of efficient calculation algorithms, however, static histograms suffer from increasing errors between calculations. Moreover, in a data stream environment, static histograms are not an option at all, as there is no opportunity to store the incoming data or examine it more than once. Hence, several works have proposed various approaches to dynamic/adaptive/self-tuning histograms, which change as the data gets updated, while remaining competitive with their static counterparts. Among these, we note one for equi-depth and compressed histograms [26], one for v-optimal histograms [27], and one for (linear) spline-based histograms [46]. There is also an effort focusing on data streams, where a sketch of the (joint) data distribution of the stream is maintained, from which an effective multidimensional histogram may be constructed [72]; the STHoles histogram is used for experimentation with the method, but in principle, it could be applied to other histogram classes as well.
Another approach to dynamic construction that has been examined in the past consists of query feed­back mechanisms that take into account actual sizes of query results to dynamically modify histograms so that their estimates are closer to reality. In essence, this is histogram adaptation at query time instead of at update time. The main representatives in this cat­egory are the ST-histograms [2] and their descendant STHoles histograms [11], which employ a sophisticated partition rule as well. These techniques are indepen­dent of the particular characteristics of the initial his­tograms, which may be constructed in any way, e.g., they could be equi-depth histograms. In addition to their dynamic nature, a key advantage of these ap­proaches is their low cost.
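The feedback idea can be sketched as follows (a heavily simplified illustration of the ST-histogram approach [2]; the bucket representation, the handling of partial overlap, and the damping factor are our own assumptions, not the published algorithm).

```python
def feedback_refine(buckets, q_lo, q_hi, actual, damping=0.5):
    """One self-tuning step: after observing the actual result size of a range
    query, spread the estimation error over the buckets overlapping the query
    range, in proportion to their current frequency counts."""
    overlapping = [b for b in buckets if b["hi"] >= q_lo and b["lo"] <= q_hi]
    estimate = sum(b["freq"] for b in overlapping)   # crude: ignores partial overlap
    error = actual - estimate
    total = sum(b["freq"] for b in overlapping) or 1.0
    for b in overlapping:
        b["freq"] += damping * error * (b["freq"] / total)
    return estimate

buckets = [{"lo": 0, "hi": 9, "freq": 100.0}, {"lo": 10, "hi": 19, "freq": 100.0}]
feedback_refine(buckets, 0, 19, actual=300)    # estimate 200, observed 300
print([round(b["freq"], 1) for b in buckets])  # [125.0, 125.0]
```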
The LEO system [70] generalizes these efforts as it uses result sizes of much more complicated queries to modify its statistics, including join and aggregate queries, queries with user-defined functions, and oth­ers. Interestingly, LEO does not update the statistics in place, but puts all feedback information into sepa­rate catalogs, which are used in combination with the original histograms at estimation time.
Error Guarantees
Most work on histograms deals with identifying those that exhibit low errors in some estimation problem, but not with providing, together with the estimates, some information on what those errors might be. The first work to address the issue [42] suggests storing in each bucket the maximum difference between the actual and the approximate (typically, the average) frequency of a value in the bucket and using that to
provide upper bounds on the error of any selectivity estimates produced by the histogram for equality and range selection queries. An interesting alternative fo­cuses on optimizing top-N range queries and stores ad­ditional information on a per-histogram rather than on a per-bucket basis [20].
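A tiny sketch of the per-bucket bookkeeping suggested in [42] (our own illustration): storing the maximum deviation between actual and average frequency lets the histogram return an equality-selectivity estimate together with an upper bound on its error.

```python
def bucket_with_guarantee(frequencies):
    """Summarize one bucket and remember the worst-case frequency error."""
    avg = sum(frequencies) / len(frequencies)
    max_dev = max(abs(f - avg) for f in frequencies)
    return {"avg_freq": avg, "max_error": max_dev}

b = bucket_with_guarantee([7, 9, 8, 30])       # one skewed bucket
print(f"estimate = {b['avg_freq']}, true frequency within +/- {b['max_error']}")
# estimate = 13.5, true frequency within +/- 16.5
```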
Other Data Types
As mentioned earlier, most work on histograms has focused on approximating numeric values, in one or multiple dimensions (attributes). Nevertheless, the need for approximation is much broader, and several efforts have examined the use of histograms for other data types as well.
With respect to spatial data, the canonical ap­proaches to 2-dimensional histograms do not quite work out as these are for point data and do not extend to objects that are 2-dimensional themselves. Further­more, frequency is usually not an issue in spatial data, as spatial objects are not repeated in a database. Sev­eral interesting techniques have been presented to ad­dress the additional challenges, which essentially are related to the partition rule, i.e., how the spatial ob­jects are grouped into buckets. Some form buckets by generalizing conventional histogram partition con­straints while others do it by following approaches used in spatial indices (e.g., R-trees). The MinSkew His­togram [5] is among the more sophisticated ones and divides the space by using binary partitionings (recur­sively dividing the space along one of the dimensions each time) so that the overall spatial skew of all buckets is minimized. The latter captures the variance in the density of objects within each bucket, so it follows, in some sense, the spirit of the v-optimal histograms. The SQ-Histogram [3] is an interesting alternative, divid­ing the space according to the Quad-tree rule (which is more restrictive than arbitrary binary partitionings) and, in addition to spatial proximity, taking into ac­count proximity in the size as well as the complexity (number of vertices) of the polygons that are placed in the same bucket. Both approaches are quite effective, with SQ being probably the overall winner. Spatial histograms, i.e., MinSkew histograms, have also been extended to capture the velocity of object movement, thus becoming able to approximate spatio-temporal data as well [17].
The recent interest in XML could not, of course, leave untouched XML file approximation, XML query result size estimation, and other related problems. The semi-structured nature of XML files does not lend itself to histogram-based approximation, as there is no immediate multi-dimensional space that can be bucketized but one needs to be formed from some numeric XML-file characteristics. In the StatiX ap­proach [22], information in an XML Schema is used to identify potential sources of structural skew and then 1-dimensional histograms are built for the most
problematic places in the schema, approximating the distributions of parent ids for different elements. In the XPathLearner approach [49], (first-order) Markov Histograms are used [1], where the frequencies of the results of traversing all paths of length 2 are stored in two 2-dimensional histograms. The dimensions always represent the 'from' and 'to' nodes of the paths in the XML graph; in the first histogram, both nodes/dimensions are for XML tags, whereas in the second histogram, the 'from' node/dimension is an XML tag and the 'to' node/dimension is a value. Assuming enough memory, frequencies are maintained accurately for all tag-to-tag pairs (accurate histogram), as there are very few. To the contrary, tag-to-value pairs are placed in a histogram that is based on a 2-dimensional version of the compressed partition constraint, with frequency as the source parameter. Another approach for estimating XML-query result sizes builds position histograms on a 2-dimensional space as well, only here the two dimensions are directly or indirectly related to the numbering of each node in a preorder traversal of the XML graph [79]. Finally, histograms have also been used in combination with or as parts of other data structures for XML approximations. The XSketch is quite an effective graph-based synopsis that tries to capture both the structural and the value characteristics of an XML file [63, 64]. Histograms enter the picture as they are used at various parts of an XSketch to capture statistical correlations of elements and values in particular neighborhoods of the XSketch graph.
In addition to XML graphs, histograms have also been proposed for capturing the degrees of the nodes in general graphs, as a way to compare graphs with one another and grade their similarity [60].
Unconventional Histograms
Throughout the years, there have been a few interest­ing pieces of work that do not quite follow the gen­eral histogram taxonomy or histogram problem defi­nitions. One of them suggests the use of the Discrete Cosine Transform (DCT) to compress an entire multi­dimensional histogram and store its compressed form [48]. It employs a very simple multi-dimensional parti­tion rule (a uniform grid over the entire space), divides the space into a large number of small buckets, and then compresses the bucket information using DCT. This appears to save on space but also estimation time, as it is possible to recover the necessary information through the integral of the inverse DCT function.
There is also a promising line of work that combines histograms with other techniques to produce higher-quality estimations than either technique could do alone. In addition to several such combinations with sampling, a particularly interesting technique tries to overcome the 'curse of dimensionality' by identify­ing the critical areas of dependence and independence
among dimensions in multi-dimensional data, captur­ing them with a statistical interaction model (e.g., log-linear model) which can then form the basis for lower-dimensional MHIST histograms to approximate the overall joint data distribution [19].
Finally, there is a very interesting departure from the convention that histograms are built on base rela­tions and estimations of the data distributions of in­termediate query results are obtained by appropriate manipulations of these base-relation histograms [12]. It discusses the possibility of maintaining histograms on complex query results, which proves to be quite ef­fective in some cases. This work uses the main SQL Server histograms (essentially maxdiff - see Section 4) to demonstrate the proposed approach, but the overall effort is orthogonal to the particular histogram class. As the number of potential complex query histograms is much larger than that of base-relation histograms, the corresponding database design problem of choos­ing which histograms to construct is accordingly more difficult as well. Fortunately, a workload-based algo­rithm proves adequate for the task.
4 Industrial Presence of Histograms
Histograms have not only been the subject of much research activity but have also been the favorite approximation method of all commercial DBMSs. Essentially all systems had equi-width histograms in the beginning and then eventually moved to equi-depth histograms. In this section, we briefly describe the currently adopted histogram class for three of the most popular DBMSs.
DB2 employs compressed histograms with value as the sort parameter and frequency as the source pa­rameter [50]. Users may specify the numbers of single­ton and non-singleton buckets desired for the most-frequent values and the equi-depth part of a com­pressed histogram, respectively, with the default be­ing 10 and 20. A departure from our general descrip­tions above is that DB2 stores cumulative frequencies within non-singleton buckets. Histogram construction is based on a reservoir sample of the data. DB2 ex­ploits multi-dimensional cardinality information from indices on composite attributes (whenever they are available) to obtain some approximate quantification of any dependence that may exist between the at­tributes, and uses this during selectivity estimation. Otherwise, it assumes attributes are independent. The learning capabilities of LEO [70] play a major role in how all available information is best exploited for high-quality estimation.
Oracle still employs equi-depth histograms [78]. Its basic approach to multi-dimensional selectivities is similar to that of DB2, based on exploiting any avail­able information from composite indices. In addition to that, however, it offers dynamic sampling capabil­ities to obtain on-the-fly dependence information for
rather complex predicates whenever needed (selections and single-table functions are already available, while joins will be in the next release). It also takes into account the dependencies that exist between the at­tributes of a cube's dimensions' hierarchies during roll-up and provides estimates at the appropriate hierar­chy level. Finally, the next release will employ learning techniques to remember selectivities of past predicates and use them in the future.
SQL Server employs maxdiff histograms with value as the sort parameter and essentially average frequency (within each bucket) as the source parameter [9]. It permits up to 199 buckets, storing within each bucket the frequency of the max value and (essentially) the cumulative frequency of all values less than that. His­togram construction is typically based on a sample of the data. Composite indices are used in a similar fashion as in the other systems for obtaining multi­dimensional selectivity information.
Note that all commercial DBMSs have implemented strictly 1-dimensional histograms. Except for some in­cidental indirect information, they essentially still em­ploy the attribute value independence assumption and have not ventured off to multi-dimensional histograms.
5 Competitors of Histograms
The main technique that has competed against histograms in the past decade is wavelets, a technique that is very important for image compression and was introduced into the database world in the late 90's [7, 56]. Wavelets have been used extensively for approximate answering of different query types and/or in different environments: multidimensional aggregate queries (range-sum queries) in OLAP environments [75, 76], aggregate and non-aggregate relational queries with computations directly on the stored wavelet coefficients [14], and selection and aggregate queries over streams [28]. As with histograms, there have also been efforts to devise wavelet-based techniques whose approximate query answers are provided with error guarantees [24], as well as to construct and maintain the most important wavelet coefficients dynamically [57].
Sampling is not a direct competitor to histograms, as it is mostly a runtime technique, and furthermore, the literature on sampling is extremely large, so it is impossible to analyze the corresponding highlights in the limited space of this paper. However, we should emphasize that sampling is often a complementary technique to histograms, as static (and even several forms of dynamic) histograms are usually constructed based on a sample of the original data [15, 26, 62].
There are also several specialized techniques that have been proposed and compete with histograms on specific estimation problems. These include techniques for selectivity estimation of select-join queries [71] or spatial queries [8], using query feedback to modify stored curve-fitting/parametric information for better selectivity estimation [16], selectivity estimation for alphanumeric/string data in 1-dimensional [43, 45] and multi-dimensional environments [41, 77], identification of quantiles [6, 53, 54] and their dynamic maintenance with a priori guarantees [29], approximate query answering for aggregate join queries [4], select-join queries [25], and within the general framework of on-line aggregation [33, 32, 51], computing frequencies of high-frequency items in a stream [52], and others. Despite their suboptimality compared to some of these techniques on the corresponding problems, histograms remain the method of choice, due to their overall effectiveness and wide applicability.
6 The Future of Histograms
Despite the success of histograms, there are several problems whose current solutions leave enough space for significant improvement and several others that re­main wide open, whose solution would make the ap­plicability of histograms much wider and/or their ef­fectiveness higher. We have recently discussed various problems of both types [35], some addressing specific histogram characteristics from the existing taxonomy while others being cast in a slightly more general con­text. In this section, we focus on three of the open problems, those that we believe are the most promis­ing and sense as being the furthest away from any past or current work that we are aware of.
Histogram Techniques and Clustering
Abstracting away the details of the problem of histogram-based approximation, one would see some striking similarities with the traditional problem of clustering [73]: the joint data distribution is parti­tioned into buckets, where each bucket contains similar elements. Similarity is defined based on some distance function that takes into account the values of the data attributes and the value of the frequency if there is any variation on it (e.g., if it is not equal to 1 for all data elements). The buckets are essentially clusters in the traditional sense, and for each one, a very short approximation of the elements that fall in it is stored. Despite the similarities, the techniques that have been developed for the two problems are in general very different, with no well-documented reasoning for many of these differences. Why can't the histogram techniques that have been developed for selectivity es­timation be used for clustering or vice versa? From another perspective, why can't the frequency in selec­tivity estimation be considered as another dimension of the joint data distribution and have the problem be considered as traditional clustering? What would the impact be of using stored approximations developed for one problem to solve another? In general, given the great variety of techniques that exist for the two prob­lems, it is crucial to obtain an understanding of the advantages and disadvantages of each one, its range of
applicability, and in general, their relative characteris­tics when mutually compared. A comprehensive study needs to be conducted that will include several more techniques than those mentioned here. The "New Jer­sey Data Reduction Report" [7] has examined many techniques and has produced a preliminary compar­ison of their applicability to different types of data. It can serve as a good starting point for verification, extrapolation, and further exploration, not only with respect to applicability, but also precise effectiveness trade-offs, efficiency of the algorithms, and other char­acteristics.
Bucket Recognition and Representation
The goal of any form of (partition-based) approxima­tion, e.g., histogram-based and traditional clustering, is to identify groups of elements so that all those within a group are similar with respect to a small number of parameters that characterize them. By storing ap­proximations of just these parameters, one is able to reconstruct an approximation of the entire group of elements with little error. Note that, in the terms of the histogram taxonomy, these parameters should be chosen as the source parameter(s), to satisfy the proximity-expressing partition constraint.
How do we know with respect to which parameters elements are similar, so that we can group them together and represent them in those terms? This is a typical question for traditional pattern recognition [73], where, before applying any clustering techniques, there is an earlier stage in which the appropriate dimensions of the elements are chosen among a great number of possibilities. There are several techniques that make such a choice, with varying success depending on the case.
It is important, however, to emphasize that, in prin­ciple, these parameters may not necessarily be among the original dimensions of the data elements presented in the problem but may be derivatives of them. For example, in several histogram-based approximations as we have described them above, proximity is sought directly for frequencies but not for attribute values, as attention there is on their spreads. (Recall also the success of area as a source parameter, which is the product of frequency with spread.) The frequencies in a bucket are assumed constant and require a smaller amount of information to be stored for their approxi­mation than the attribute values, which are assumed to follow a linear rule (equal spread). Hence, con­ventional histogram-based approximation, under the uniform distribution and uniform spread assumptions, implies clustering in the derived space of frequency and spread. In principle, however, not all data distri­butions are served best with such an approach.
To increase the accuracy of histogram approxima­tions, there should be no fixed, predefined approxi­mation approach to the value dimensions and the fre­quencies. It should not necessarily even be the same
for different buckets. Histograms should be flexible enough to use the optimal approximation for each di­mension in each bucket, one that would produce the best estimations for the least amount of information. Identifying what that optimal approximation is, is a hard problem and requires further investigation.
Histograms and Tree Indices
The fact that there is a close relationship between ap­proximate statistics kept in databases, especially his­tograms, and indices has been recognized in the past in several works [7]. If one considers the root of a B+ tree, the values that appear in it essentially partition the attribute on which it is built into buckets with the corresponding borders. Each bucket is then further subdivided into smaller buckets by the nodes of the subsequent level of the tree. One can imagine storing the appropriate information next to each bucket spec­ified in a node, hence transforming the node into a histogram, and the entire index into a so called hi­erarchical histogram. This may adversely affect in­dex search performance, of course, as it would reduce the out-degree of the node, possibly making the tree deeper. Nevertheless, although this idea works against the main functionality of an index, its benefits are non-negligible as well, so it has even been incorporated into some systems.
We believe that hierarchical histograms and, in gen­eral, the interaction between approximation structures and indices should be investigated further, as there are several interesting issues that remain unexplored as analyzed below. Consider again a B+ tree whose nodes are completely full. In that case, the root of the tree specifies a bucketization of the attribute domain that corresponds to an equi-depth histogram, i.e., each bucket contains roughly an equal number of elements under it. Similarly, any node in the tree specifies an equi-depth bucketization of the range of values it leads to.
The main issue with B+ trees being turned into hierarchical equi-depth histograms is that the latter are far from optimal overall on selectivity estimation [67]. Histograms like v-optimal and maxdiff are much more effective. What kind of indices would one get if each node represented bucketizations following one of these rules? Clearly, the trees would be unbalanced. This would make traditional search less efficient on the average. On the other hand, other forms of searches would be served more effectively. In particular, in a system that provides approximate answers to queries, the root of such a tree would provide a higher-quality answer than the root of the corresponding B+ tree. Furthermore, the system may move in a progressive fashion, traversing the tree as usual and providing a series of answers that are continuously improving in quality, eventually reaching the leaves and the final, accurate result.
Returning to precise query answering, note that indices are typically built assuming that all values or ranges of values are equally important. Hence, having a balanced tree becomes crucial. There are often cases, however, where different values have different importance and different frequency in the expected workloads [46]. If this query frequency or some other such parameter is used in conjunction with advanced histogram bucketization rules, some very interesting trees would be generated, whose average search performance might be much better than that of the B+ tree.
From the above, it is clear that the interaction be­tween histograms and indices presents opportunities but also several technical challenges that need to be investigated. The trade-off between hierarchical his­tograms that are balanced trees with equi-depth buck­etization and those that are unbalanced with more ad­vanced bucketizations requires special attention. The possibility of some completely new structures that would strike even better trade-offs, combining the best of both worlds, cannot be ruled out either.
7    Conclusions
Histograms have been very successful within the database world. The reason is that, among several ex­isting competing techniques, they probably represent the optimal point balancing the tradeoff between sim­plicity, efficiency, effectiveness, and applicability for data approximation/compression. Research-wise most of the basic problems around histograms seem to have been solved, but we believe there are still better so­lutions to be found for some of them. Moreover, as outlined in the previous section, there are some un­touched foundational problems whose solution may re­quire significant changes in our overall perspective on histograms. As much as the past ten years have been enjoyable and productive in deepening our collective understanding of histograms and applying them in the real world, we believe the next ten will be even more exciting and really look forward to them!
8    Personal History
Our personal history with histograms has been strongly influenced by Stavros Christodoulakis. It all started during the "Query Optimization Workshop", which was organized in conjunction with SIGMOD'89 in Portland, when Stavros argued that optimizing very large join queries did not make any sense, as the er­rors in the selectivity estimates would be very large after a few joins. Wanting to prove him wrong due to a personal interest in large query optimization, we started collaborating with him on the error propaga­tion problem, work that led to results that justified Stavros' fears completely [36]. During this effort, we were initiated by Stavros into the wonderful world of majorization theory, Schur functions, and all the other
mathematical tools that much of our subsequent his­togram work would be based on. Further collaboration with Stavros resulted in the identification of "serial his­tograms" and the first realization of their significance [37], which was the springboard for the VLDB'93 pa­per. For all these, we want to express our sincere grat­itude to Stavros for revealing an exciting research path that was hiding many treasures along the way.
The second person who has marked significantly our involvement with histograms is Vishy Poosala. As a PhD student at Wisconsin, Vishy took the origi­nal "serial histogram" results, dived deep into them, and pushed them in many different directions, an ef­fort that eventually led us to several interesting results that have played an important role in the success of histograms. For the long and fruitful collaboration we have had, both before and after his PhD degree, many thanks are due to Vishy as well.
Acknowledgements: For this present paper, we would like to thank Minos Garofalakis, Neoklis Polyzotis, and again Vishy Poosala for several useful suggestions and for verifying that the error in the approximation of the history of histograms presented is small.
References
[1] Aboulnaga A., Alameldeen A., Naughton J.: Estimating the Selectivity of XML Path Expressions for Internet Scale Applications. VLDB Conf. (2001) 591-600
[2] Aboulnaga A., Chaudhuri S.: Self-tuning Histograms: Building Histograms Without Looking at Data. SIGMOD Conf. (1999) 181-192
[3] Aboulnaga A., Naughton J.: Accurate Estimation of the Cost of Spatial Selections. ICDE (2000) 123-134
[4] Acharya S., Gibbons P., Poosala V., Ramaswamy S.: Join Synopses for Approximate Query Answering. SIGMOD Conf. (1998) 275-286
[5] Acharya S., Poosala V., Ramaswamy S.: Selectivity Estimation in Spatial Databases. SIGMOD Conf. (1999) 13-24
[6] Alsabti K., Ranka S., Singh V.: A One-Pass Algorithm for Accurately Estimating Quantiles for Disk-Resident Data. VLDB Conf. (1997) 346-355
[7] Barbara D., et al.: The New Jersey Data Reduction Report. Data Engineering Bulletin 20:4 (1997) 3-45
[8] Belussi A., Faloutsos C.: Estimating the Selectivity of Spatial Queries Using the 'Correlation' Fractal Dimension. VLDB Conf. (1995) 299-310
[9] Blakeley J., Kline N.: personal communication. (2003)
[10] Blohsfeld B., Korus D., Seeger B.: A Comparison of Selectivity Estimators for Range Queries on Metric Attributes. SIGMOD Conf. (1999) 239-250
[11] Bruno N., Chaudhuri S., Gravano L.: STHoles: A Multidimensional Workload-Aware Histogram. SIGMOD Conf. (2001) 294-305
[12] Bruno N., Chaudhuri S., Gravano L.: Exploiting Statistics on Query Expressions for Optimization. SIGMOD Conf. (2002) 263-274
[13] Buccafurri F., Rosaci D., Sacca D.: Improving Range Query Estimation on Histograms. ICDE (2002) 628-238
[14] Chakrabarti K., Garofalakis M., Rastogi R., Shim K.: Approximate Query Processing Using Wavelets. VLDB Journal 10:2-3 (2001) 199-223
[15] Chaudhuri S., Motwani R., Narasayya V.: Random Sampling for Histogram Construction: How Much is Enough? SIGMOD Conf. (1998) 436-447
[16] Chen C., Roussopoulos N.: Adaptive Selectivity Estimation Using Query Feedback. SIGMOD Conf. (1994) 161-172
[17] Choi Y.-J., Chung C.-W.: Selectivity Estimation for Spatio-Temporal Queries to Moving Objects. SIGMOD Conf. (2002) 440-451
[18] De Boor C.: A Practical Guide to Splines. Springer (1994)
[19] Deshpande A., Garofalakis M., Rastogi R.: Independence is Good: Dependency-Based Histogram Synopses for High-Dimensional Data. SIGMOD Conf. (2001) 199-210
[20] Donjerkovic D., Ramakrishnan R.: Probabilistic Optimization of Top N Queries. VLDB Conf. (1999) 411-422
[21] Flickner M., et al.: Query by Image and Video Content: The QBIC System. IEEE Computer 28:9 (1995) 23-32
[22] Freire J., Haritsa J., Ramanath M., Roy P., Simeon J.: StatiX: Making XML Count. SIGMOD Conf. (2002) 181-191
[23] Furtado P., Madeira H.: Summary Grids: Building Accurate Multidimensional Histograms. DASFAA Conf. (1999) 187-194
[24] Garofalakis M., Gibbons P.: Wavelet Synopses with Error Guarantees. SIGMOD Conf. (2002) 476-487
[25] Getoor L., Taskar B., Koller D.: Selectivity Estimation Using Probabilistic Models. SIGMOD Conf. (2001) 461-472
[26] Gibbons P., Matias Y., Poosala V.: Fast Incremental Maintenance of Approximate Histograms. VLDB Conf. (1997) 466-475
[27] Gilbert A., Guha S., Indyk P., Kotidis Y., Muthukrishnan S., Strauss M.: Fast, Small-Space Algorithms for Approximate Histogram Maintenance. ACM STOC (2002) 389-398
[28] Gilbert A., Kotidis Y., Muthukrishnan S., Strauss M.: Surfing Wavelets on Streams: One-Pass Summaries for Approximate Aggregate Queries. VLDB Conf. (2001) 79-88
[29] Gilbert A., Kotidis Y., Muthukrishnan S., Strauss M.: How to Summarize the Universe: Dynamic Maintenance of Quantiles. VLDB Conf. (2002) 454-465
[30] Guha S., Indyk P., Muthukrishnan S., Strauss M.: Histogramming Data Streams with Fast Per-Item Processing. ICALP Conf. (2002) 681-692
[31] Gunopulos D., Kollios G., Tsotras V., Domeniconi C.: Approximating Multi-Dimensional Aggregate Range Queries Over Real Attributes. SIGMOD Conf. (2000) 463-474
[32] Haas P., Hellerstein J.: Ripple Joins for Online Aggregation. SIGMOD Conf. (1999) 287-298
[33] Hellerstein J., Haas P., Wang H.: Online Aggregation. SIGMOD Conf. (1997) 171-182
[34] Ioannidis Y.: Universality of Serial Histograms. VLDB Conf. (1993) 256-267
[35] Ioannidis Y.: Approximations in Database Systems. ICDT (2003) 16-30
[36] Ioannidis Y., Christodoulakis S.: On the Propagation of Errors in the Size of Join Results. SIGMOD Conf. (1991) 268-277
[37] Ioannidis Y., Christodoulakis S.: Optimal Histograms for Limiting Worst-Case Error Propagation in the Size of Join Results. ACM TODS 18:4 (1993) 709-748
[38] Ioannidis Y., Poosala V.: Balancing Histogram Optimality and Practicality for Query Result Size Estimation. SIGMOD Conf. (1995) 233-244
[39] Ioannidis Y., Poosala V.: Histogram-Based Approximation of Set-Valued Query-Answers. VLDB Conf. (1999) 174-185
[40] Jagadish H. V., Jin H., Ooi B. C., Tan K.-L.: Global Optimization of Histograms. SIGMOD Conf. (2001) 223-234
[41] Jagadish H. V., Kapitskaia O., Ng R., Srivastava D.: Multi-Dimensional Substring Selectivity Estimation. VLDB Conf. (1999) 387-398
[42] Jagadish H. V., Koudas N., Muthukrishnan S., Poosala V., Sevcik K., Suel T.: Optimal Histograms with Quality Guarantees. VLDB Conf. (1998) 275-286
[43] Jagadish H. V., Ng R., Srivastava D.: Substring Selectivity Estimation. PODS Symposium (1999) 249-260
[44] Koudas N., Muthukrishnan S., Srivastava D.: Optimal Histograms for Hierarchical Range Queries. PODS Symposium (2000) 196-204
[45] Krishnan P., Vitter J., Iyer B.: Estimating Alphanumeric Selectivity in the Presence of Wildcards. SIGMOD Conf. (1996) 282-293
[46] Konig A., Weikum G.: Combining Histograms and Parametric Curve Fitting for Feedback-Driven Query Result-Size Estimation. VLDB Conf. (1999) 423-434
[47] Kooi R.: The Optimization of Queries in Relational Databases. PhD Thesis, Case Western Reserve University (1980)
[48] Lee J.-H., Kim D.-H., Chung C.-W.: Multi-Dimensional Selectivity Estimation Using Compressed Histogram Information. SIGMOD Conf. (1999) 205-214
[49] Lim L., Wang M., Padmanabhan S., Vitter J., Parr R.: XPathLearner: An On-Line Self-Tuning Markov Histogram for XML Path Selectivity Estimation. VLDB Conf. (2002) 442-453
[50] Lohman G.: personal communication. (2003)
[51] Luo G., Ellmann C., Haas P., Naughton J.: A Scalable Hash Ripple Join Algorithm. SIGMOD Conf. (2002) 252-262
[52] Manku G. S., Motwani R.: Approximate Frequency Counts over Data Streams. VLDB Conf. (2002) 346-357
[53] Manku G. S., Rajagopalan S., Lindsay B.: Approximate Medians and Other Quantiles in One Pass and with Limited Memory. SIGMOD Conf. (1998) 426-435
[54] Manku G. S., Rajagopalan S., Lindsay B.: Random Sampling Techniques for Space Efficient Online Computation of Order Statistics of Large Datasets. SIGMOD Conf. (1999) 251-262
[55] Marshall A., Olkin I.: Inequalities: Theory of Majorization and its Applications. Academic Press (1986)
[56] Matias Y., Vitter J., Wang M.: Wavelet-Based Histograms for Selectivity Estimation. SIGMOD Conf. (1998) 448-459
[57] Matias Y., Vitter J., Wang M.: Dynamic Maintenance of Wavelet-Based Histograms. VLDB Conf. (2000) 101-110
[58] Muralikrishna M., DeWitt D.: Equi-Depth Histograms for Estimating Selectivity Factors for Multi-Dimensional Queries. SIGMOD Conf. (1988) 28-36
[59] Muthukrishnan S., Poosala V., Suel T.: On Rectangular Partitionings in Two Dimensions: Algorithms, Complexity, and Applications. ICDT (1999) 236-256
[60] Papadopoulos A., Manolopoulos Y.: Structure-Based Similarity Search with Graph Histograms. DEXA Workshop (1999) 174-179
[61] Pass G., Zabih R.: Comparing Images Using Joint Histograms. Multimedia Systems 7 (1999) 234-240
[62] Piatetsky-Shapiro G., Connell C.: Accurate Estimation of the Number of Tuples Satisfying a Condition. SIGMOD Conf. (1984) 256-276
[63] Polyzotis N., Garofalakis M.: Statistical Synopses for Graph-Structured XML Databases. SIGMOD Conf. (2002) 358-369
[64] Polyzotis N., Garofalakis M.: Structure and Value Synopses for XML Data Graphs. VLDB Conf. (2002) 466-477
[65] Poosala V., Ioannidis Y.: Estimation of Query-Result Distribution and its Application in Parallel-Join Load Balancing. VLDB Conf. (1996) 448-459
[66] Poosala V., Ioannidis Y.: Selectivity Estimation Without the Attribute Value Independence Assumption. VLDB Conf. (1997) 486-495
[67] Poosala V., Ioannidis Y., Haas P., Shekita E.: Improved Histograms for Selectivity Estimation of Range Predicates. SIGMOD Conf. (1996) 294-305
[68] Sitzmann I., Stuckey P.: Improving Temporal Joins Using Histograms. DEXA Conf. (2000) 488-498
[69] Stigler S.: The History of Statistics: The Measurement of Uncertainty before 1900. Harvard University Press (1986)
[70] Stillger M., Lohman G., Markl V., Kandil M.: LEO - DB2's LEarning Optimizer. VLDB Conf. (2001) 19-28
[71] Sun W., Ling Y., Rishe N., Deng Y.: An Instant and Accurate Size Estimation Method for Joins and Selection in a Retrieval-Intensive Environment. SIGMOD Conf. (1993) 79-88
[72] Thaper N., Guha S., Indyk P., Koudas N.: Dynamic Multidimensional Histograms. SIGMOD Conf. (2002) 428-439
[73] Theodoridis S., Koutroumbas K.: Pattern Recognition. Academic Press, 2nd edition (2003)
[74] Tufte E.: The Visual Display of Quantitative Information. Graphics Press (1983)
[75] Vitter J., Wang M.: Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets. SIGMOD Conf. (1999) 193-204
[76] Vitter J., Wang M., Iyer B.: Data Cube Approximation and Histograms via Wavelets. CIKM Conf. (1998) 96-104
[77] Wang M., Vitter J., Iyer B.: Selectivity Estimation in the Presence of Alphanumeric Correlations. ICDE (1997) 169-180
[78] Witkowski A.: personal communication. (2003)
[79] Wu Y., Patel J., Jagadish H. V.: Using Histograms to Estimate Answer Sizes for XML Queries. Information Systems 28:1-2 (2003) 33-59