Snakes and sandwiches: Optimal clustering strategies for a data warehouse.
H. V. Jagadish, Laks V. S. Lakshmanan and Divesh Srivastava.
Physical layout of data is a crucial determinant of performance
in a data warehouse. The optimal clustering of data on disk,
for minimizing expected I/O, depends on the query workload.
In practice, we often have a reasonable sense of the likelihood
of different classes of queries, e.g., 40% of the queries
concern calls made from some specific telephone number in some
month. In this paper, we address the problem of finding an
optimal clustering of records of a fact table on disk, given
an expected workload in the form of a probability distribution
over query classes.
Attributes in a data warehouse fact table typically have
hierarchies defined on them (by means of auxiliary dimension
tables). The product of the dimensional hierarchy levels forms
a lattice and leads to a natural notion of query classes.
Optimal clustering in this context is a combinatorially
explosive problem with a huge search space (doubly exponential
in number of hierarchy levels). We identify an important
subclass of clustering strategies called lattice paths, and
present a dynamic programming algorithm for finding the optimal
lattice path clustering, in time linear in the lattice size.
We additionally propose a technique called snaking, which when
applied to a lattice path, always reduces its cost. For a
representative class of star schemas, we show that for every
workload, there is a snaked lattice path which is globally
optimal. Further, we prove that the clustering obtained by
applying snaking to the optimal lattice path is never much
worse than the globally optimal snaked lattice path clustering.
We complement our analyses and validate the practical utility
of our techniques with experiments using TPC-D benchmark data.