INTRODUCTION
This paper presents efficient ways for executing
iceberg queries. An iceberg query computes an aggregate function
over an attribute ( or set of attributes) in order to find aggregate values
above a specified threshold.
The prototypical iceberg query the paper considers (can be easily extended
to the other forms of iceberg queries) is :
SELECT attr1, attr2, ...,
attrk, COUNT( rest)
FROM R
GROUP BY attr1, attr2, ..., attrk
HAVING COUNT( rest)
>= T
Where R is a relation that contains attributes attr1, attr2, ..., attrk, rest and T is a threshold.
COMMON USES
Iceberg queries can be found
in data warehousing, data mining, information retrieval and copy detection.
To name but a few of its applications :
TECHNIQUES FOR COMPUTING ICEBERG QUERIES
Trivial techniques such as :
All the algorithms use as building blocks well known techniques such as :
The common goal of all the algorithms is to minimize the number of false positives.
They can generally be described using these phases :
1. Compute a candidate set
f using sampling
2. Execute bucket counting
over R with some dependency on f
3. Select the values that
exceed the threshold using another scan of R.
Several optimization methods are presented that use a combination of multiple hash functions and multiple scans over R.
RESULTS
Using a "case study" approach to evaluate the algorithms
empirically, the following conclusion was finalized : Using multiple scans,
reduces the number of false positives significantly, but takes more time.
Thus, the method that used one less scan was in the overall best, because,
not only it produced less false positives, it was also quicker.