Most early mining systems, as hinted before, had only loose coupling between the mining algorithm and the physical place where the data to be processed was stored. Instead, mining was performed mainly on file systems, with specialized data structures and algorithms devised for them. Only recently has it been proposed to integrate mining with databases, including adding extensions to known query languages (mainly SQL) to support mining. Hence, DMQL and M-SQL were born.
But why use a query language to implement mining algorithms in the first place? Using an existing query language has several advantages, chiefly that the mining process can then exploit the data management and query processing capabilities the DBMS already provides.
The main architecture suggested by the authors has the following layout: the mining procedure is expressed using extended SQL or a graphical language, and is processed by a preprocessor + optimizer, which generates SQL-92 code for a relational engine or SQL-OR code for an object-relational engine. This architecture will be referred to as the SQL architecture, and it is compared with the following alternative architectures:
Read directly from DBMS: Data is read tuple by tuple from the DBMS to the mining process using a cursor. There are two variations of this approach: loose coupling, in which the DBMS runs in a different address space from the mining process, and stored procedure, in which the mining algorithm is implemented as a stored procedure and therefore executes within the address space of the DBMS (this saves process context switching in the CPU). In both flavors, the mining results are stored back in the DBMS.
Cache mine: This option is a variation of the stored procedure approach presented above. After the entire data is read from the DBMS for the first time, it is cached temporarily on the local disk. The cached data is discarded once the mining process terminates.
User-defined function (UDF): The mining algorithm is expressed as a collection of user-defined functions (UDFs). A UDF is simply an SQL procedure that returns a scalar value. Note that here almost all of the processing is done by the UDFs, and little use is made of the DBMS advantages mentioned above (the DBMS simply provides the input to the UDFs). This architecture can be seen as an extreme case of the SQL-OR approach in the SQL architecture (in the SQL architecture, the UDFs are lightweight).
The Problem
The problem to be solved by the different architectures defined above is the association rule mining problem. An association rule can be defined as follows: given a set of transactions (a transaction is a set of items), the expression X->Y is an association rule, where X and Y are sets of items. The intuitive meaning of the rule is that transactions which contain the items in X tend to also contain the items in Y. An example of such a rule might be: "30% of transactions that contain beer also contain diapers; 2% of all transactions contain both these items". 30% is called the confidence of the rule, while 2% is referred to as the support of the rule.
The association rule mining problem is therefore defined as finding all the rules that satisfy a user-defined minimum support and minimum confidence. The problem can be decomposed into two sub-problems: first, finding all frequent itemsets, i.e. the itemsets whose support is at least the minimum support; and second, generating from each frequent itemset the rules that meet the minimum confidence. The rest of the discussion concentrates on the first sub-problem.
The Solution
The solution used is the Apriori algorithm: the algorithm performs several passes over the data. Each pass has two phases: candidate generation, in which we create a superset Ck of all frequent k-itemsets (sets with k items), and support counting, in which we determine which of the elements of Ck actually occur frequently in the real data. The candidates in Ck that turn out to be frequent compose Fk, which is used to generate the candidates Ck+1 of the next pass.
The Implementation
We represent the input data in a 2-column transaction table T, in which we keep pairs of tid (transaction identifier) and item (item identifier).
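Just to make the schema concrete, here is a minimal sketch of what the tables described above might look like; the column names are my own assumptions, not the article's (the candidate tables Ck hold k item columns, the frequent-itemset tables Fk additionally hold a support count):

    -- Input data: one row per (transaction, item) pair.
    create table T (
        tid  integer not null,   -- transaction identifier
        item integer not null    -- item identifier
    );

    -- Frequent 2-itemsets found in pass 2 (F3, F4, ... are analogous;
    -- the candidate tables C2, C3, ... look the same without the count).
    create table F2 (
        item1 integer not null,
        item2 integer not null,
        cnt   integer not null   -- support count
    );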
The first part of the Apriori algorithm, candidate generation, can be easily implemented using SQL-92: we perform a join step, in which Fk-1 is joined with itself (an SQL self-join) to obtain Ck, and then a prune step, in which we remove from Ck every element that has at least one (k-1)-subset which is not in Fk-1 (the logic behind this step is that frequent sets of size k must be built from frequent sets of size k-1; thus, a set cannot be frequent if one of its "parents" is infrequent).
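As an illustration only (my sketch, not the article's exact query, using the assumed column names above and pass k = 3):

    -- Join step for k = 3: pair frequent 2-itemsets that share their first
    -- item (items within an itemset are assumed to be kept in sorted order).
    insert into C3 (item1, item2, item3)
    select i1.item1, i1.item2, i2.item2
    from   F2 i1, F2 i2
    where  i1.item1 = i2.item1
    and    i1.item2 < i2.item2;

    -- Prune step: for k = 3 the only subset not already guaranteed frequent
    -- is {item2, item3}, so drop candidates for which it is missing from F2.
    delete from C3
    where not exists (select 1 from F2
                      where F2.item1 = C3.item2
                      and   F2.item2 = C3.item3);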
The second part of the algorithm (support counting) is more time consuming. Several implementations (the first two are SQL-92 based, the others are SQL-OR based) are presented, with a short discussion of each one, followed by a summary of the performance tests conducted to compare them.
K-way joins: In the k-th pass, we join Ck with k copies of the transaction table, and perform a group by on the resulting itemsets (to count the support of each candidate). For k = 2 we can use a small optimization which replaces C2 with a self-join of F1.
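Roughly, and again only as a sketch for pass k = 3 with the assumed schema (:minsup stands for a host variable holding the minimum support count), the k-way join query has this shape:

    -- Support counting by k-way join: each candidate is matched against
    -- three copies of the transaction table, then counted per itemset.
    insert into F3 (item1, item2, item3, cnt)
    select c.item1, c.item2, c.item3, count(*)
    from   C3 c, T t1, T t2, T t3
    where  t1.item = c.item1
    and    t2.item = c.item2
    and    t3.item = c.item3
    and    t1.tid  = t2.tid
    and    t2.tid  = t3.tid
    group by c.item1, c.item2, c.item3
    having count(*) >= :minsup;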
Subquery-based: Here we use the common prefixes in Ck (don't forget that candidates "share" the frequent sets they were created from) to reduce the number of necessary joins. In the l-th subquery, we find all tids that contain the distinct itemsets formed by the first l columns of Ck; this is done by joining the output of the (l-1)-th subquery with T and with those distinct prefixes. For pass 2 the special optimization of the K-way join approach is used.
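A possible shape of this nested formulation for pass k = 3 (again a sketch under the same assumptions, and assuming each item appears at most once per transaction):

    -- Subquery-based support counting: level l keeps only the tids that
    -- contain the distinct l-column prefixes of C3.
    insert into F3 (item1, item2, item3, cnt)
    select q3.item1, q3.item2, q3.item3, count(*)
    from (
        select d3.item1, d3.item2, d3.item3, t3.tid
        from   T t3,
               (select d2.item1, d2.item2, t2.tid
                from   T t2,
                       (select d1.item1, t1.tid
                        from   T t1,
                               (select distinct item1 from C3) d1
                        where  t1.item = d1.item1) q1,
                       (select distinct item1, item2 from C3) d2
                where  q1.item1 = d2.item1
                and    q1.tid   = t2.tid
                and    t2.item  = d2.item2) q2,
               (select distinct item1, item2, item3 from C3) d3
        where  q2.item1 = d3.item1
        and    q2.item2 = d3.item2
        and    q2.tid   = t3.tid
        and    t3.item  = d3.item3
    ) q3
    group by q3.item1, q3.item2, q3.item3
    having count(*) >= :minsup;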
Note: both approaches above, which rely strictly on SQL-92 (without object-relational extensions), were found to be much slower than the SQL-OR solutions presented below, so they will be omitted from the discussion that follows. The SQL-OR implementations are detailed below:
GatherJoin: Generates all k-item combinations of the items contained in a transaction, joins them with Ck, and performs a group by on the join result. In order to do so, two table functions are used (a table function is a virtual table which generates tuples on the fly; it is an OR extension): Gather, which generates for each transaction the list of items contained in it, and Comb-K, which returns all k-item combinations of a transaction. The combinations are then joined with Ck and the result is grouped and counted. The special pass-2 optimization is used here as well, joining F1 with itself in order to obtain C2.
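Schematically, and only as my guess at how such a query might be phrased (the argument lists and output column names of Gather and Comb-K are assumptions, not taken from the article), the GatherJoin query for pass k = 3 could look like:

    -- Table functions Gather and CombK (assumed names and signatures) expand
    -- each transaction into its 3-item combinations, which are joined with C3.
    insert into F3 (item1, item2, item3, cnt)
    select c.item1, c.item2, c.item3, count(*)
    from   C3 c,
           (select t2.itm1, t2.itm2, t2.itm3
            from   T,
                   table (Gather(T.tid, T.item))       as t1,
                   table (CombK(t1.tid, t1.item_list)) as t2) comb
    where  comb.itm1 = c.item1
    and    comb.itm2 = c.item2
    and    comb.itm3 = c.item3
    group by c.item1, c.item2, c.item3
    having count(*) >= :minsup;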
GatherCount: This is a variation of the GatherJoin approach, in which the group by operation is performed inside the table function for the second pass. In order to do so, C2 is represented as a 2-dimensional array. Performing this conversion for k > 2 is more complicated, as it would force us to maintain Ck in a hash table.
GatherPrune: A variation of GatherJoin in which the join with Ck is performed inside the table function (so the number of combinations the table function outputs for later processing is reduced). In order to do so, however, Ck needs to be converted into a BLOB (Binary Large OBject) and passed as an argument to the table function.
Horizontal: This is yet another variation of the GatherJoin approach, in which the Gather table function first transforms the input data into horizontal format (i.e. each transaction together with all the items it contains), and then continues like GatherJoin.
Vertical: This is a somewhat different approach from GatherJoin and its variations. Here we create, for each item in the data, a BLOB containing all the transactions in which it appears. In the support-counting phase, for each itemset in Ck, the tid-lists of its items are collected and the size of their intersection is counted.
SQL-OR Comparisons
All the SQL-OR approaches given above were tested using four datasets (measuring the execution time of the Apriori algorithm under the different implementations). The Horizontal approach was far worse than the others, because of its overhead in transforming the data into horizontal format, so it is omitted from the discussion from here on.
The Vertical approach had the best overall performance. Note that, as opposed to Horizontal, the time to convert the data into vertical format is smaller, since only the frequent items are converted, and their number is much smaller than the number of transactions.
GatherJoin's performance was worse in the higher passes of the algorithm, because of the large number of combinations generated by the table function. It did perform better than Vertical, however, in some cases in the second pass of the algorithm; in other cases, GatherCount was the better option.
Two factors that affect the choice between GatherJoin, GatherCount and Vertical are the number of frequent items (on which the cost of Vertical depends quadratically) and the average number of frequent items per transaction (on which the cost of GatherJoin depends).
Architecture Comparisons
Following the conclusions above, the final SQL-based architecture used in the comparison with the other architectures is a hybrid scheme which chooses the best of the three approaches GatherJoin, GatherCount and Vertical for each pass of the algorithm, based on cost estimates.
The tests were conducted using the same datasets that were used for the previous comparisons. The following sections summarize different aspects of the comparison between the architectures.
Timing comparison: Cache-Mine was found to have the best performance in most cases, closely followed by the SQL architecture. Stored procedure and UDF had by far the worst timing performance.
Scale-up comparison: Performing the tests with an increasing number of transactions and an increasing average transaction length did not change the picture: all approaches scale linearly in both cases, but Cache-Mine leads with the lowest curve.
Space comparison: Here UDF and stored procedure require the least amount of space. Cache-Mine and SQL need extra storage roughly the size of the data (for the disk cache in Cache-Mine, and for the vertical representation in SQL, which converts the data into a new format).
Porting: When porting from one OS to another (same DBMS), the SQL and UDF architectures are more portable (the other two approaches are file-system dependent). Porting from one DBMS to another (same OS), however, is harder for SQL and UDF, since these approaches rely on SQL-OR features (UDFs, table functions) which are not implemented in every DBMS.
Final notes
The Cache-Mine alternative was found to be, in most cases, the best alternative to use. The results do show, however, that the SQL architecture is a legitimate alternative that may have advantages in some aspects (such as porting), and using it does not incur a heavy performance penalty.
In my opinion, what is a little lacking in the article is a more detailed description of how the other approaches (Cache-Mine, Stored procedure, UDF) are implemented. Having this knowledge would have given a better understanding of the results of the final comparison between the architectures (that is, the results for the SQL architecture can be explained based on the rest of the article, but the results for the other three approaches are given as is, without knowing which implementation was used for each architecture).