by
Yossi Matias, S. Muthukrishnan,
Suleyman
Cenk Sahinalp, Jacob Ziv
Summarized by Genady Garber
Introduction
In this paper presented two algorithmic
problems, one from the area of Data Compression and the other from Information
Retrieval. The main results are highly efficient (linear or near linear
time) algorithms for these problems. All these algorithms rely on the suffix
tree, a versatile data structure in combinatorial pattern matching. Suffix
trees, with suitably simple augmentations, have found numerous applications
in string processing. In these applications too, was augmented the suffix
tree with extra edges and additional information.
Problems & Backgroung
1. The Document
Listing Problem
We are given a set of documents T = (T1, . . . , Tk) for preprocessing.
Given a query pattern P , the problem is to output a list of
all the documents that contain P as a substring. A related query
is where we are required to merely report the number of documents that
contain P. The previously known algorithm that solves this problem
in O(|P|) time is based on data structures for computing
lowest common ancestor (LCA) queries. The document listing problem is of
great interest in information retrieval and has independently been formulated
in many scenarios, for example in discovering gene homologies.
-The Suffix-DAG Data Structure
o Build the suffix-DAG of documents
T1, . . ., Tk,
in O(t) = O(S tk),
using O(t) space
o The suffix-DAG of T, denoted
by SD(T), contains the generalized suffix tree, GST(T), of
the set T at its core.
o A GST(T) is defined to be the
compact trie of all the suffixes of each of the documents
in T.
o Each leaf node l in GST(T) is
labeled with the list of documents which have a suffix
represented by the
path from the root to l.
o The substring represented by
a path from the root to any given node n denoted by P(n)
o The nodes of SD(T) are the nodes
of GST(T)
o The edges of SD(T) are of two
types:
* the skeleton edges of SD(T) are the edges of GST(T)
* the supportive edges of SD(T) are defined as follows:
for any nodes n1 and n2
in SD(T) there is a pointer edge from n1
to n2 if and only if:
- n1 is an ancestor of n2
- among the suffix trees ST(T1), ST(T2), . . . , ST(Tk) there exists at
least one,
say ST(Ti), which has two nodes , n1,i
and n2,i
such that:
# P(n1) = P(n1,i)
,P(n2) = P(n2,i)
# n1,i
is the parent of n2,i
- such an edge is labeled with i for all relevant documents in Ti
o In order to respond to the count
and list queries, one of the standard data structures
that support least
common ancestor (LCA) queries on SD(T) in O(1) time was built.
o Also each internal node n of
SD(T) contains
* array that stores its supportive edges in pre-order fashion
* number of documents which include P(n) as substring
2. The (a,b)-HYZ
Compression Problem
Given a binary string T of length
t. We are asked to replace disjoint blocks (substrings) of size b
with desirably shorter codewords. The codewords are selected in a way that
it would be possible for a corresponding decompression algorithm to compute
the original T out of the string of codewords. This is done as follows.
Say the first i-1 blocks has been compressed. To compute codeword cj
for block j should be determined its context :
the context of block T[i:l] is the longest substring
T[k:i-1], k<i of size at most a such that
T[k:l] occures earlier in T. The codeword cj
is the ordered pair <g,l> where g
is the length of context and l is rank of block
j with respect to the context, according to some predetermined ordering.
The (a,b)HYZ
compression scheme is based on the intuition that similar symbols in data
appear in similar contexts.
-Example
o T = 010011010
o a
= 2 , b
= 1
o the context of block 9 is T[7 : 8] = 01
o the two substrings which follow earlier occurrences
of this context are T[3] = 0
and T[6] = 1
o the lexicographic order of block 9 amount these substrings
is 1
-Theorem 1
There is an algorithm to implement the compression scheme
C(a,b)
which runs in O(tb) time and requires
O(tb) space, independent of a
-Theorem 2
There is an algorithm to implement the compression method
Ca,b
for a = log t and b
= log log t in O(t) time using O(t) space
Proof Sketch
For any descendents wi of v which are b chars apart from
v, the DataStructure enables to compute lexicographic order of the path
between any wi and v, allowing easy computation of codeword of block of
size brepresented by path between v and wi
* The algorithm exploits the fact
that the context size is bounded, and its seeks
similarities
between suffixes of the input up to a small size.
Results and conclusions
Document listing problem
Processing k documents in linear time ( O(Si
ti) ) and space
Time to answer a query with pattern P is
O( |P| logk + out)
(a,b)-HYZ compression problem
1) unbounded a,
complexity O(tb)
Gives linear time for b
= O(1)
The only previously known algorithm, where for
b = O(1), the author
presents an O(ta) time algorithm, and for unbounded
a this running time is O( )
2) a = O(log t), b
= log log t, complexity O(t)
There is no any previously known algorithms