Wave-Indices:
Indexing Evolving Databases
by
Narayanan Shivakumar & Hector Garcia-Molina
Department of Computer Sciense
Stanford
Summarized by Miri Pardo
.
Introduction
The tremendous volumes of data that are today being generated in some applications, require an index into the data of some past window of days. For example Internet search engine may provide an index for the past 30 days of Netnews arcticles.
This paper propose several techniques to build wave indices where the data of a new day can be efficiently added, and old data can be quickly expired, to maintain the required window. It show how these techniques perform under a variety of scenarios, for different volumes of input data and query patterns. These tecniques are being compared based on several system performance measures, such as storage, query response time, and maintenance work, as well as on their simplicty and ease of coding.
Building Wave Indexes
A wave index on a search field F is used to search a collection of days, where each day contains the records generated during a particular time period. The days are partitioned into disjoint clusters and an index on F is built for each. Each individual index is termed a constituent index. The set of constituent indexes is termed the wave index.
One obvious soloution to index a past window of days is to keep a single conventional index, and every day to delete the old data and insert the new batch of data into it. This paper present other interesting way to maintain the required window.
The techniques index a window of W days and partition the data across multiple indexes.
This paper present some ways to build wave index:
( The algorithms can be found here.)
- DEL - Similar to the obvious soloution mentioned above, except that it uses multiple indexes.
- REINDEX - When data of a new day is available the algorithm replace the expired day in the constituent index with the new day by rebuilding the constituent index. The average number of days indexed per transition by REINDEX+ during index build is about half that of REINDEX.
- REINDEX+ - This scheme enhances REINDEX by reducing the average work required in maintain a wave index. REINDEX+ Maintains a temporary index to avoid recomputing the constituent index entries every day.
- REINDEX++ - This scheme improves REINDEX+ by reducing the time to index new data and making new data available sooner for querying. It performs most of the work required in maintaining the wave index before the data is available. For this it uses a few temporary indexes and increase storage requirements. REINDEX++ performs the same amount of work as REINDEX+, but reduces the time to index a new day's data.
- WATA - Wait And Throw Away. This scheme uses a lazy form of deletion by throwing away an entire index only when all its entries have expired. WATA unlike the schemes mentioned above maintain data that is no longer required as part of the window. WATA maintain a soft window.
- RATA - Reindex And Throw Away - A variant of WATA to maintain hard window. RATA is similar except that it uses additional temporary indexes to simulate deleting old entries. RATA perform more work than WATA in order to maintain hard window, However it takes the same time as WATA to index a new day's worth of data after it is available.
Analyzing the Scheme
This paper present some of the advantages of the schemes mentioned above :
- Bulk Insert/Delete : In WATA deletions are performed in bulk by throwing away a whole index. When there is a substantial number of deletes, this may be more efficient than deleting an entry at a time (as in DEL). Similarly, it may be efficient to reindex data, like REINDEX, REINDEX+ and REINDEX++ do, if the index does not cover too many other days.
- Better Structured Index : Reindex may be more costly because it rebuilds indexes from scratch, but this rebuilding can often lead to a better structured index. Such an index could lead to more efficient query processing. There is a trade off between more index build time for better query performance.
- Simple Code : With REINDEX, REINDEX+, REINDEX++, WATA and RATA the scheme do not use complex deletion code. This could be a great advantage if we are implementing our system from scratch.
- Legacy Systems : Some information retrieval indexing packages do not implement delets at all. In such cases DEL can not used .
- Query Performance : Each scheme presented has multiple indexes, this creates more work for queries. However when query volume may be relatively low and data volumes may be high, the high query costs may be amortized by the savings under some of the categories listed above.
Results
The results based on three case studies (copy detection, web engine, and warhousing). For each case there is an application scenario (e.g. datasize, hardware speed). The results are an illustration of performance trends and of the process to follow in selecting a particular wave index scheme.
- REINDEX requires the minimal amount of space. This is because REINDEX maintain packed indexes, and REINDEX does not have any additional temporary indexes like REINDEX+, REINDEX++ or RATA. All schemes require less space as the number of constituent indexes increases.
- The total work done is very sensitive to the mix of queries and updates. For example, if we have many queries in a day, it is best to perform more work at update time in order to obtain an index that is better for queries (e.g., packed and small number of constituent indexes).
- The total work done in case of packed shadowing is significantly less. This is of course because packed shadowing does deletion while copying, and because it is efficient for queries that scan the whole index.
- In application where the queries volumes high, and window size, REINDEX perform the worst. REINDEX perform poorly even when the number of constituent indexes increases since the cost savings of reindexing fewer days in constituent index is offset by the increased cost of queries.
- Before choosing a particular scheme to implement we should consider carefully both whether we may ever want a large window size, and if so, how much do we expect data to increase by.
The results indicate that each of the wave index schemes has advantages and could be useful in some specific scenario, depending on what the centeral performance metrics are .
An idea for future research :How different wave indices perform when multiple disks and multiple processors are used.
More Information