Suggested Projects
Algorithms for massive data sets
Internet technologies
Image & video databases
Below is a list of suggested projects concerning massive data sets
and Internet technologies. Further projects on Internet technologies,
and projects concerning image & video databases, may be provided
if there is explicit interest.
Projects 1 & 2: Identifying frequent items in massive data sets
Identifying frequent items in data sets is one of the fundamental
components in applications such as data mining, decision support
systems, data warehousing, and query optimization. The purpose of this
umbrella project is to build and optimize essential components of a
system for identifying frequent items in massive data sets. The system
should ultimately involve a highly efficient execution engine which
would be able to deal with large files (hundreds of gigabytes and even
terabytes of data), and a user interface for entering specifications
and reporting results.
Project 1:
This project is to be written from scratch, without the use of any
additional software. The following modules compose the entire system for
identifying frequent item sets. Each module will be separate and will
interact with the other modules through a well-defined API.
- User Interface: a graphical interface that allows the user to control
  the parameters of the query, including intervening in the query during
  execution. The interface also displays intermediate and final results of
  the query execution. Two optional add-ons:
  - graphical visualization of the results, which involves more graphics
    work and concerns the area of data visualization;
  - providing a client over the Web, which involves HTML, JavaScript, and CGI.
- Query engine: the core of the system. It includes implementations of
  three algorithms that perform the actual query of identifying frequent
  item sets. The algorithms require sampling and hashing techniques
  while using storage space efficiently. Due to the size of the data,
  the algorithms use disk as additional intermediate storage space; we
  assume that moderate disk space is available. An additional, more
  complex algorithm is optional, and is designed for the case where
  intermediate storage is a very scarce resource and must be used
  minimally.
- Data management module: the data will reside on the file system.
  Implementation of this module includes an advanced and novel sorting
  module, read and write operations on the data, and an interface
  exposing the data to the query engine.
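The project text does not name the three algorithms, but a classic counter-based technique for finding frequent items in one pass over massive data is the Misra-Gries algorithm. The sketch below is an illustration of that general approach under our own assumptions, not the project's specified algorithms: it keeps at most k-1 counters, so memory stays tiny regardless of file size, and a second pass confirms exact counts for the surviving candidates.

```python
def misra_gries(stream, k):
    """One-pass candidate finder: any item occurring more than len(stream)/k
    times is guaranteed to survive among the at most k-1 counters kept."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def exact_counts(stream, candidates):
    """Second pass over the data: exact frequencies for the candidates only."""
    counts = dict.fromkeys(candidates, 0)
    for item in stream:
        if item in counts:
            counts[item] += 1
    return counts
```

In a disk-based system the two passes become two sequential scans of the file, which is exactly the access pattern the data management module above would have to support.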
Project 2:
This project includes the use of commercial databases (preferably
Informix or Oracle). The following modules compose the entire system for
identifying frequent item sets. Each module will be separate and will
interact with the other modules through a well-defined API.
- User Interface: a graphical interface that allows the user to control
  the parameters of the query, including intervening in the query during
  execution. The interface also displays intermediate and final results of
  the query execution. Two optional add-ons:
  - graphical visualization of the results, which involves more graphics
    work and concerns the area of data visualization;
  - providing a client over the Web, which involves HTML, JavaScript, and CGI.
- Query engine: the core of the system. It includes implementations of
  three algorithms that perform the actual query of identifying frequent
  item sets. The algorithms require sampling and hashing techniques
  while using storage space efficiently. Due to the size of the data,
  the algorithms use disk as additional intermediate storage space; we
  assume that moderate disk space is available. An additional, more
  complex algorithm is optional, and is designed for the case where
  intermediate storage is a very scarce resource and must be used
  minimally.
- Data management module: the data will be stored in a commercial
  database (preferably Informix or Oracle). This module will function as
  an interface between the database and the query engine; the
  implementation includes an interface layer that communicates with the
  database's gateway. Prior knowledge of databases is useful but not
  required; moreover, upon completion of this project, the members of the
  group will have gained substantial knowledge of databases.
Project 3: Bundle sorting
Sorting is a frequent operation in many applications. It is used not
only to produce sorted output, but also in many sort-based algorithms
such as grouping with aggregation, duplicate removal, sort merge join,
as well as set operations including union, intersection, etc. Many data
sets that are to be sorted consist of a limited number of distinct keys.
Sorting such data sets can be thought of as bundling together identical
keys and placing the bundles in order; thus, this problem is called
`bundle sorting'. Recently it was shown that there is an algorithm for
this problem that circumvents the lower bound for general sorting (as
long as the number of distinct keys is moderate) by taking advantage of
the number of distinct keys.
The purpose of this project is to implement this algorithm with a
strong emphasis on performance. For comparison, a general sorting
algorithm (merge sort) will also be implemented and execution times
will be compared. The ultimate goal of this project is to provide an
extremely efficient implementation and to be able to compete in one
of the worldwide benchmarks for sorting algorithms, where we hope to
show an improvement over general-purpose sorting algorithms for
certain data sets.
The modules of this project are as follows:
- Core algorithms: efficient implementations of the bundle sorting
  algorithm and, for comparison, the merge-sort algorithm.
- File management module: the sorted files will reside on disk. This
  module provides the interface for the algorithms and will be capable of
  performing the read and write operations on the files.
- Performance monitor module: keeps track of the total running time of
  the algorithms, the performance of specific parts, etc.
- User interface (optional): visualization of the sorting process while
  the sorting is performed. Since this will slow execution, this module
  will be an add-on that can simply be turned off. This module involves
  more graphics work.
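The bundling idea can be illustrated in memory before worrying about disk passes. The sketch below is our own minimal illustration, not the published external-memory algorithm: it counts the bundle sizes in one pass, computes each bundle's starting offset in key order, and then places every record directly into its slot, beating the O(n log n) comparison bound when the number of distinct keys k is small.

```python
def bundle_sort(records, key=lambda r: r):
    """Sort records with few distinct keys by bundling identical keys.

    Two passes: count each key's bundle size, then place every record
    directly into its bundle's slot. Runs in O(n + k log k) time for
    k distinct keys, versus O(n log n) for general comparison sorting.
    """
    counts = {}
    for r in records:
        counts[key(r)] = counts.get(key(r), 0) + 1
    # Starting offset of each bundle in key order.
    offset, offsets = 0, {}
    for kval in sorted(counts):
        offsets[kval] = offset
        offset += counts[kval]
    out = [None] * len(records)
    for r in records:
        out[offsets[key(r)]] = r
        offsets[key(r)] += 1
    return out
```

The external-memory version replaces the output array with on-disk bundles and makes multiple passes when the available memory cannot hold one counter and one output buffer per distinct key; the in-memory placement logic stays the same.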
Project 4: Index-based association rules
One of the most well studied problems in data mining is mining for
association rules in market basket data. Association rules, whose
significance is measured via support and confidence, are intended to
identify rules of the type, ``A customer purchasing item A often also
purchases item B.'' Motivated by the goal of generalizing beyond market
baskets and the association rules used with them, we develop the notion
of index-based association rules. The idea is to partition the data
according to some index, and to find correlations between data items
within each index value or across different index values.
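To make the measures concrete: a rule A -> B has support equal to the fraction of baskets containing both A and B, and confidence equal to the fraction of baskets containing A that also contain B. The sketch below, with a data layout of our own choosing (a mapping from index values, e.g. weeks or stores, to lists of baskets), computes both measures separately within each partition.

```python
def rules_by_index(transactions, antecedent, consequent):
    """Support and confidence of `antecedent -> consequent`, per index value.

    `transactions` maps each index value to a list of baskets (item sets);
    the usual association-rule measures are computed within each partition.
    """
    results = {}
    for idx, baskets in transactions.items():
        n = len(baskets)
        has_a = sum(1 for b in baskets if antecedent <= b)
        has_ab = sum(1 for b in baskets if (antecedent | consequent) <= b)
        results[idx] = {
            "support": has_ab / n if n else 0.0,
            "confidence": has_ab / has_a if has_a else 0.0,
        }
    return results
```

Comparing the per-partition measures against the same rule's measures on the whole data set is one way to surface correlations that exist only within, or only across, particular index values.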
This project is much more research-oriented than projects 1-3, and
the exact algorithms are subject to development over the course of the
project. Nevertheless, it includes the following software modules that
must be implemented:
- data management module capable of reading and writing data from the
  file system;
- sorting algorithm for performing the sort according to indexes;
- module for identifying frequent item sets (just the core algorithms
  from projects 1 & 2);
- user interface for presenting results of the query.
This project has the potential to develop into a nice paper, since this
territory of association rules has not yet been explored.
Project 1: Implementation of Web assistants:
Web assistants are utilities that help users gain more value from
browsing by modifying Web pages according to a set of rules. The
purpose of this project is to implement Web assistants in Java.
Successful implementations may be incorporated into a larger-scale
project. Examples of tasks that could be part of the project:
- Glossary: words in HTML pages are enhanced, according to a specified
  dictionary, with a link to a glossary that provides a short
  explanation in a separate window. The task is to provide an efficient
  glossary, which involves an efficient implementation of a problem
  related to dictionary matching. It involves several rules, such as
  context, synonyms, etc.
- Form handling: an application that takes care of filling in forms
  with user information, based on given rules and data.
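The core of the glossary assistant is dictionary matching: find every occurrence of any dictionary term in the page text and wrap it with a link. A serious implementation would use an algorithm such as Aho-Corasick to match all terms in one scan; the hedged sketch below (the function name, the `glossary.html#anchor` link scheme, and the regex approach are all our own illustration, and the project itself targets Java) shows the transformation on plain text, matching longer terms first so that, e.g., "hash table" wins over "hash".

```python
import re

def add_glossary_links(text, glossary):
    """Wrap each dictionary term in `text` with a glossary link.

    `glossary` maps terms to anchor names. Longer terms are tried first
    by building the alternation in decreasing length order. A production
    assistant would use Aho-Corasick dictionary matching instead of one
    large regex, and would skip text already inside HTML tags.
    """
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")

    def link(match):
        term = match.group(1)
        return '<a href="glossary.html#%s">%s</a>' % (glossary[term], term)

    return pattern.sub(link, text)
```

The context and synonym rules mentioned above would sit on top of this matching layer, deciding for each hit whether the link should actually be inserted.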
Other Web assistants are available for the projects. Ambitious
students can also propose improvements and new Web assistants.
Project 2: Tools related to privacy over the Internet:
Details will be provided separately.
For requests or corrections contact
matias+seminar@math.tau.ac.il
Last updated April, 1999