Suggested Projects
Algorithms for massive data sets
Internet technologies
Image & video databases
Below is a list of suggested projects concerning massive data sets
and Internet technologies. Further projects on Internet technologies,
and projects concerning image & video databases, may be provided
if there is explicit interest.
Projects 1 & 2: Identifying frequent items in massive data sets
Identifying frequent items in data sets is one of the fundamental
components in applications such as data mining, decision support
systems, data warehousing, and query optimization. The purpose of this
umbrella project is to build and optimize essential components of a
system for identifying frequent items in massive data sets. The system
should ultimately involve a highly efficient execution engine which
would be able to deal with large files (hundreds of gigabytes and even
terabytes of data), and a user interface for entering specifications
and reporting results.
Project 1:
This project is to be written from scratch, without the use of any
additional software. The following modules compose the entire system for
identifying frequent item sets. Each module will be separate and will
interact with the other modules through a well-defined API.
- User Interface: a graphical interface that allows the user to control
  the parameters of the query, including intervening in the query during
  execution. The interface also displays intermediate and final results of
  the query execution. Two optional add-ons:
  - graphical visualization of the results, which involves more graphics
    work and concerns the area of data visualization;
  - providing a client over the Web, which involves HTML, JavaScript, and CGI.
- Query engine: the core of the system. It includes implementations of
  three algorithms that perform the actual query of identifying frequent
  item sets. The algorithms require sampling and hashing techniques
  while using storage space efficiently. Due to the size of the data,
  the algorithms use disk as additional intermediate storage space; we
  assume that moderate disk space is available. An additional, more
  complex algorithm is optional, and is designed for the case where
  intermediate storage is a very scarce resource and must be used
  minimally.
- Data management module: the data will reside on the file system.
  Implementation of this module includes an advanced and novel sorting
  module, read and write operations on the data, and an interface
  exposing the data to the query engine.
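The project text does not name the three algorithms, but a classic counter-based technique for finding frequent items in one pass over massive data is the Misra-Gries algorithm. The sketch below is an illustration of that general approach under our own assumptions, not the project's specified algorithms: it keeps at most k-1 counters, so memory stays tiny regardless of file size, and a second pass confirms exact counts for the surviving candidates.

```python
def misra_gries(stream, k):
    """One-pass candidate finder: any item occurring more than len(stream)/k
    times is guaranteed to survive among the at most k-1 counters kept."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement every counter; drop the ones that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def exact_counts(stream, candidates):
    """Second pass over the data: exact frequencies for the candidates only."""
    counts = dict.fromkeys(candidates, 0)
    for item in stream:
        if item in counts:
            counts[item] += 1
    return counts
```

In a disk-based system the two passes become two sequential scans of the file, which is exactly the access pattern the data management module above would have to support.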
Project 2:
This project includes the use of commercial databases (preferably
Informix or Oracle). The following modules compose the entire system for
identifying frequent item sets. Each module will be separate and will
interact with the other modules through a well-defined API.
- User Interface: a graphical interface that allows the user to control
  the parameters of the query, including intervening in the query during
  execution. The interface also displays intermediate and final results of
  the query execution. Two optional add-ons:
  - graphical visualization of the results, which involves more graphics
    work and concerns the area of data visualization;
  - providing a client over the Web, which involves HTML, JavaScript, and CGI.
- Query engine: the core of the system. It includes implementations of
  three algorithms that perform the actual query of identifying frequent
  item sets. The algorithms require sampling and hashing techniques
  while using storage space efficiently. Due to the size of the data,
  the algorithms use disk as additional intermediate storage space; we
  assume that moderate disk space is available. An additional, more
  complex algorithm is optional, and is designed for the case where
  intermediate storage is a very scarce resource and must be used
  minimally.
- Data management module: the data will be stored in a commercial
  database (preferably Informix or Oracle). This module will function as
  an interface between the database and the query engine; the
  implementation includes an interface layer that communicates with the
  database's gateway. Prior knowledge of databases is useful but not
  required; moreover, upon completion of this project, the members of the
  group will have gained substantial knowledge of databases.
Project 3: Bundle sorting
Sorting is a frequent operation in many applications. It is used not
only to produce sorted output, but also in many sort-based algorithms
such as grouping with aggregation, duplicate removal, sort merge join,
as well as set operations including union, intersection, etc. Many data
sets that are to be sorted consist of a limited number of distinct keys.
Sorting such data sets can be thought of as bundling together identical
keys and placing the bundles in order; thus, this problem is called
`bundle sorting'. Recently it was shown that there is an algorithm for
this problem that circumvents the lower bound for general sorting (as
long as the number of distinct keys is moderate) by taking advantage of
the number of distinct keys.
The purpose of this project is to implement this algorithm with a
strong emphasis on performance. For comparison, a general sorting
algorithm (merge sort) will also be implemented and execution times
will be compared. The ultimate goal of this project is to provide an
extremely efficient implementation and to be able to compete in one
of the worldwide benchmarks for sorting algorithms, where we hope to
show an improvement over general-purpose sorting algorithms for
certain data sets.
The modules of this project are as follows:
- Core algorithms: efficient implementations of the bundle sorting
  algorithm and, for comparison, the merge-sort algorithm.
- File management module: the sorted files will reside on disk. This
  module provides the interface for the algorithms and will be capable of
  performing the read and write operations on the files.
- Performance monitor module: keeps track of the total running time of
  the algorithms, the performance of specific parts, etc.
- User interface (optional): visualization of the sorting process while
  the sorting is performed. Since this will slow execution, this module
  will be an add-on that can simply be turned off. This module involves
  more graphics work.
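The bundling idea can be illustrated in memory before worrying about disk passes. The sketch below is our own minimal illustration, not the published external-memory algorithm: it counts the bundle sizes in one pass, computes each bundle's starting offset in key order, and then places every record directly into its slot, beating the O(n log n) comparison bound when the number of distinct keys k is small.

```python
def bundle_sort(records, key=lambda r: r):
    """Sort records with few distinct keys by bundling identical keys.

    Two passes: count each key's bundle size, then place every record
    directly into its bundle's slot. Runs in O(n + k log k) time for
    k distinct keys, versus O(n log n) for general comparison sorting.
    """
    counts = {}
    for r in records:
        counts[key(r)] = counts.get(key(r), 0) + 1
    # Starting offset of each bundle in key order.
    offset, offsets = 0, {}
    for kval in sorted(counts):
        offsets[kval] = offset
        offset += counts[kval]
    out = [None] * len(records)
    for r in records:
        out[offsets[key(r)]] = r
        offsets[key(r)] += 1
    return out
```

The external-memory version replaces the output array with on-disk bundles and makes multiple passes when the available memory cannot hold one counter and one output buffer per distinct key; the in-memory placement logic stays the same.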
Project 4: Index-based association rules
One of the most well studied problems in data mining is mining for
association rules in market basket data. Association rules, whose
significance is measured via support and confidence, are intended to
identify rules of the type, ``A customer purchasing item A often also
purchases item B.'' Motivated by the goal of generalizing beyond market
baskets and the association rules used with them, we develop the notion
of index-based association rules. The idea is to partition the data
according to some index, and to find correlations between data items
within each index value or across different index values.
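To make the measures concrete: a rule A -> B has support equal to the fraction of baskets containing both A and B, and confidence equal to the fraction of baskets containing A that also contain B. The sketch below, with a data layout of our own choosing (a mapping from index values, e.g. weeks or stores, to lists of baskets), computes both measures separately within each partition.

```python
def rules_by_index(transactions, antecedent, consequent):
    """Support and confidence of `antecedent -> consequent`, per index value.

    `transactions` maps each index value to a list of baskets (item sets);
    the usual association-rule measures are computed within each partition.
    """
    results = {}
    for idx, baskets in transactions.items():
        n = len(baskets)
        has_a = sum(1 for b in baskets if antecedent <= b)
        has_ab = sum(1 for b in baskets if (antecedent | consequent) <= b)
        results[idx] = {
            "support": has_ab / n if n else 0.0,
            "confidence": has_ab / has_a if has_a else 0.0,
        }
    return results
```

Comparing the per-partition measures against the same rule's measures on the whole data set is one way to surface correlations that exist only within, or only across, particular index values.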
This project is much more research-oriented than projects 1-3, and
the exact algorithms are subject to development over the course of the
project. Nevertheless, it includes the following software modules that
must be implemented:
- data management module capable of reading and writing data from the
  file system;
- sorting algorithm for performing the sort according to indexes;
- module for identifying frequent item sets (just the core algorithms
  from projects 1 & 2);
- user interface for presenting results of the query.
This project has the potential to develop into a nice paper, since this
territory of association rules has not yet been explored.
Project 1: Implementation of Web assistants:
Web assistants are utilities that help users gain more value from
browsing by modifying Web pages according to a set of rules. The
purpose of this project is to implement Web assistants in Java.
Successful implementations may be incorporated into a larger-scale
project. Examples of tasks that could be part of the project:
- Glossary: words in HTML pages are enhanced, according to a specified
  dictionary, with a link to a glossary that provides a short
  explanation in a separate window. The task is to provide an efficient
  glossary, which involves an efficient implementation of a problem
  related to dictionary matching. It involves several rules, such as
  context, synonyms, etc.
- Form handling: an application that takes care of filling in forms
  with user information, based on given rules and data.
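The core of the glossary assistant is dictionary matching: find every occurrence of any dictionary term in the page text and wrap it with a link. A serious implementation would use an algorithm such as Aho-Corasick to match all terms in one scan; the hedged sketch below (the function name, the `glossary.html#anchor` link scheme, and the regex approach are all our own illustration, and the project itself targets Java) shows the transformation on plain text, matching longer terms first so that, e.g., "hash table" wins over "hash".

```python
import re

def add_glossary_links(text, glossary):
    """Wrap each dictionary term in `text` with a glossary link.

    `glossary` maps terms to anchor names. Longer terms are tried first
    by building the alternation in decreasing length order. A production
    assistant would use Aho-Corasick dictionary matching instead of one
    large regex, and would skip text already inside HTML tags.
    """
    terms = sorted(glossary, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, terms)) + r")\b")

    def link(match):
        term = match.group(1)
        return '<a href="glossary.html#%s">%s</a>' % (glossary[term], term)

    return pattern.sub(link, text)
```

The context and synonym rules mentioned above would sit on top of this matching layer, deciding for each hit whether the link should actually be inserted.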
Other Web assistants are available for the projects. Ambitious
students can also propose improvements and new Web assistants.
Project 2: Tools related to privacy over the Internet:
Details will be provided separately.
For requests or corrections contact
matias+seminar@math.tau.ac.il
Last updated April, 1999