Modern Optimization in Computational Biology

Workshop in Bioinformatics
(0368-3500-34)

Instructor : Prof. Benny Chor, Schrieber 223.
Teaching Assistant : Amos Tanay, Computational Genomics Lab, Schrieber 011.
Where : Shenkar (Physics), Rm 114
When : Mondays 11:00-13:00, Spring 2003.
Mailing list at http://listserv.tau.ac.il/archives/cs0368-3500-32.html

The workshop will provide hands on experience in implementing various optimization techniques
and applying them to combinatorial optimization problems with computational biology origins.
Many real world problems are NP hard or even hard to approximate, so often one employs heuristics
with no mathematical guarantee of performance, that "behave well" on practical problem instances.
Computational Biology offer some of the hottest (and most difficult) optimization problem in science today,
frequently involving huge solution spaces and multi goals. We shall learn and then apply some of the best
heuristics to study questions in areas such as analyzing gene expression data sets from DNA chips, phylogenetic
tree reconstruction, multiple sequence alignment, gene finding, RNA secondary structure prediction, and others.

An integral part of the workshop is giving a presentation of the projects to all participants (not just the course
staff). It is mandatory to be physically present during this entire part of the workshop, which is expected to take
during the last three meetings (May 19, May 26, June 2nd). Naturally, your projects will have to be completed
by that time.

Goals, scope :

Analysis, design and implementation of combinatorial optimization algorithms
Using hybrid algorithms for real world problem solving
Some contemporary problems in comparative genomics, DNA chips analysis and more.
Building efficient local search procedures
Efficient implementation of algorithms in C++ and STL.

Timetable

March 23	Project(s) preferences submitted
March 31	Project selection confirmed
April 7	A two pages plan (general approach and software design) submitted
April 28	Progress report submitted.
May 19- June 2	Presentation given in class. Final reports submitted.

Assigned Projects

Liat greenshtein and Taly bukra	The signature algorithm for gene expression analysis
Sharon Mograbi and Oren Bahat	Regulatory motif search in cancer data
Shai Ophir	Linear motif-expression model optimization using GA
Ido Liberti and Yariv Eisenberg	Implementation of SVM learning with a small number of features

Suggested Workshop Projects:

1. Implementation of SVM learning with a small number of features

Support vectore machines (SVMs) are one of the most effective and widely used approaches to supervised learning.
In the basic schenario, we are given two sets of points in some Euclidean space, and wish to find a linear
separator, or a hyperplane, separating the two sets. If such separation is possible, we are further interested in an optimal
one, where optimality is defined in terms of an induced margin. Given two such sets of points, the optimization problem
is a quadratic optimazation problem. There are some solutions available on the web, but they tend to be slower and
produce numerical instabilities when the dimension is low (2 to 10, say). The task here is to write your own SVM code and test it
on data from various biological sources. See the excellent book Learning with Kernels: Support Vector Machines,
Regularization, Optimization and Beyond, by Bernhard Schölkopf and Alex Smola, MIT Press, 2002.

2. Fast linear separation in low dimensional spaces

Support vectore machines (SVMs) are one of the most effective and widely used approaches to supervised learning.
In the basic schenario, we are given two sets of points in some Euclidean space, and wish to find a linear
separator, or a hyperplane, separating the two sets. If such separation is possible, we are further interested in an optimal
one, where optimality is defined in terms of an induced margin. Now in many cases most data sets will not be separable,
so in such cases it makes sense to first filter the data and check for linear separability, and apply the relatively heavy SVM
code only in positive cases.

Your task is to write a fast separability checker. It will work in low dimension (2 to 10) but should be able to check 10 million
sets of size (50,50) in approximately 1-2 minutes. If interested, please approach the instructor for more details.

3. Linear motif-expression model optimization using GA

       The goal in this project is to model gene expression in a set of different conditions as a linear combination of transcription factor activities, which are detected via appearances of specific DNA signals upstream each of the genes. Specifically, we are given a set of genes, each with a promoter sequence and a vector of gene expression in various conditions. The promoter should in principle contain sequence motifs that determine the level of activity of the downstream gene at each condition. We make the simplifying assumption by which a
small set of sequence motifs (6-mers, 7-mers) fully determine expression. The idea is that the expression of each gene is a linear combination of the activities of those k-mers that
are present in its promoter. We do not know in advance which motifs are important and what are their activities. To find these, we perform a genetic algorithm to find the motifs
and least square minimization to find the activities given the motifs. The project will use yeast data for testing a hopefully efficient implementation of the algorithm.

            Ref: Bussemaker et al. Nature genetics, 2001.

4. The Genome Alignment problem

            The idea in this project is to find significant similarities in two complete genomes (human and mouse). We know that human and mouse genomes are rather conserved but
simple alignment of their sequences is not practical due to many rearrangement of the genome which permute parts of the two sequence with respect to their common ancestor. A simple
strategy for identifying conserved regions in mouse and human is using local alignment, but such strategy possibly lose much structure. Our goal here is to examine several possibilities
for improved alignment of conserved regions. The development should be theoretical in part, with prototyping to prove the concepts.

            Ref: Mouse genome issue, Nature 2002.

5. Regulatory motif search in cancer data

            This project will be dedicated for analysis of cancer gene expression and searching for DNA motifs in promoter of co-regulated genes. The idea is to search exhaustively for DNA
motifs that characterize significantly up or down regulated cancer genes and then generalize them using a simple EM algorithm. The exhaustive search will use hashing for efficient
screening of all k-mers with possible degeneracy. The EM algorithm will use high scoring exact motif and generalize them into a PSSM (simple Markov model) such that the mean
expression of genes with the motif will be optimized. We will test the algorithm on clinical data from lung and breast cancer studies.

            Ref: Lung cancer paper
                Breast cancer paper
                    Bussemaker paper

6. The Signature algorithm for gene expression analysis
            The signature algorithm is a simple method for finding dependency patterns (or biclusters) in gene expression data. The algorithm is a two phase process using as input
a collection of genes and a matrix of gene expression. The idea is to start with some interesting set of genes and refine it by zooming on all condition which somehow characterize it.
The main technique is the analysis of mean expression in the set of genes or conditions and the comparison of it to an expected normal distribution. The goal of this project is
to test the signature algorithm with real and synthetic data and present detail analysis of its behavior (analysis here is as important then the simple implementation).

            Ref: Ihmels et al. Nature Genetics 2002

7. Finding network motifs

            Recent experimental techniques in biology have led to the collection of large number of interaction among different biological factors including
genes and proteins. An interesting global view of these data collections stress properties of the interaction graphs and seeks for simple, yet deep,
phenomenon in its structure. A popular model in that respect is the scale free model (or power law generative model which was successful used to model the internet). A different approach was recently suggested by Shen-Orr et al. The idea is to search for over-represented network motifs (small isomorphic subgraphs) and view the graph as a combination of such subgraphs. In this project we will search for such motifs in protein network
graphs available on the yeast system. To find network motifs we count their appearence compared to a random graph model preserving some characteristics of the graph. Apart from programming the motif search algorithm efficiently, you will have to analyze your results carefully, making sure motifs are not artifacts.

            Ref: Shen-Orr et al. Nature Genetics 2002
                    Milo et al. Science 2002

8. Gene Prediction by Spectral Rotation (SR) Measure – a new method for identifying protein-coding regions.

This project is based on a recent work by D. Kotlar and Y. Lavner, where spectral methods and FFT are used for gene finding in
eukaryots. You will have to implement the algorithm, apply it to several genomes, and compare the results to other methods and to
published literature.

If interested, please contact the instructor for more details.

Workshop in Bioinformatics (0368-3500-34)

Timetable

Assigned Projects

Suggested Workshop Projects:

Workshop in Bioinformatics
(0368-3500-34)