Workshop in Bioinformatics
(0368-3500-34)
Instructor : Prof. Benny Chor, Schrieber 223.
Teaching Assistant : Amos Tanay, Computational Genomics Lab, Schrieber
011.
Where : Shenkar (Physics), Rm 114
When : Mondays 11:00-13:00, Spring 2003.
Mailing list at http://listserv.tau.ac.il/archives/cs0368-3500-32.html
The workshop will provide hands on experience in implementing
various optimization techniques
and applying them to combinatorial optimization problems with computational
biology origins.
Many real world problems are NP hard or even hard to approximate, so
often one employs heuristics
with no mathematical guarantee of performance, that "behave well" on
practical problem instances.
Computational Biology offer some of the hottest (and most difficult)
optimization problem in science today,
frequently involving huge solution spaces and multi goals. We shall learn
and then apply some of the best
heuristics to study questions in areas such as analyzing gene expression
data sets from DNA chips, phylogenetic
tree reconstruction, multiple sequence alignment, gene finding, RNA secondary
structure prediction, and others.
An integral part of the workshop is giving
a presentation of the projects to all participants
(not just the course
staff). It is mandatory to be physically present during this entire part
of the workshop, which is expected to take
during the last three meetings (May 19, May 26, June 2nd). Naturally,
your projects will have to be completed
by that time.
Goals, scope :
- Analysis, design and implementation of combinatorial optimization
algorithms
- Using hybrid algorithms for real world problem solving
- Some contemporary problems in comparative genomics, DNA chips
analysis and more.
- Building efficient local search procedures
- Efficient implementation of algorithms in C++ and STL.
Timetable
March 23
|
Project(s) preferences submitted
|
March 31
|
Project selection confirmed
|
April 7
|
A two pages plan (general approach and software design)
submitted
|
April 28
|
Progress report submitted.
|
May 19- June 2
|
Presentation given in class.
Final reports submitted.
|
Assigned Projects
Liat greenshtein and Taly bukra
|
The signature algorithm for gene expression analysis
|
Sharon Mograbi and Oren Bahat
|
Regulatory motif search in cancer data |
Shai Ophir
|
Linear motif-expression model optimization using GA
|
Ido Liberti and Yariv Eisenberg
|
Implementation of SVM learning with a small number
of features
|
|
|
|
|
Suggested Workshop Projects:
1. Implementation of SVM learning with a small number of features
Support vectore machines (SVMs) are one of the most effective and widely
used approaches to supervised learning.
In the basic schenario, we are given two sets of points in some Euclidean
space, and wish to find a linear
separator, or a hyperplane, separating the two sets. If such separation
is possible, we are further interested in an optimal
one, where optimality is defined in terms of an induced margin. Given two
such sets of points, the optimization problem
is a quadratic optimazation problem. There are some solutions available
on the web, but they tend to be slower and
produce numerical instabilities when the dimension is low (2 to
10, say). The task here is to write your own SVM code and test it
on data from various biological sources. See the excellent book Learning with Kernels: Support
Vector Machines,
Regularization, Optimization and Beyond, by Bernhard Schölkopf
and Alex Smola, MIT Press, 2002.
2. Fast linear separation in low dimensional
spaces
Support vectore machines (SVMs) are one of the most effective and widely
used approaches to supervised learning.
In the basic schenario, we are given two sets of points in some Euclidean
space, and wish to find a linear
separator, or a hyperplane, separating the two sets. If such separation
is possible, we are further interested in an optimal
one, where optimality is defined in terms of an induced margin. Now in
many cases most data sets will not be separable,
so in such cases it makes sense to first filter the data and check for
linear separability, and apply the relatively heavy SVM
code only in positive cases.
Your task is to write a fast separability checker. It will work in low
dimension (2 to 10) but should be able to check 10 million
sets of size (50,50) in approximately 1-2 minutes. If interested,
please approach the instructor for more details.
3. Linear motif-expression model optimization using GA
The goal in this project is to model gene
expression in a set of different conditions as a linear combination of
transcription factor activities, which are detected via appearances of
specific DNA signals upstream each of the genes. Specifically, we are given
a set of genes, each with a promoter sequence and a vector of gene
expression in various conditions. The promoter should in principle contain
sequence motifs that determine the level of activity of the downstream
gene at each condition. We make the simplifying assumption by which a
small set of sequence motifs (6-mers, 7-mers) fully determine expression.
The idea is that the expression of each gene is a linear combination of
the activities of those k-mers that
are present in its promoter. We do not know in advance which motifs are
important and what are their activities. To find these, we perform a genetic
algorithm to find the motifs
and least square minimization to find the activities given the motifs.
The project will use yeast data for testing a hopefully efficient implementation
of the algorithm.
Ref: Bussemaker
et al. Nature genetics, 2001.
4. The Genome Alignment problem
The idea in
this project is to find significant similarities in two complete genomes
(human and mouse). We know that human and mouse genomes are rather conserved
but
simple alignment of their sequences is not practical due to many rearrangement
of the genome which permute parts of the two sequence with respect to their
common ancestor. A simple
strategy for identifying conserved regions in mouse and human is using
local alignment, but such strategy possibly lose much structure. Our goal
here is to examine several possibilities
for improved alignment of conserved regions. The development should be
theoretical in part, with prototyping to prove the concepts.
Ref: Mouse genome
issue, Nature 2002.
5. Regulatory motif search in cancer data
This project
will be dedicated for analysis of cancer gene expression and searching for
DNA motifs in promoter of co-regulated genes. The idea is to search exhaustively
for DNA
motifs that characterize significantly up or down regulated cancer genes
and then generalize them using a simple EM algorithm. The exhaustive search
will use hashing for efficient
screening of all k-mers with possible degeneracy. The EM algorithm will
use high scoring exact motif and generalize them into a PSSM (simple Markov
model) such that the mean
expression of genes with the motif will be optimized. We will test the
algorithm on clinical data from lung and breast cancer studies.
Ref: Lung cancer
paper
Breast cancer paper
Bussemaker paper
6. The Signature algorithm for gene expression analysis
The signature
algorithm is a simple method for finding dependency patterns (or biclusters)
in gene expression data. The algorithm is a two phase process using as
input
a collection of genes and a matrix of gene expression. The idea is to
start with some interesting set of genes and refine it by zooming on all
condition which somehow characterize it.
The main technique is the analysis of mean expression in the set of genes
or conditions and the comparison of it to an expected normal distribution.
The goal of this project is
to test the signature algorithm with real and synthetic data and present
detail analysis of its behavior (analysis here is as important then the
simple implementation).
Ref: Ihmels
et al. Nature Genetics 2002
7. Finding network motifs
Recent experimental
techniques in biology have led to the collection of large number of interaction
among different biological factors including
genes and proteins. An interesting global view of these data collections
stress properties of the interaction graphs and seeks for simple, yet deep,
phenomenon in its structure. A popular model in that respect is the scale
free model (or power law generative model which was successful used to model
the internet). A different approach was recently suggested by Shen-Orr
et al. The idea is to search for over-represented network motifs (small
isomorphic subgraphs) and view the graph as a combination of such subgraphs.
In this project we will search for such motifs in protein network
graphs available on the yeast system. To find network motifs we count
their appearence compared to a random graph model preserving some characteristics
of the graph. Apart from programming the motif search algorithm efficiently,
you will have to analyze your results carefully, making sure motifs are
not artifacts.
Ref: Shen-Orr
et al. Nature Genetics 2002
Milo et al. Science 2002
8. Gene Prediction by Spectral Rotation (SR) Measure – a new method
for identifying protein-coding regions.
This project is based on a recent work by D. Kotlar and Y. Lavner, where
spectral methods and FFT are used for gene finding in
eukaryots. You will have to implement the algorithm, apply it to
several genomes, and compare the results to other methods and to
published literature.
If interested, please contact the instructor for more details.