1

1. Random Projections for motifs finding

Overview:

The random projection algorithm is a randomized algorithm for motif finding. The input is a set of strings and the goal is to find substrings which appears in disproportional fraction of the strings. A challenge problem in this domain is defined by finding planted (l,d) motifs in n random strings of length m where an (l,d) motif is a string of length l with at most d point mutations. The random projection algorithm is very simple and elegant, using randomly selected subsets of the substring and hashing over all the strings.

Refs:

1.Buhler and Tompas

What to do:

Read and understand Buhler and Tompa's paper
Program the random projection algorithm for motif finding.
Test your algorithm on syntetic (15,4), (14,4) and (18,5) datasets (see the paper).
Report: project overview (1-2 pages), source code, report on your results on each of the problem instances, run time statistics.

2. Genscan programming

Overview:

Genscan is a popular gene finding program by Chirs Burge. The goal here is to implement a gene finding
algorithm and test it on the human genome.

Refs:

1.Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol. 268, 78-94.
2.GenScan site

What to do:*******************

3. Coding Blast

Overview:

BLAST is a fast local alignment algorithm, which enjoys very wide popularity
in the biological community and is used by thousands of researchers on a
daily basis. The goal is to search for significant local alignments ofa target
sequence in a large sequences database. In this project you shallprogram
BLAST and test its performence on a sample sequence database.

Ref:s

1.Blast at NCBI
2. Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.

What to do:

Read and understand Altschul et al. paper
Go to NCBI site, BLAST the sequence of the p53 gene against all human sequences from genbank.
Now implement blast: Do not implement the E statistics.
Apply your version to p53 sequence and any sample of 100,000 protein sequences from genbank.
Report: Project overview (1-2 pages), source code, result with NCBI BLAST and with your BLAST.

4. Multiple Alignment :ClustalW

Overview:

ClustalW is a progressive multiple alignment algorithm which use several heuristic to improve the performence of HMM based alignment tool. The idea is to incrementaly refine a Profile HMM that will anchor the target sequences.

Refs:

1. ClustalW at EBI
2. Higgins D., Thompson J., Gibson T.Thompson J.D., Higgins D.G., Gibson T.J.(1994). CLUSTAL W: improving the sensitivity of progressivemultiple sequence alignment through sequence weighting,position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680

What to do:

Read and understand the CLUSTALW paper, you may also use Durbin's book.
Go to GenBank and download a set of globin sequences
go to www.ebi.ac.uk/clustalw, and align your set of globins
Now program ClustalW: The Profile HMM progressive alignment and the heuristics guiding it.
Apply your algorithm to the same set of globins
Submit: project overview (1-2 pages), source code, output of your program and output of the ebi clustalw implementation.

5. SVM using SMO, apply to ALL/AML cancer data

Overview:

SVMs (Support Vector machines) are powerful classifiers which are in use
in several machine learning applications. We are given a set of examples, each
charachterized by an n dimensional real valued vector and a binary number (yes
or no). The SVM predict the binary number (the class) based on the vector
by test the position of the vector relative to some hyperplane (defined by
its normal vector). SMO is a fast SVM training algorithm which is very
easy to implement. In this project you shall implement SMO and use it
to classify cancer patients. The data you will use is derived from
DNA chips that can measure the activity of thousands of genes in a given
tissue. Your SVM will try to distinguish between patients with different
types of tumor based on the gene activity (or expression) profile.

1. Platt's web site on SMO
2. Book chapter about SMO
3. Golub et al., Science Oct 15, 531,537

What to do:

Read and understand plat's paper and Golub's work
Implement SMO
Email us to get the gene expression data set and the correct tissue classification
Train an SVM using your algorithm each time hiding exactly one tissue classifiaction (i.e. train with all but one tissues, predict the missing tissue class). Report your success rate. This type of analysis is called leave one out cross-validation.
Perform feature selection as in Golub paper, report your leave one out results with feature selection of 50 genes and 10 genes.
Report: Project overview (1-2 pages), source code, results on the data set with and without feature selection.

6. Suffix trees library

Overview:

Suffix trees were (or will be)Ý introduced in class. The goal here is to implement and
experiment with them paying attention to performence and implementation
efficiency. To test your algorithm performence you shall use gnu gprof.gprof
can be used with gcc compiled programs and provide detailed report on
the time spent at each of the routines used by a given program. By analyzing
gprofreoprt you can verify your implementation does not spend time in the
"wrong" places, identify bottle necks and try to remove them. A frequent
example in that respect is to optimize the use of pointer dereferecing, to
use iterators instead of direct access to vector objects and to cache
computation whenever possible.

1. Gusfiled book
2. Course notes

What to do:

Read the chapter on suffix trees in Gusfieldís book and the class notes.
Implement a suffix tree object in C++ preferably using STL.
Write a test program that store a long DNA string (50kb) and search for it.
profile your code by compiling it with gcc and the flag -pg and by running
gprof > out after the compiled program was finished succesfully.
Test your program bottle necks, try improving performence by changing your implementation.
Report: Project overview (1-2 pages), source code, run time statistics, gprof output before and after implementation changes.

7. Protein family detection using HMM

Overview:

The goal here is to train a profile HMM that can identify protein families.
The HMM topology is a combination of two standard profile HMM with common
start and end states. During the training process, we hope that one
profile HMM will absorb one type of sequences and the other will take
care of the rest.

Refs:

1. Durbin's Book
2. Class Notes

What to do:

Read course notes on HMM and consult Durbin relevant chapters.
Program the standard HMM algorithms (Viterbi, Forward, Backward, EM)
Use the above to build and train the classifier HMM.
Consider using model surgery to improve performence.
Apply your algorithm to a set of globins from genbank.
Report: Project overview (1-2 pages), source code, results on the globins including number of EM iterations, trained HMM, and the classified set of proteins (to which copy of the profile each protein was selected).

Ý8. TheHadamard Conjugation on Phylogenetic Trees Ý

Overview:

The Hadamard conjugation (Hendy and PennyÝ 1993, Hendy, Penny, and

SteelÝ 1994) is a Fourier type transformation linking the edges lengths of an evolutionary

tree $T$Ý to the probabilities of obtaining given sequences from the same tree.

The transformation is applicable to a numberÝ of simple models of site substitution, like Neyman

2 state model (Neyman 1971), and Jukes--Cantor 4 state model (Jukes and CantorÝ 1969).
For these models, the transformation yields a powerful tool which greatly simplifies and
unifies the analysis of phylogenetic data.

The transformation runs in exponential time (n*2^n), but a careful implementation should yield codethat runs with n up to 20 (or even more) in a reasonable amount of time.

The goal here is to implement the Hadamard conjugation, first on ìtwo stateî DNA (purines/pyramidines) and then on ìnormalî, four state DNA. Run the code on simulated and ìrealî DNA sequences, and derive biological consequences on the phylogenetic tree that best fits the data.

Refs:

1. M. D. Hendy and D. Penny, Spectral analysis of phylogenetic data,

  Journal of Classification, 1993, vol. 10, pp. 5-24.

2. M. D. Hendy, D. Penny and M.A. Steel, Discrete Fourier analysis for evolutionary trees,

    PNAS, vol. 91, pp. 3339-3343, 1994.

3. B. Chor, M. Hendy, B. Holland, and D. Penny,`Multiple Maxima of Likelihood in

   Phylogenetic Trees: An Analytic Approach.'', Molecular Biology and Evolution

Vol. 17, No.10, September 2000, pp. 1529--1541.

What to do:

Read the papers and understand the transformation (start with 2 states characters).
Program the Hadamard conjugation.
Apply your algorithm to a set of examples (to be supplied by the lecturer).
Report: Project overview (1-2 pages), source code, results on the sequences, including graphical ìbarsî representations of edge lengths, and most reasonable trees.

9. Maximum Likelihood Phylogenetic Trees Reconstruction

10. Maximum Parsimony Phylogenetic Trees Reconstruction

11. Radiation Hybrid Mapping

Radiation hybrid (RH) mapping is a somatic cell technique for ordering markers

along a chromosome, and estimating the physical distances between them.

This mapping techniques was used extensively in the Human genome project,

and is still widely used in ongoing genome projects like the (wounded) cat, dog, and cow.

Analyzing the experimental data is still a challenging and demanding computational task.

In the first two papers referenced below the software package

RHO (radiation hybrid ordering) is described. The package implements a number of

heuristics that attempt to order genomic markers along a chromosome,

given as input the results of a radiation hybrid experiment.

The heuristics are based on reducing an appropriate optimization problem to

TSP (traveling salesman problem). The reduced optimization problemis either the non-parametric OCB (obligate chromosome breaks),

or the parametric MLE (maximum likelihood estimation).

The third paper describes similar approaches, based on TSP, using a publicly available software package.

References

1. A. Ben-Dor, B. Chor, and D. Pelleg, ``Radiation Hybrid Ordering (RHO)'',

Genome Research , Vol. 10, Issue 3, March 2000 ,pp. 365--378.

2. A. Ben-Dor and B. Chor, ``On Constructing Radiation Hybrid Maps'',  Jour. of Comput. Biology, Vol. 4, No. 4,  November 1997, pp. 517--533.

3. Agarwala R, Applegate DL, Maglott D, Schuler GD, Schaffer AA.ÝìA fast and scalable radiation hybrid map construction

and integration strategyî. Genome Res. 2000 Mar;10(3):350-64.

What to do:

Read the papers and understand the underlying techniques (including the Held-Karp method for obtaining lower bounds on TSP.
Program the hueristics for upper and lower bounds (you can use public available TSP code, like that used in reference (3), or like the LEDA package.
Apply your algorithm to a set of synthetic and "real" data from various organisms (to be discussed with the lecturer).
Compare your maps to other RH maps published in the literature.
Report: Project overview (1-2 pages), source code, results on the sequences, including graphical ìbarsî representations of edge lengths, and most reasonable trees.

12. OPSM

Order preserving submatrices is a tool to analyze DNA microarray data.

Comment:This project is harder than most projects but can lead to an M.Sc. thesis.

13. Antibodies Family Identification

Antibodies are proteins that are "designed" to specifically identify and target "foreign" antigenes (typically proteins themselves).

The high specificity of the antibodies originates from highly variable intervals of DNA, that are generated using mechanisms such as

high mutation rate and recombinations. These mechanisms make the task of grouping antibodies to families harder, especially to the

"linearly oriented" HMM model. The goal of this project is to develop a toolbox for tackling this problem. An efficient solution may save a lot of redundant lab work, employed in current technologies when trying to identify the activity and target of specific antibodies.

I do not have a good survey paper yet, but if you're interested in this project please approach me directly.

Comment: This project is harder than most projects, and unusual in the sense that we are not aware of any existing solutions. Of course, these disadvantages makes it a potentially good target for an M.Sc. thesis.