The random projection algorithm is a randomized algorithm for motif finding. The input is a set of strings and the goal is to find substrings which appears in disproportional fraction of the strings. A challenge problem in this domain is defined by finding planted (l,d) motifs in n random strings of length m where an (l,d) motif is a string of length l with at most d point mutations. The random projection algorithm is very simple and elegant, using randomly selected subsets of the substring and hashing over all the strings.
Refs:
What to do:
2. Genscan programming
Overview:
Genscan is a popular gene finding program by Chirs Burge. The goal here
is to implement a gene finding
algorithm and test it on the human genome.
Refs:
1.Burge, C. and Karlin, S. (1997) Prediction of complete gene structures
in human genomic DNA. J. Mol. Biol. 268, 78-94.
2.GenScan site
What to do:*******************
3. Coding Blast
Overview:
BLAST is a fast local alignment algorithm,
which enjoys very wide popularity
Ref:s
in the biological community and is used
by thousands of researchers on a
daily basis. The goal is to search for
significant local alignments ofa
target
sequence in a large sequences database.
In this project you shallprogram
BLAST and test its performence
on a sample sequence database.
1.Blast
at NCBI
2. Altschul, S.F., Gish,
W., Miller, W., Myers, E.W. & Lipman,
D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.
What to do:
ClustalW is a progressive multiple alignment algorithm which use several heuristic to improve the performence of HMM based alignment tool. The idea is to incrementaly refine a Profile HMM that will anchor the target sequences.
Refs:
1. ClustalW
at EBI
2. Higgins D., Thompson J., Gibson T.Thompson
J.D., Higgins D.G., Gibson T.J.(1994). CLUSTAL
W: improving the sensitivity of progressivemultiple
sequence alignment through sequence weighting,position-specific
gap penalties and weight matrix choice. Nucleic Acids Res. 22:4673-4680
What to do:
5. SVM using SMO, apply to ALL/AML
cancer data
Overview:
SVMs
(Support Vector machines) are powerful classifiers which are in use
in several machine learning applications.
We are given a set of examples, each
charachterized
by an n dimensional real valued vector and a binary number (yes
or no). The SVM predict the binary number
(the class) based on the vector
by test the position of the vector relative
to some hyperplane (defined by
its normal vector). SMO is a fast SVM
training algorithm which is very
easy to implement. In this project you
shall implement SMO and use it
to classify cancer patients. The data
you will use is derived from
DNA chips that can measure the activity
of thousands of genes in a given
tissue. Your SVM will try to distinguish
between patients with different
types of tumor based on the gene activity
(or expression) profile.
1. Platt's
web site on SMO
2. Book
chapter about SMO
3. Golub
et al., Science Oct 15, 531,537
What to do:
Suffix trees were (or will be)Ý introduced
in class. The goal here is to implement and
experiment with them paying attention
to performence and implementation
efficiency. To test your algorithm performence
you shall use gnu gprof.gprof
can be used with gcc
compiled programs and provide detailed report on
the time spent at each of the routines
used by a given program. By analyzing
gprofreoprt
you can verify your implementation does not spend time in the
"wrong" places, identify bottle necks
and try to remove them. A frequent
example in that respect is to optimize
the use of pointer dereferecing, to
use iterators
instead of direct access to vector objects and to cache
computation whenever possible.
1. Gusfiled
book
2. Course notes
What to do:
The goal here is to train a profile HMM
that can identify protein families.
The HMM topology is a combination of
two standard profile HMM with common
start and end states. During the training
process, we hope that one
profile HMM will absorb one type of
sequences and the other will take
care of the rest.
Refs:
1. Durbin's Book
2. Class Notes
What to do:
Ý8.
TheHadamard Conjugation on Phylogenetic
Trees Ý
Overview:
The Hadamard conjugation (Hendy and PennyÝ 1993, Hendy, Penny, and
SteelÝ 1994) is a Fourier type transformation linking the edges lengths of an evolutionary
tree $T$Ý to the probabilities of obtaining given sequences from the same tree.
The transformation is applicable to a numberÝ of simple models of site substitution, like Neyman
2 state model
(Neyman 1971), and Jukes--Cantor 4 state model
(Jukes and CantorÝ 1969).Refs:
1. M. D. Hendy and D. Penny, Spectral analysis of phylogenetic data,
Journal of Classification, 1993, vol. 10, pp. 5-24.
2. M. D. Hendy, D. Penny and M.A. Steel, Discrete Fourier analysis for evolutionary trees,
PNAS, vol. 91, pp. 3339-3343, 1994.
3. B. Chor, M. Hendy, B. Holland, and D. Penny,`Multiple Maxima of Likelihood in
Phylogenetic Trees: An Analytic Approach.'', Molecular Biology and EvolutionVol. 17, No.10, September 2000, pp. 1529--1541.
What
to do:
11.
Radiation Hybrid Mapping
Radiation hybrid (RH) mapping is a somatic cell technique for ordering markers
along a chromosome, and estimating the physical distances between them.
This mapping techniques was used extensively in the Human genome project,
and is still widely used in ongoing genome projects like the (wounded) cat, dog, and cow.
Analyzing the experimental data is still a challenging and demanding computational task.
In the first two papers referenced below the software package
RHO (radiation hybrid ordering) is described. The package implements a number of
heuristics that attempt to order genomic markers along a chromosome,
given as input the results of a radiation hybrid experiment.
The heuristics are based on reducing an appropriate optimization problem to
TSP (traveling salesman problem). The reduced optimization problemis either the non-parametric OCB (obligate chromosome breaks),
or the parametric MLE (maximum likelihood estimation).
The third paper describes similar approaches, based on TSP, using a publicly available software package.
References
1. A. Ben-Dor, B. Chor, and D. Pelleg, ``Radiation Hybrid Ordering (RHO)'',
Genome Research , Vol. 10, Issue 3, March 2000 ,pp. 365--378.
2. A. Ben-Dor and B. Chor, ``On Constructing Radiation Hybrid Maps'', Jour. of Comput. Biology, Vol. 4, No. 4, November 1997, pp. 517--533.
What
to do:
Comment:This
project is harder than most projects but can lead to an M.Sc. thesis.
13. Antibodies Family Identification
Antibodies are proteins that are "designed" to specifically identify and target "foreign" antigenes (typically proteins themselves).
Comment: This project is harder than most projects, and unusual in the sense that we are not aware of any existing solutions. Of course, these disadvantages makes it a potentially good target for an M.Sc. thesis.