Workshop in Bioinformatics

(0368-3500-34)

Instructor : Prof. Benny Chor (benny@cs.tau.ac.il), Schreiber, room 223.
Teaching Assistants : Tamir Tuller (tamirtul@post.tau.ac.il), Schreiber, room M20 (basement), and Giora Unger (giorau@post.tau.ac.il), Schreiber, room 012.
Where : Schreiber, Room 008.
When : Mondays 11:00-13:00, Spring 2004.

Students of the bioinformatics track will have higher priority in registering to this workshop.

Goals & scope

Analysis, design and implementation of combinatorial optimization algorithms with bioinformatics relevance.
Some contemporary problems in comparative genomics, DNA chips analysis, phylogenetic analysis, and more.
Getting acquainted with publicly available Bioinformatics databases and using them.
Conducting supervised research in computational biology.
Efficient implementation of algorithms in C, C++, Java or Matlab.

The workshop will provide hands on experience in implementing various optimization techniques, developing statistical tools for analyzing biological data, and using public biological databases. Most (but not all) projects have both research and implementation aspects. We offer projects in various current subjects in computational biology, including analysis of gene expression data and DNA microarrays, analysis of gene structure, understanding genomic pathways, construction of phylogenetic trees and networks, etc.

The workshop also includes four lectures on various topics in software engineering, given by the Computer Science system staff. These will include subjects relevant to ant medium to large scale software project. Students are expected to utilize these tools and techniques in the final projects and documentation. Attending these four lectures is mandatory.

An integral part of the workshop is presenting an outline of the project, as well as the completed project,
to all participants (not just the course staff). It is mandatory to physically attend class during these entire parts of the workshop,
which are expected to take place on March 29th (outlines presentation) and during the last three meetings (final projects presentation).
Naturally, your projects should be completed by then, and submitted no later than the last day of the semester.

Projects will be performed in groups consisting of at most two students (if the number of students is odd, one triplet may be allowed).

In the first two weeks, each group must choose two preferred projects (a blue one and a red one) and send them to Giora (giorau@post.tau.ac.il).
We will try to match choices with availability, but in case of collisions, assignments will be based on the time at which the request reaches Giora's mailbox.

Tentative Timetable (subject to changes)

March 1	Lecture: Introduction to BioInformatics. Administratrivia.
March 8	Lecture: Projects' Overview
March 15	Project selection completed.
March 15	Lecture on CVS by Alon Shalita from the CS system staff (slides, summery)
March 22	Lecture on parallel processing and Condor by Edward Aronovich from the CS system staff (slides)
March 29	Outlines of all planned projects presented by students in class (10 minutes per group)
April 19	Lecture on databases and MySQL by Alon Shalita from the CS system staff
May 5	Lecture on databases, MySQL, and packaging by Alon Shalita from the CS system staff
May 31	Highly recommended (free, registration required): Annual Israeli BioInformatics day in the Crowne Plaza Hotel, Jerusalem
June 7	Presentation of completed projects in class (20 minutes per group). Will take place 11-13 and 14-16.
June 11	Last date for projects submission, including a detailed presentation of the projects to the course staff.

Assigned Projects

Giora Unger

Yaara Azaria, Maya Mograbi, Daniela Raijman	Extending linear separability tests from pairs to triplets and quadruples, and allowing violations
Eran Ophir, Amir Segall	MAD MEX - motif finding
Udi Altshuler, Elad Mazor	Extending linear separability tests from SVM to additional classifiers
Tamir Tuller
Tal Peled, Uri Zonnens	Simultaneous identification of duplication and lateral transfer
Igor Ulitsky, Dudu Burstein	A phylogenetic tree construction algorithm for complete genomes
Tal Tamir, Lior Gad	Gene prediction by spectral rotation measure and LZW compression algorithm
Erez Makabi, Eyal Megran	Constructing Bayes nets

Suggested Workshop Projects (List Subject to Change)

Phylogenetic Trees and Networks

1. Simultaneous identification of duplication and lateral transfer

Overview
This project is based on recent work by M.Hallet, J.Lagergren and A.Tofigh. The work introduce a combinatorial model that incorporate duplication events as well as lateral gene transfer event. The goal is to explain the deference between a gene tree an a species tree. In the project you will implement two algorithms from the article and apply it on synthetic and biologic data.

References

M.Hallet, J.Lagergren and A.Tofigh, Simultaneous identification of duplications and lateral transfer, 2004.

What to do

1. Reading and understanding the article of M.Hallet, J.Lagergren and A.Tofigh.
2. Implementation of two algorithms from the article.
3. Applying the methods on synthetic and biologic data.

2. Using Hadamard transform and the LogDeterminant for construction of phylogenetic trees.

Overview

The spectral analysis of sequences and distances data is Hadamard transform based method for phylogenetic tree construction. The LogDeterminant transform provides distances which allows the correct tree to be recovered consistency when sequences differ markedly in nucleotide frequencies. In this project you will implement and analyze those two methods for phylogenetic tree construction.

References

Steel,M.., Lockhart,P.J. and Penny,D. (1993) Confidence in evolutionary trees from biological sequence data. Nature, 364, 440–442.

M.D.Hendy. Spectral analysis of phylogenetic data. 1993.

M.D.Hendy, D.Penny, M.A.Steel. A discrete fourier analysis for evolutionary trees. 1993.

Additional (optional) References

Lockhart,P.J., Steel,M.A., Hendy,M.D. and Penny,D. Recovering evolutionary trees under a more realistic model of sequence

evolution. Mol. Biol. Evol., 11, 605–612, 1994

Steel,M., Huson,D. and Lockhart,P.J. (2000) Invariable sites models and their use in phylogeny reconstruction. Syst. Biol., 49, 225–232

D.Penny, M.Hasegawa, P.J.Waddel, M.D.Hendy. Mammalian evolution: timing and implications from using the LogDeterminant Transform for proteins of differing amino acid composition. 1999.

What to do

1. Reading and understanding the references
2. Implementation of two methods for reconstruction of phylogenetic tree, partial code will be given by Tamir.
3. Applying the methods on synthetic and biologic data, analyzing the results and checking the number of maximum points reached by those methods on biological data.

3. A phylogenetic tree construction algorithm for complete genomes.

Overview
In this project you will implement an new LZW based algorithm for generating a phylogenetic tree, this algorithm is different from other known method in two main point: it doesn’t restricted to any nucleotide - substitution model and it can be applied to complete genomes.

References

Http://www.Data-compression.com/vq.html.

More material will be given by Tamir.

What to do

1. Implementation of some LZW based algorithms for finding distances between genomes.
2. Using Matlab or any other public software for generating a phylogenetic tree from the distances matrix found in stage 1.
3. Download complete genomes from NCBI, applying and checking the method on those genomes.
4. More precise details will be given by Tamir.
.

Gene Expression Data and DNA Microarrays

DNA microarray data normalization tool

Overview
Gene expression data extracted from DNA microarrays undergo various manipulations, normalizations etc., before being used by algorithmic tools. There are numerous approaches and methods for normalizing such data, depending on the type of microarray (e.g. cDNA or Affymetrix chips) and other factors. In this project, you’ll write a platform that would facilitate several options for normalizing already-processed gene expression data.

References
O. Troyanskaya, M. Cantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. Altman. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520-525, 2001.

S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87, 2002.

M. Dettling and P. Buhlmann. Boosting for tumor classification with gene expression data. Bioinformatics, 19(9):1061-9, Jun 2003.

Giora Unger's M.Sc. thesis dissertation – “Linear Separability and Classifiability of Gene Expression Datasets”.

What to do
1. Implement a normalization platform, that would support:
    a. Imputation of missing data for cDNA datasets, using the kNN (k-Nearest-Neighbours) method.
       This method is presented in Troyanskaya et. al, is explained clearly in Dudoit et. al (section 3.2), and was used also in Dettling & Buhlmann.
    b. Normalizing Affymetrix datasets, in a manner similar to that described in Dettling & Buhlmann (for the “Leukemia” and
        “Estrogen and Nodal” datasets) and in Giora Unger’s M.Sc. thesis (appendix B). A skeleton code for that can be provided by Giora.
2. Support convenient user interface
3. Demonstrate the operation of the platform on several publicly available datasets (to be defined later with Giora)
4. Bonus: implement additional methods for missing data imputation, as described in Troyanskaya et. al.
Extending linear separability tests from SVM to additional classifiers (AdaBoost, naive Bayes, geometric Bayes etc.)
Overview
One of the most promising (and challenging) applications of gene expression data is classification of cancer tumors. In this project you’ll extend an existing classification tool, based on pairs of genes that exhibit certain unexpected geometric characteristics.

References
Giora Unger's M.Sc. thesis dissertation – “Linear Separability and Classifiability of Gene Expression Datasets”.

What to do
1. Read Giora’s M.Sc. thesis. Note, that basically 3 classifiers are used there – SVM, simple majority and weighted voting.
2. Study standard classification algorithms – AdaBoost, NaiveBayes, Geometric Bayes, Distance from Average and another one of your choice.
3. Integrate these classifiers into the code (to be provided be Giora).
4. Demonstrate the results.

Note: using existing, ready made code for the classifiers is highly encouraged, so that most of the work would be integrating them into the existing platform.
Extending linear separability tests from pairs to triplets and quadruples.
Overview
As briefly introduced in class, and detailed in Giora’s M.Sc. thesis, in many gene expression datasets an interesting geometric phenomenon was discovered: Many pairs of genes exhibit linear separability. In this project, you’ll extend the separability test carried out there to testing larger groups of genes – triplets and quadruples.

References
Giora's M.Sc. thesis dissertation – “Linear Separability and Classifiability of Gene Expression Datasets”.

What to do
1. Read Giora’s M.Sc. thesis. Note, that basically 3 classifiers are used there – SVM, simple majority and weighted voting.
2. Understand the algorithm and the relevant code (to be provided be Giora).
3. Implement the test for triplets and quadruples.
4. More precise details – to be finalized with Giora at a later stage.
A tool for adjoining and unifying gene expression datasets
Overview
One of the prominent problems with current gene expression datasets is that different laboratories generate data in different formats. It is often desired to “merge” several datasets together, thus creating one larger dataset. In this project you’ll write a tool for adjoining and unifying gene expression datasets. In addition, automatic tools for generating annotations based on genes’ IDs are sorely missing, making it much harder to infer biological data from algorithmic outputs.

References
CAMDA 2003 contest web page - http://www.camda.duke.edu/camda03/datasets/

What to do
1. Write a tool for unifying datasets.
2. Write a tool for generating a well-known IDs for a dataset (e.g.: accession number and locus-link), based on
Affymetrix probe IDs – to be elaborated by Giora at a later stage.
3. Download the four datasets from the CAMDA 2003 site.
4. Unify these four datasets (possibly only three of them – to be discussed with Giora at a later stage)
5. Demonstrate the ID extraction tool on several datasets (to be provided by Giora)

Analysis of Gene Structure, Introns and Exons

Gene prediction by spectral rotation measure and LZW compression algorithm.

Overview
One of the major problems in bioinforamtics is gene finding. In this project you will write a gene prediction software. Your algorithm will combine two methods, one is based on FFT and the other is based on the LZW compression algorithm.

References

D.Kotlar, Y.Lavner. Gene prediction by spectral rotation (SR) measure : a new method for identifying protein-coding regions, Genome Research 13:1930-1937, 2003

Http://www.Data-compression.com/vq.html.

What to do

1. Implement an LZW based algorithm for gene prediction.
2. Implement a combined LZW/FFT based algorithm for gene prediction by using neuron network/SVD or any other learning method (after discussing with the Tamir), the code for the FFT method will be given by the instructor.
3. Checking the performance of the method on several genomes and compare it to other methods.

Understanding Genomic Pathways

Efficient implementation of data structures and algorithms for Bayesian Networks

Overview
DNA hybridization array simultaneously measure expression level for thousands of genes. This measurements provide a “snapshot” of the transciption level within the cell, one of the methods for discovering interactions between genes base on multiple expression measurements is based on Bayesian network, which is a graph based model of joint multivariate probability distribution that capture properties of conditional dependence between variables. In this project you will write a code for learning a Bayesian network according to a known algorithm and you’ll check it on synthetic and real data

Remark: If needed, this project can be expanded to two projects, so that two groups may work (independently) on them.

References

D. Margarities and S. Thrun. Bayesian network induction via local Neighborhoods, 1999.

N.Friedman, M.Linial, I.Nachman, D.pe’er. Using Bayesian network for analyze expression data, 2000.

What to do

1. Write a code for a known Bayesian network inferring algorithm.
2. Test the performances of the algorithm on synthetic data, which you will sample from a known Bayesian network, and on real biological data.
3. Improving the performance of the algorithm on biological data (for more details please contact Tamir).

Tools for inferring pathways based on gene expression data
Overview
One of the key issues in computational biology in understanding complex biological systems using computational tools. A prominent example is inferring biological pathways based on various types of data, gene expression data being one of them. In this project you'll develop several tools for testing the validity of a given function, in light of data (input and output), for assessing the quality of the function and for expnading and improving it, and so on.

References
To be provided later.

What to do
This project will require larger amounts of research, and might be split into two projects, possibly of two groups working separately but interacting. The precise tasks to be implemented would be defined later.