Spring 2009

Computational Genomics

0382.3102

http://www.cs.tau.ac.il/~bchor/CG06/comp-genom.html

Instructor: Prof. Benny Chor

Teaching Assistant: Dr. Igor Ulitsky

Lectures: Wednsdays 9:00-12:00, Schreiber 008

Recitations: Thursdays 12:00-13:00, Schreiber 006

Contact Info:

		Email	Phone	Office	OfficeHours
Instructor	Prof. Benny Chor	benny AT cs.tau.ac.il	640-5977	Schreiber 223	e-appointment
Teaching Assistant	Dr. Igor Ulitsky	Ulitskyi AT gmail.com	640-5394	Schreiber 011	e-appointment

Course Bulletin Board

Date posted	Contents
March 22	Assignment 1 has been posted (see link below)
April 11	Assignment 2 has been posted (see link below)
May 19	Assignment 3 has been posted (see link below)

Course Outline

The Human Genome Project has completed the sequencing of the human DNA. Other genome projects of various organisms (bacteria, fungi, worms, rice, wheat, flies, chicken, mice and primates) are either completed or in the process of advanced sequencing. New bio-technologies produce huge amounts of biological data, whose manual analysis not possible. These advances have pushed for a new scientific discipline of bioinformatics, which integrates computer science, statistics, mathematics, and various biological fields. In this introductory course we concentrate on the algorithmic aspects of various computational tasks with biological origins. We will explore their computational complexity, and ways to solve them even if they are computationally intractable, such as approximation algorithms and heuristics.

Course topics include comparisons and alignments of biological sequences; Suffix trees and their uses for analyzing very long sequences (like complete genomes); Reconstruction of evolutionary trees using various optimization criteria, such as maximum parsimony, maximum likelihood, and distance criteria; Physical mapping of genomes; Genome alignment; Probabilistic models, such as PSSM (Position Specific Scoring Matrix) and HMM (Hidden Markov Models); HMM algorithms (Viterby, Expectation Maximization); Biological applications of the probabilistic models (e.g. gene prediction, protein classification, and sequence alignment); Analyzing biological networks; DNA microarrays and their uses; Clustering gene expression data; Classification using SVM (Support Vector Machines), and applications to clinical diagnosis; Metagenomics.

Tantative Course Plan (subject to changes):

Lecture	Date	Subject	Reference(s)	Recitation
1	4/3/09	Introduction and administrativia. Pairwise Sequence Alignment: global & local.	Kanehisa, ch. 3 Gusfield, ch. 11. Durbin et. al., ch. 2.	no recitation this week
2	11/3	Similarity and distance; Affine gap penalties; Fasta and Blast heuristics.	Durbin et. al., ch. 2. Gusfield, ch. 11	More pairwise sequence allignments
3	18/3	Scoring functions: PAM and BLOSUM Multiple sequence alignment	Durbin et. al., 2.7 Gusfield, ch. 11,15	MSA
4	25/3	Suffix trees - Introduction (Dan Geiger pdf) Suffix trees and arrays - (Fernndez-Baca ppt) Phylogenetic trees based on whole genomes	Gusfield, ch. 5,6,7	Testing multiple hypotheses Suffix trees
5	1/4 (no kidding)	Phylogenetic trees reconstruction: Distance methods Maximum parsimony	Durbin et. al.,7.1, 7.2, 7.4 Gusfield, ch. 17	M L and AML in phylogenetic trees
6	21/4	Maximum parsimony cont., including proof of hardness. Maximum likelihood (1) Maximum likelihood (2) - a status report	Durbin et. al., ch. 7 and 8 Gusfield, ch. 17	PSSMs Gene Finding
7	6/5	1) Introduction to discrete Markov Models 2) Hidden Markov models (HMM): the three problems 3) Baum's Forward-Backwards Algorithm (board presentation, following Rabiner's tutorial).	Rabiner's Tutorial Survey from Univ. of Leeds Durbin et al. , ch. 3	HMMs for gene finding
8	13/5	Viterbi's algorithm (board presentation, following Rabiner's tutorial).. Using EM to optimize HMM parameters (board presentation, following Rabiner's tutorial). Determining protein families by domains: PSSMs, HMMs, and the Pfam database (see ppt from Leuven)	Dan Geiger (Technion) EM slides Bilmes' gentle tutorial to the EM algorithm Durbin et al., ch. 3	EM (expectation maximization) and ML (maximum likelihood)
9	20/5	Random graphs basics: Introduction to biological networks	ppt by Bill Aiello et al.	No recitation: Student's day
10	27/5	Intro to DNA microarrays	DNA microarrays demo Some computational questions	No recitation: Shavuot vacation
11	3/6	More microarray analysis: Clustering; Diagnosis.	More computational questions René Bernards pdf on improved cancer diahnostics	Differential expression
12	10/6	Pluto's intro to machine learning. Differential gene expression and class discovery (based on a presentation of Dr. Zohar Yakhini) Linear separability of gene expression pairs (based on joint work with (Giora Unger)	Pluto's intro to machine learning (3.5GB file): Based on ch. 1 from Schlkopf and Smola's book
13	17/6	DNA sequencing Pizza, pasta, and beer	DNA Sequencing (Batzoglou ppt)

Prerequisites

The course is officially opened to 3rd year students of the bioinformatics track, to 3rd year and graduate Computer Science students. The algorithms course (formerly efficiency of computations) is a prerequisite, and the computability course (computational

models) should at least be taken concur8rently with this course. Students from other disciplines are welcome to take the course, but are encouraged to first contact the instructor or teaching assistant. Some background in algorithms, probability, and programming is required.(In the previous times this course was given by this instructor, a number of life science students completed it successfully).

Administratrivia

Following last years’ successful experience, in addition to a 3 hours weekly lecture, there will also be a 1 hour weekly recitation, given by the teaching assistant. Attending this recitation is absolutely optional, but is strongly encouraged.

BTW this is also the case regarding the lectures: If you can understand the material, answer the home assignments (on your own) and pass the exam in flying colors without attending classes, then surely windsurfing or boogy-boarding is a better use of your time.)

Grades in the course will be based on five-six problem sets (20-30% of total grade) and a final exam (70-80%). Intelligent participation in class may yield a small (unspecified, but surely non-negative) bonus.

Homework submission in singles or pairs (no triplets, quartets, quintets etc.). There will be a total of 3-5 problem sets, involving both "dry" assignments and "wet" ones. The wet parts will require understanding and running existing software, but not writing any code. Homework should be done and written independently. External sources (books, journal articles, web pages) can be used but should be clearly quoted. Research done at the Technion has shown a significant correlation between independent work and final performance.

Moed A from 2005/6 can be found here

Problem Sets

Assignment 1 Posted March 22nd. Due April 5th.

Assignment 2 Posted April 11th. Due May 1st (except wet problem, due later).

Assignment 3 Posted May 19th. Due June 7th.

Assignment 4

Books

Biological Sequence Analysis, R. Durbin et al. , Cambridge University Press, 1998.

Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, D. Gusfield, Cambridge University Press, 1997.

Post-genome Informatics, M. Kanehisa , Oxford University Press, 2000.

Introduction to Computational Molecular Biology, J. Meidanis and J. Setubal, Brooks/Cole Pub Co., 1997.

Computational Molecular Biology : An Algorithmic Approach, P. Pevzner, MIT Press, 2000.

Introduction to Bioinformatics Algorithims, Neil C. Jones, Pavel A. Pevzner, MIT Press, 2004.

Learning with Kernels: Support Vector Machines, Regularization, Optimization and Beyond, Bernhard Schlkopf and Alex Smola, MIT Press, 2002. Available on-line: Preface, Section 1.1, Section 1.2, and more.

Biology Background Books

The Cartoon Guide to Genetics, by Larry Gonick and Mark Wheelis, Harper Perennial, 1991 (I have not read this book, but I was told it is gives a good popular overview. Some students complained the book spends way too much time before getting down to relevant business).

Other On-line Courses

Dan Geiger course: Algorithms in Computational Biology , Technion, Spring 2008.

Lior Pachter course: Algebraic Statistics for Computational Biology, UC Berkeley, Fall 2004.
Ron Shamir course notes: Computational Genomics, Tel-Aviv University, Fall 2003/4

Nir Friedman's course slides: Computational Molecular Biology the Hebrew University, Jerusalem, Fall 1999.
Ruzzo and Tompa's course notes: Algorithms in Molecular Biology University of Washington, Seattle.

Other Interesting Resources (not directly course related)

Mother Jones: Alternative views on some world affairs

The Ig Nobel Prizes

Snails are faster than ADSL (see also here, and pay special attention to the apparatus)