Tel Aviv University Department of Computer Science

Fall 2001-02
Algorithms in Molecular Biology 0368.4020.01
Ron Shamir

Class bulletin-board

Sun, 24 March 2002, Roded: The checked Ex4 is on my desk

Thu, 7 March 2002, Roded: The checked Ex5 is on my desk

Thu, 14 Feb 2002, Roded: Ex5 Q2 fixed

Fixed a typo in Q2. Updated Ex5 and lec09 scribe accordingly.

Sun, 3 Feb 2002, Roded: Deadline of Ex5 extended to 24.2

Wed, 30 Jan 2002, Roded

5: A safe reversal does not form a new unoriented component.
6: r=1,2
8: These two parts are the requirements on the ordering; they are closely connected.

From: Gad Kimmel

Subject: Ex5

5. (b) - What is the definition of "safe" in this question? (I found several different definitions in the literature.)
6. It is given that the "(k-1) mer in position i_r equals that in position j_r". I didn't understand the exact definition of i_r and j_r (what is _r?).
8. (b) We need to find an algorithm that maximizes Sigma S(Vi, Vi+1). Each of the leaves has a vector. What is the connection to the fact that the leaves are part of a tree? (I assumed (a) and (b) are different questions, but both relate to a tree.)

Sun, 20 Jan 2002, Roded - Ex4 clarifications + new deadline

New deadline: 6.2.02.
Some more clarifications:
1. Bonus credit will be given for applying some heuristic that adaptively updates the length of the HMM during the learning process. Note that initializing the length according to some multiple alignment is possible only when you run the program manually, and not in its 'automatic' mode (i.e., when I run it).
2. The sub-classification should be done automatically, without any prior biological knowledge. This 'miracle' happens because, if a family contains several sub-families, modeling those sub-families increases the likelihood of the data. EM, which tries to maximize that likelihood, will therefore detect them if the signal is strong enough.

Tue, 15 Jan 2002, Roded

Q1: Profile alignment means using the HMM to build the alignment, so there is no BLAST involved. In general, everything should be done by the program, except of course the literature search.
Q2: Yes. For this exercise Linux counts as Unix.

From: Yevgeny Shrayber

Subject: Ex4

1. Which parts of 1(d) and 1(e) should be performed by the program we write, and which parts can be done manually, i.e. using multiple-alignment engines on the Internet, BLAST, PubMed, and so on?
2. About the OS platform: can Linux be used instead?

Sun, 13 Jan 2002, Roded - Ex4 Update

I updated Ex4. It can be retrieved from the web. Main changes: 1) I clarified the sub-family classification. 2) I clarified the input/output required from the program. Note: the program should compile on UNIX!

Mon, 7 Jan 2002, Roded - Ex4

Some clarifications regarding the exercise, in response to questions I got:
1. A good reference for building the required HMM is Durbin's book, chapter 6.
2. The data can be collected in two ways: starting from some sequence and searching for its homologs, or starting from a set of sequences appearing in a paper and searching for their homologs. The number of collected toxins should be around 100. The number of globins should be over 400.

Mon, 31 Dec 2001, Roded - clustalw

For some reason, the EBI server doesn't accept clustalw forms. Please use http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html.

Sun, 30 Dec 2001, Roded

Q1c: The idea is to train on the whole dataset and let the HMM find the sub-classification on its own.
Q1d: You have to search PubMed.

From: Gad Kimmel

Subject: Ex4

I have 2 questions regarding the programming exercise:
Question 1, c - What is the meaning of dividing into k sub-families? Does it mean dividing them randomly into k sub-families and training each family separately?
Question 1, d - Where can I find the relevant multiple alignments in the literature?

Mon, 19 Dec 2001, Roded

The meaning is as an ordered set of t^2 values.

From: Inol Axel

Subject: Ex2

In Ex. 2, Q8 contains the sentence "The value of R is it's t^2 entries". Does this mean the sum of these entries? The entries as a set (so that entries with the same value are counted once and position is ignored)? Or the entries as an ordered group?

Mon, 19 Dec 2001, Roded

Subject: Scribes

To all scribe writers: In order for \scribebased to be effective, \scribeby must come after it.

Mon, 17 Dec 2001, Roded

In response to a question - item 8a is correct.

Mon, 17 Dec 2001, Roded

Indeed, the interface has changed. Play with the 'Expect' field in order to catch all perfect matches.

From: Nelly

Subject: Ex2

I have a question about Ex. 2, Q10. It says there: "Check the box for Perform ungapped alignment". I didn't find this option on the blastn site.

Wed, 12 Dec 2001, Amos

Re Ex2, question 1.1: We are interested in a non-trivial match (one that avoids complete overlap).

Thu, 22 Nov 2001, Book containing scribes of last course is available at Safrut Zola

Tue, 13 Nov 2001, Roded

Q1: The start codon is AUG (the input is mRNA). By definition, in a reading frame each triplet is treated as a codon (there are no other nucleotides). You may not assume that a start codon comes right after a stop codon (in an operon).
Q2: Left as an ex.
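
To illustrate the reading-frame convention in Q1, here is a minimal Python sketch. It is not part of the exercise, and the function names are made up for this note. It assumes the input is an mRNA string over A, C, G, U, that AUG is the only start codon, and that the stop codons are the standard UAA, UAG and UGA. It cuts one frame into consecutive triplets and reports each start-to-stop stretch, without assuming that a start codon directly follows a stop codon.

```python
# Assumption: standard stop codons and a single start codon, on mRNA.
START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}


def codons(mrna, frame=0):
    """Split mrna into consecutive triplets, starting at offset `frame` (0, 1 or 2).

    Trailing nucleotides that do not fill a whole triplet are dropped.
    """
    return [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]


def find_genes(mrna, frame=0):
    """Return (start, end) index pairs of start-to-stop stretches in one frame.

    `end` is exclusive and includes the stop codon. A new gene can begin only
    after the previous one has ended, so genes in the same frame share no codon.
    """
    genes = []
    start = None  # index of the currently open start codon, if any
    for k, codon in enumerate(codons(mrna, frame)):
        pos = frame + 3 * k
        if start is None and codon == START_CODON:
            start = pos
        elif start is not None and codon in STOP_CODONS:
            genes.append((start, pos + 3))
            start = None
    return genes


# Hypothetical input: two genes separated by one non-coding codon (CCC).
print(find_genes("AUGAAAUGACCCAUGUAA"))  # [(0, 9), (12, 18)]
```

Only the forward strand is handled here; the other strand is deliberately not covered.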

From: Nelly

Subject: Ex1

1. This question refers to one reading frame (one of six). Can I make these assumptions?
1.1. The start codon is "ATG".
1.2. A prokaryotic gene includes only triplets of coding nucleotides and no other (single or double) nucleotides. For example, is this gene in an operon mRNA molecule legal: "[ATG]A[CAT][UGA]" (where "ATG" is the start codon, "UGA" is the stop codon, "CAT" is a codon included in the gene and A is a single nucleotide that does not code)?
1.3. Can I assume that a new gene starts after each stop codon, i.e. that the next triplet of nucleotides is the start codon of the next gene? For example, is this sequence legal or not: "...[UGA][CAT][ATG]..." (where "UGA" is the stop codon of one gene and "ATG" is the start codon of the next gene)? Note: I know that prokaryotes don't have introns and exons, so a gene has to include only coding codons without "junk" (so the start of the new gene must appear right after the stop codon, and nothing can appear between them. Is that right?). I read in a book that it is not always this way.
2. Now a question about the downstream of the DNA: If the start codon is "ATG", what is the start codon in the second strand of the DNA? Is it "TAC" (the complement of "ATG"), so that two genes start simultaneously? Or maybe it is "ATG" too? Or maybe it is the reverse complement of "ATG", "CAT"? The same questions apply to stop codons.

Sun, 11 Nov 2001, Roded

Submission of Ex1: delayed until the strike ends. You can submit either at the lecture or directly to me (my mailbox is on floor 1).
Q2: Genes are recognized (here) only by their start and end codons. No two genes in the same reading frame can share any codon (meaning they do not overlap).
Q3: The linear algorithm returns yes/no.
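
For reference on Q3, a yes/no subsequence test of the kind referred to can be sketched in Python as follows. This is only an illustration under the assumption that "subsequence" means the characters of S appear in T in order, not necessarily contiguously; the function name is made up for this note. Note that it returns only True or False, and no position in T.

```python
def is_subsequence(s, t):
    """Return True if s is a (not necessarily contiguous) subsequence of t.

    A single left-to-right pass over t, so the running time is linear in len(t).
    """
    i = 0  # number of characters of s matched so far
    for ch in t:
        if i < len(s) and ch == s[i]:
            i += 1
    return i == len(s)


print(is_subsequence("ACG", "AACCGG"))  # True
print(is_subsequence("GCA", "AACCGG"))  # False
```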

From: Yevgeny Shrayber

Subject: Ex1

Hi, I have some questions regarding exercise 1.

First, administrative matters: Do we submit the exercise at the lecture? If the strike continues and there is no lecture, do we submit it at the next lecture? Can you please add a link on the course webpage that will tell us, at least 12 hours before the lecture, whether you will give a lecture or not? Speaking of the webpage, I didn't manage to download even one file from all the links in the table. Can you do something about it?

Second, I have a problem understanding question 2 (probably I missed something in the biology introduction): Can I recognize the start and end of genes only by "start" and "stop" substrings (in the algorithm)? Does each gene always start with a "start" substring and stop with a "stop" substring? When genes overlap, does it mean that I first see a "start" of the first gene, then a "start" of a second one, and only then the "stops" follow in any order? May one "start" start both overlapping genes? May one "stop" stop more than one gene? When you say "two or more genes may overlap, but not in the same reading frame", do you mean that if two genes overlap, then they must start and stop in two different frames?

In question 3: Can I assume that the linear algorithm I have (which determines whether S is a subsequence of T) returns an index in T where the similarity starts?

Sorry, I know it's a lot of questions, but there is no other way but to bother you.
Thanks, Yevgeny.

For any comments about this page, please contact Roded Sharan or Ron Shamir