Tel Aviv University Department of Computer Science
Fall 2001-02
Algorithms in Molecular Biology 0368.4020.01
Ron Shamir
Class bulletin-board
Sun, 24 March 2002, Roded: checked ex4 on my desk
Thu, 7 March 2002, Roded: checked ex5 on my desk
Thu, 14 Feb 2002, Roded: Ex5 Q2 fixed
Fixed a typo in Q2. Updated Ex5 and lec09 scribe accordingly.
Sun, 3 Feb 2002, Roded: Deadline of Ex5 extended to 24.2
Wed, 30 Jan 2002, Roded
5: a safe reversal does not form a new unoriented component.
6: r=1,2
8: these two parts are the requirements from the ordering. They are
very connected.
From: Gad Kimmel
Subject: Ex5
5. (b) - what is the definition of "safe" in this question ? (I found some
different definitions in the literature)
6. It is given that the "(k-1) mer in position i_r equals that in position
j_r", - I didn't understand what is the exact definition of i_r and j_r
(what is _r ?).
8. (b) We need to find an algorithm that maximizes Sigma S(Vi, Vi+1). Each
of the leaves has a vector. What is the connection to the fact that the
leaves are part of a tree? (I assumed (a) and (b) are different
questions, but both relate to a tree).
Sun, 20 Jan 2002, Roded - Ex4 clarifications + new deadline
New deadline: 6.2.02.
Some more clarifications:
1. Bonus credit for applying some heuristic for adaptively
updating the length of the HMM during the learning process.
Note that initializing the length according to some multiple alignment
is possible only when you run the program manually, and not
in its 'automatic' mode (i.e., when I run it).
2. The sub-classification should be done automatically without any
prior biological knowledge. This 'miracle' happens since if a family
contains several sub-families then modeling those sub-families
will increase the likelihood of the data, and therefore
the EM will detect them when trying to maximize the likelihood of the data
if the signal is strong enough.
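To illustrate point 2, here is a minimal sketch, not the required program. It makes several assumptions that are not part of Ex4: sequences have equal length, each sub-family is modeled by a simple position-specific profile instead of a full profile HMM, and the function name fit_profile_mixture and the toy sequences are made up for the example. It only shows that EM, started from random soft assignments and using no labels, can reach a higher likelihood with two components than with one when the data really contain two groups.

```python
import math
import random

def fit_profile_mixture(seqs, k, alphabet="ACGT", iters=30, seed=0, pseudo=0.1):
    """EM for a mixture of k position-specific profiles (equal-length sequences).

    Returns (log_likelihood, responsibilities).
    """
    rng = random.Random(seed)
    n, length = len(seqs), len(seqs[0])
    # Start from random soft assignments; no biological labels are used.
    resp = []
    for _ in range(n):
        row = [rng.random() + 1e-3 for _ in range(k)]
        total = sum(row)
        resp.append([r / total for r in row])
    loglik = float("-inf")
    for _ in range(iters):
        # M-step: re-estimate mixture weights and per-position emission tables.
        weights = [max(sum(r[c] for r in resp) / n, 1e-12) for c in range(k)]
        profiles = []
        for c in range(k):
            profile = []
            for pos in range(length):
                counts = {a: pseudo for a in alphabet}
                for r, s in zip(resp, seqs):
                    counts[s[pos]] += r[c]
                total = sum(counts.values())
                profile.append({a: counts[a] / total for a in alphabet})
            profiles.append(profile)
        # E-step: recompute responsibilities and the data log-likelihood.
        loglik = 0.0
        for i, s in enumerate(seqs):
            logs = [
                math.log(weights[c])
                + sum(math.log(profiles[c][pos][ch]) for pos, ch in enumerate(s))
                for c in range(k)
            ]
            top = max(logs)
            norm = top + math.log(sum(math.exp(v - top) for v in logs))
            loglik += norm
            resp[i] = [math.exp(v - norm) for v in logs]
    return loglik, resp

# Toy data (made up): two "sub-families" of length-8 sequences.
toy = ["AAAAAAAA", "AAAAAAAC", "AAAAACAA", "AAAAAAAA",
       "GGGGGGGG", "GGGGGGTG", "GGTGGGGG", "GGGGGGGG"]
ll_one, _ = fit_profile_mixture(toy, k=1)
ll_two, resp_two = fit_profile_mixture(toy, k=2)
print(ll_one, ll_two)
```

On real families the signal may be much weaker, and the result can depend on the random start.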
Tue, 15 Jan 2002, Roded
Q1: Profile alignment means using the HMM for building the alignment,
so there is no BLAST involved. In general, everything should be done by the
program, other than the literature search, of course.
Q2: Linux is the same as Unix.
From: Yevgeny Shrayber
Subject: Ex4
Which part of 1(d),1(e) should be performed by the program we write, and
what part can be done manually - i.e. using multiple-alignment
engines on the Internet, BLAST, PubMed, and so on?
2. About the OS platform... can Linux be used instead?
Sun, 13 Jan 2002, Roded - Ex4 Update
I updated Ex4. It can be retrieved from the web.
Main changes: 1) I clarified the sub-family classification.
2) I clarified the input/output required from the program.
Note, the program should compile on UNIX!!!.
Mon, 7 Jan 2002, Roded - Ex4
Some clarifications regarding the ex. in response to questions I got:
1. A good reference for building the required HMM is Durbin's book, chapter 6.
2. The data can be collected in two ways: starting from some sequence,
search for its homologs; or starting from a set of sequences appearing in
a paper, search for their homologs. The number of collected toxins should
be around 100. The number of globins - over 400.
Mon, 31 Dec 2001, Roded - clustalw
For some reason, the EBI server doesn't accept clustalw forms.
Please use http://bioweb.pasteur.fr/seqanal/interfaces/clustalw.html.
Sun, 30 Dec 2001, Roded
Q1c: The idea is to train on the whole dataset and let the HMM find the
sub-classification on its own.
Q1d: You have to search pubmed.
From: Gad Kimmel
Subject: Ex4
I have 2 questions regarding the program exercise:
Question 1, c - What is the meaning of dividing to k sub-families.
Does it mean - to divide them randomly to k sub-families and train each
family separately?
Question 1, d - Where can I find the relevant multiple alignments in the
literature?
Mon, 19 Dec 2001, Roded
The meaning is as an ordered set of t^2 values.
From: Inol Axel
Subject: Ex2
In ex. 2 Q8 appears the sentence "The value of R is it's t^2 entries".
Does that mean the sum of these entries? The entries as a set
(so that entries with the same value are counted once and position is ignored)?
Or as an ordered group?
Mon, 19 Dec 2001, Roded
Subject: Scribes
To all scribe writers: In order for \scribebased to be effective,
\scribeby must come after it.
Mon, 17 Dec 2001, Roded
In response to a question - item 8a is correct.
Mon, 17 Dec 2001, Roded
Indeed, the interface has changed.
Play with the 'Expect' field in order to catch all perfect matches.
From: Nelly
Subject: Ex2
I have a question about ex.2 q.10:
It is written there: Check the box for Perform ungapped alignment. I
didn't find this in blastn site.
Wed, 12 Dec 2001, Amos
Re Ex2 question 1.1 : We are interested in a non-trivial match (one
that avoids complete overlap).
Thu, 22 Nov 2001, Book containing scribes of last course is available at
Safrut Zola
Tue, 13 Nov 2001, Roded
Q1: Start codon is AUG (the input is mRNA). By definition in a reading
frame each triple is treated as a codon (no other nucs). You may not assume
that a start codon comes right after a stop codon (in an operon).
Q2: Left as an ex.
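The reading-frame convention in the answer to Q1 can be sketched as follows. This is only an illustration under stated assumptions: the function names codons and genes_in_frame are made up, the three standard stop codons UAA, UAG and UGA are used, and a gene is taken to run from the first AUG seen to the next in-frame stop codon. It handles a single frame of the given mRNA and is not meant as a solution to the exercise.

```python
START = "AUG"
STOPS = {"UAA", "UAG", "UGA"}

def codons(mrna, frame=0):
    """Split mrna into consecutive triplets starting at offset `frame`.

    A trailing partial triplet, if any, is dropped.
    """
    return [mrna[i:i + 3] for i in range(frame, len(mrna) - 2, 3)]

def genes_in_frame(mrna, frame=0):
    """Return (start, end) nucleotide indices (end exclusive) of genes in one frame."""
    genes = []
    start = None
    for k, codon in enumerate(codons(mrna, frame)):
        pos = frame + 3 * k
        if start is None and codon == START:
            start = pos  # open a gene at the first start codon seen
        elif start is not None and codon in STOPS:
            genes.append((start, pos + 3))  # close it at the next stop codon
            start = None
    return genes

# The codon CAU sits between the first stop codon and the next start codon,
# so the second gene does not begin right after the first one ends.
example = "AUGCAUUGACAUAUGAAAUAA"
print(codons(example))
print(genes_in_frame(example))
```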
From: Nelly
Subject: Ex1
1. This question refers to one reading frame (one of six). Can I make these
assumptions?
1.1. The start codon is "ATG".
1.2. A prokaryotic gene includes only triplets of coding nucleotides and no
other (single or double) nucleotides. For example, is this gene in an operon
mRNA molecule legal: "[ATG]A[CAT][UGA]" (where "ATG" is the start codon,
"UGA" is the stop codon, "CAT" is a codon included in the gene and A is a
single non-coding nucleotide)?
1.3. Can I assume that a new gene starts after each stop codon, i.e. that the
next triplet of nucleotides is the start codon of the next gene? For example,
is this sequence legal or not: "...[UGA][CAT][ATG]..." (where "UGA" is the
stop codon of one gene and "ATG" is the start codon of the next gene)?
Note: I know that prokaryotes don't have introns and exons, so the sequence
has to include only coding codons without "junk" (so after the stop codon of
a gene the start of the next gene must appear, and nothing can appear between
them. Is that right?). I read in a book that it is not always this way.
2. Now a question about the downstream of the DNA: if the start codon is
"ATG", what is the start codon on the second strand of the DNA? Is it "TAC"
(the complement of "ATG"), so that two genes start simultaneously? Or maybe
it is "ATG" too? Or maybe it is the reverse complement of "ATG", "CAT"? The
same questions apply to stop codons...
Sun, 11 Nov 2001, Roded
Submission of Ex1: delayed till the strike ends. You can submit either at the
lecture or directly to me (my mailbox is on floor 1).
Q2: Genes are recognized (here) only by their start and end codons.
No two genes in the same reading frame can share any codon (meaning they
do not overlap).
Q3: The linear alg. returns yes/no.
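For Q3, the kind of linear-time routine that may be assumed can be sketched like this. The sketch assumes that "subsequence" means the characters of S appear in T in order but not necessarily contiguously, and the name is_subsequence is made up for the example. As stated in the answer, it returns only yes/no, not a position in T.

```python
def is_subsequence(s, t):
    """Return True if s can be obtained from t by deleting characters.

    Runs in O(len(t)) time with a single left-to-right scan of t.
    """
    i = 0  # number of characters of s matched so far
    for ch in t:
        if i < len(s) and ch == s[i]:
            i += 1
    return i == len(s)

print(is_subsequence("ACG", "AACCGG"))
print(is_subsequence("GA", "AACCGG"))
```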
From: Yevgeny Shrayber
Subject: Ex1
Hi, I have some questions regarding exercise 1.
First, administrative information: do we submit the exercise at the lecture?
If the strike continues and there is no lecture, do we submit it at the next
lecture? Can you please add a link on the course webpage that will inform us
at least 12 hours before the lecture whether you will give a lecture or not?
Speaking of the webpage, I didn't succeed in downloading even one file among
all the links in the table... can you do something about it?
Second, I have an understanding problem in question 2 (probably I didn't get
something in the biology introduction): can I recognize the start and end of
genes only by "start" and "stop" substrings (in the algorithm)? Does each gene
always start with a "start" substring and stop with a "stop" substring? When
genes overlap, does it mean that first I see a "start" of the first gene, then
a "start" of a second one, and only then "stops" will follow in any order?
May one "start" start both overlapping genes? May one "stop" stop more
than one gene? When you say "two or more genes may overlap, but
not in the same reading frame", do you mean that if two genes overlap,
then they must start and stop in two different frames?
In question 3: can I assume that the linear algorithm I have (which determines
whether S is a subsequence of T) returns an index in T where the similarity
starts?
Sorry, I know it's a lot of questions, but as there is no other way but
to bother you... Thanks, Yevgeny.
For any comments about this page, please contact Roded Sharan or Ron Shamir.