Different functional units of a gene have vastly different lengths.
For example, an average internal exon is about 150 bp long, while
introns on the order of 1 kbp are not uncommon. Thus, in our
probabilistic model of gene structure, different states need
different length distributions.
Intron lengths are known to vary dramatically with the C+G content
category. For example, the mean intron length for category I
(< 43% C+G) of the training set is 2069 bp, as opposed to only
518 bp for category IV (> 57% C+G) (see the corresponding figure).
Thus, the program uses a separate length distribution for the
intron states in each category.
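The category lookup described above can be sketched as follows. This is a minimal illustration, not the program's actual code: the function name `cg_category` and the table `MEAN_INTRON_LENGTH_BP` are hypothetical, and the boundary between categories II and III (`mid_boundary`, set to 51% here) is an assumption, since the text only states the cut-offs for categories I and IV.

```python
# Mean intron lengths (bp) quoted in the text. Values for categories II and
# III are not given there, so they are omitted rather than guessed.
MEAN_INTRON_LENGTH_BP = {"I": 2069, "IV": 518}


def cg_category(sequence, mid_boundary=51.0):
    """Return the C+G category ("I" to "IV") of a DNA string.

    The 43% and 57% cut-offs come from the text; mid_boundary, which
    separates categories II and III, is an assumed value.
    """
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    cg_percent = 100.0 * (seq.count("C") + seq.count("G")) / len(seq)
    if cg_percent < 43.0:
        return "I"
    if cg_percent < mid_boundary:
        return "II"
    if cg_percent <= 57.0:
        return "III"
    return "IV"
```

A sequence would first be assigned a category in this way, and the intron length distribution for that category would then be used for all of its intron states.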
The training set shows quite different length distributions for
initial exons, internal exons and terminal exons. Consequently,
a separate distribution is used for each of them. It is important
to note that the length of an internal exon has to be consistent
with the phases of its adjacent introns. For example, if the
preceding state is I2 and the succeeding state is I1, then the
generated internal exon length (for state E2 in this case) must
be 3n+2 for some n. The value of n is therefore drawn at random
according to the length distribution, and a string of length
3n+2 is then generated according to the string generating model
for that state.
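The phase constraint can be illustrated with a short sketch. It assumes that an intron of phase k has k bases of the split codon upstream of it, so that an exon between introns of phases i and j has a length congruent to (j - i) modulo 3; this generalization is an assumption chosen to agree with the I2/I1 example above. The function names and the callable `sample_n` (a stand-in for the learned length distribution) are hypothetical.

```python
import random


def required_remainder(prev_intron_phase, next_intron_phase):
    """Remainder modulo 3 that the internal exon length must have.

    Assumes an intron of phase k is preceded by k bases of the split codon,
    so the exon contributes (3 - prev) bases to finish the first codon and
    next bases to start the last one.
    """
    return (next_intron_phase - prev_intron_phase) % 3


def sample_internal_exon_length(prev_intron_phase, next_intron_phase,
                                sample_n, rng=None):
    """Draw n from the supplied distribution and return 3n + remainder."""
    rng = rng or random.Random()
    n = sample_n(rng)  # n comes from the (hypothetical) length distribution
    return 3 * n + required_remainder(prev_intron_phase, next_intron_phase)


if __name__ == "__main__":
    # Toy stand-in distribution: n uniform on 10..90, purely for illustration.
    toy_n = lambda rng: rng.randint(10, 90)
    length = sample_internal_exon_length(2, 1, toy_n, random.Random(0))
    print(length, length % 3)
```

Sampling n first and deriving the length from it guarantees that every generated exon keeps the reading frame, without any need to reject inconsistent lengths.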
For the 5' UTR and 3' UTR states, geometric distributions with
mean values of 769 bp and 457 bp, respectively, are used.
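As an illustration of the geometric case, the sketch below draws a UTR length by inverse-transform sampling. It assumes the geometric distribution is defined on lengths 1, 2, 3, ..., so that a mean of m corresponds to a success probability p = 1/m; the text does not say which parameterization the program uses, and the function name `sample_geometric_length` is hypothetical.

```python
import math
import random


def sample_geometric_length(mean_bp, rng=None):
    """Draw a length >= 1 from a geometric distribution with the given mean.

    Assumes support {1, 2, ...}, so that p = 1 / mean_bp and
    P(L = k) = (1 - p) ** (k - 1) * p.
    """
    if mean_bp <= 1:
        raise ValueError("mean_bp must be greater than 1")
    rng = rng or random.Random()
    p = 1.0 / mean_bp
    u = 1.0 - rng.random()  # uniform on (0, 1], avoids log(0)
    # Inverse transform: smallest k with 1 - (1 - p) ** k >= 1 - u.
    k = math.ceil(math.log(u) / math.log(1.0 - p))
    return max(1, k)


if __name__ == "__main__":
    rng = random.Random(0)
    five_prime_utr = sample_geometric_length(769, rng)  # mean from the text
    three_prime_utr = sample_geometric_length(457, rng)  # mean from the text
    print(five_prime_utr, three_prime_utr)
```

A geometric length distribution is what results when a state returns to itself with a fixed probability at every base, which is why it needs only one parameter, the mean.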
Peer Itsik
2000-12-25