Using codon frequencies

In the model described above, the probability of a codon occurrence depends on the preceding codon. We now consider a simpler model in which successive codons are independent. Let f_abc denote the frequency with which the codon abc occurs in a coding region. Given a coding sequence $a_1,b_1,c_1,a_2,b_2,c_2,\ldots ,a_{n+1},b_{n+1}$ with an unknown reading frame, the probability of observing the n codons of the first reading frame, the one starting with a₁b₁c₁, is

$$p_1=f_{a_1b_1c_1}\times f_{a_2b_2c_2}\times\ldots\times f_{a_nb_nc_n}$$

Similarly, the probabilities of observing the n codons in the second and third reading frames are:

$$\begin{aligned}
p_2&=f_{b_1c_1a_2}\times f_{b_2c_2a_3}\times\ldots\times f_{b_nc_na_{n+1}}\\
p_3&=f_{c_1a_2b_2}\times f_{c_2a_3b_3}\times\ldots\times f_{c_na_{n+1}b_{n+1}}
\end{aligned}$$
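The three products can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `frame_log_probs` and the dictionary `freq` (codon string to frequency) are hypothetical, every frequency is assumed to be strictly positive, and the products are accumulated as sums of logarithms because a product of many small frequencies underflows quickly.

```python
import math

def frame_log_probs(seq, freq, n):
    """Return [log p_1, log p_2, log p_3] for n codons in each reading frame.

    seq  : DNA string of length at least 3*n + 2
    freq : dict mapping each codon (e.g. "ATG") to its frequency f_abc
           (assumed strictly positive; a zero would make math.log fail)
    """
    if len(seq) < 3 * n + 2:
        raise ValueError("sequence too short for n codons in all three frames")
    logs = []
    for offset in range(3):                  # offsets 0, 1, 2 = frames 1, 2, 3
        total = 0.0
        for k in range(n):
            codon = seq[offset + 3 * k : offset + 3 * k + 3]
            total += math.log(freq[codon])   # log of a product = sum of logs
        logs.append(total)
    return logs
```

Working in log space changes nothing about the model; exponentiating each entry recovers p_1, p_2 and p_3 whenever they are large enough to be represented.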

Let P_i denote the probability of the ith reading frame being the coding reading frame (assuming the region is coding). If the three frames are taken to be equally likely a priori, Bayes' rule gives:

$$P_i=\frac{p_i}{p_1+p_2+p_3}$$
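As a small illustration, the normalization can be carried out directly on log-probabilities. The helper name `frame_posteriors` is hypothetical, and subtracting the maximum before exponentiating is a standard numerical precaution rather than part of the method itself; the common factor cancels in the ratio, so the result is still p_i / (p_1 + p_2 + p_3).

```python
import math

def frame_posteriors(log_p):
    """Convert [log p_1, log p_2, log p_3] into [P_1, P_2, P_3]."""
    m = max(log_p)                               # shift so the largest term is exp(0) = 1
    weights = [math.exp(x - m) for x in log_p]   # proportional to p_1, p_2, p_3
    total = sum(weights)
    return [w / total for w in weights]
```

For example, with p_1 = 0.2 and p_2 = p_3 = 0.1 the denominators sum to 0.4, so the three frames receive probabilities 0.5, 0.25 and 0.25.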

The above computation can be used in a search algorithm as follows: slide a window of n codons along the sequence and compute P_i at each start position of the window. The Codon Preference program, which is part of the GCG library, implements this method.
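A minimal sketch of such a sliding-window scan is given below. It is not the Codon Preference program itself; the function name `sliding_frame_posteriors` and the `freq` dictionary are hypothetical, all frequencies are assumed positive, and the window advances one base at a time. Because the frame numbering relative to the window rotates as the window moves, this sketch reports each probability under the absolute frame, that is, the codon start position modulo 3, which is a design choice made here for readability.

```python
import math

def sliding_frame_posteriors(seq, freq, n):
    """Yield (start, [P_0, P_1, P_2]) for every window of n codons.

    The list is indexed by absolute frame: entry j is the probability that
    codons start at sequence positions congruent to j modulo 3.
    """
    window_len = 3 * n + 2                       # bases needed for n codons in 3 frames
    for start in range(len(seq) - window_len + 1):
        log_p = []
        for offset in range(3):
            first = start + offset
            log_p.append(sum(
                math.log(freq[seq[first + 3 * k : first + 3 * k + 3]])
                for k in range(n)))
        m = max(log_p)                           # stabilise the exponentials
        w = [math.exp(x - m) for x in log_p]
        total = sum(w)
        post = [0.0, 0.0, 0.0]
        for offset in range(3):
            post[(start + offset) % 3] = w[offset] / total
        yield start, post
```

Recomputing every window from scratch keeps the sketch short; a production scan would more likely update the three sums incrementally as codons enter and leave the window.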

**Figure:** Results of codon preference program [1].

The figure above shows a plot of log(P/(1-P)), the log-odds that a frame is the coding frame, for each of the three reading frames. Each point represents the score for a 25-codon window around it. The actual genes are plotted as rectangles at the bottom. In the reading frame matching the upper plot, the genes are clearly recognized.
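The plotted score, read as log(P/(1-P)), can be computed from a frame probability as in the sketch below. The name `log_odds` and the clamping constant `eps` are assumptions introduced here so that probabilities of exactly 0 or 1 do not produce an infinite score; how the original program handles those cases is not stated in the text.

```python
import math

def log_odds(P, eps=1e-12):
    """Return log(P / (1 - P)), with P clamped to [eps, 1 - eps]."""
    P = min(max(P, eps), 1.0 - eps)   # avoid log(0) and division by zero
    return math.log(P / (1.0 - P))
```

The score is zero when P = 1/2, positive when the frame is more likely coding than not, and symmetric in the sense that the scores for P and 1 - P differ only in sign.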

**Figure:** Results of codon preference program - third-position bias [1].

The figure above shows the plot produced by a program that uses only the third-position bias information. These methods depend on the accuracy of the codon frequency statistics collected from already known genes. The algorithm will also have difficulty detecting genes acquired by horizontal gene transfer, and coping with other causes of heterogeneity, because such genes need not follow the codon usage of the rest of the genome.
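Since the method relies on codon frequency statistics from known genes, it may help to see how such a table could be estimated. The sketch below is one plausible approach, not a description of what any particular program does: the function name `estimate_codon_freqs` is hypothetical, the input genes are assumed to be given in frame, and the additive `pseudocount` is an assumption added so that codons absent from the training set (stop codons, for instance) do not receive frequency zero.

```python
from collections import Counter

def estimate_codon_freqs(genes, pseudocount=1.0):
    """Estimate f_abc from in-frame coding sequences with additive smoothing."""
    counts = Counter()
    for gene in genes:
        usable = len(gene) - len(gene) % 3        # ignore a trailing partial codon
        for k in range(0, usable, 3):
            counts[gene[k:k + 3]] += 1
    codons = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    # Only the 64 standard codons enter the total, so codons containing
    # ambiguous bases such as N are silently left out of the estimate.
    total = sum(counts[c] for c in codons) + pseudocount * len(codons)
    return {c: (counts[c] + pseudocount) / total for c in codons}
```

With a small training set the pseudocount dominates and the table is close to uniform, which is one concrete way in which inaccurate statistics weaken the frame signal.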

Peer Itsik
2000-12-25