It is known that, due to biochemical considerations, CpG (the
pair of nucleotides C and G appearing successively, in this
order, along one DNA strand) is relatively rare in DNA sequences.
The exceptions are particular sub-sequences, several hundred
nucleotides long, in which the CpG pair is more frequent. These
sub-sequences, called CpG islands, are known to appear in
the biologically more significant parts of the genome. The ability
to identify CpG islands in the DNA will therefore help us spot the
more significant regions of interest along the genome.
Problem 6.1
Identifying a CpG island.
INPUT: A short DNA sequence $X = (x_1, \ldots, x_L) \in \Sigma^*$
(where $\Sigma = \{A, C, G, T\}$).
QUESTION: Decide whether X is a CpG island.
We can approach such problems using a Markov chain model.
Let us denote, for each pair of nucleotides $s, t \in \{A, C, G, T\}$,
the transition probability:
$$a_{st} = P(x_i = t \mid x_{i-1} = s) \tag{6.1}$$
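We assume that $X = (x_1, \ldots, x_L)$ is a random process with a
memory of length 1, i.e., the value of the random variable $x_i$
depends only on its predecessor $x_{i-1}$. Formally we can write:
$$P(x_i \mid x_{i-1}, \ldots, x_1) = P(x_i \mid x_{i-1}) = a_{x_{i-1} x_i} \tag{6.2}$$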
The probability of the whole sequence $X = (x_1, \ldots, x_L)$ will
therefore be:
$$P(X) = P(x_L, \ldots, x_1) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} x_i} \tag{6.3}$$
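We can also add fictitious begin and end symbols, $x_0$ and
$x_{L+1}$, to simplify the formula, setting $a_{x_0 s} = P(s)$,
where $P(s)$ is the background probability of the symbol $s$.
Hence:
$$P(X) = \prod_{i=1}^{L} a_{x_{i-1} x_i} \tag{6.4}$$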
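The product of one background term and one transition probability per adjacent pair can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `sequence_probability` and the uniform `background` and `transitions` parameters are hypothetical and chosen only so that the result is easy to check by hand.

```python
ALPHABET = "ACGT"

def sequence_probability(seq, background, transitions):
    """Probability of seq under a first-order Markov chain.

    background[s]     : probability that the first symbol is s (begin step)
    transitions[s][t] : probability that t follows s
    """
    if not seq:
        return 1.0
    # Begin step: the first symbol is drawn from the background distribution.
    p = background[seq[0]]
    # One transition probability for each adjacent pair (x_{i-1}, x_i).
    for prev, cur in zip(seq, seq[1:]):
        p *= transitions[prev][cur]
    return p

# Hypothetical parameters for illustration only: everything is uniform.
uniform_background = {s: 0.25 for s in ALPHABET}
uniform_transitions = {s: {t: 0.25 for t in ALPHABET} for s in ALPHABET}

print(sequence_probability("ACGT", uniform_background, uniform_transitions))
# 0.00390625, i.e. 0.25 ** 4
```

For long sequences the product underflows quickly, which is one practical reason to work with sums of logarithms instead.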
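Let $a^+_{st}$ denote the transition probability from $s$ to $t$
inside a CpG island and let $a^-_{st}$ denote the
transition probability outside a CpG island (see Table
6.1 for the values of these
probabilities, taken from [4]). We can give a logarithmic
likelihood score for the sequence $X = (x_1, \ldots, x_L)$:
$$\mathrm{Score}(X) = \log \frac{P(X \mid \text{CpG island})}{P(X \mid \text{non CpG island})} = \sum_{i=1}^{L} \log \frac{a^+_{x_{i-1} x_i}}{a^-_{x_{i-1} x_i}} \tag{6.5}$$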
The higher this score, the more likely it is that X is a CpG
island.
Table 6.1: Transition probabilities inside (+) and outside (-) a CpG island.
Each row is the preceding nucleotide s, each column the following nucleotide t.

  +     A      C      G      T     |   -     A      C      G      T
  A   0.180  0.274  0.426  0.120   |   A   0.300  0.205  0.285  0.210
  C   0.171  0.368  0.274  0.188   |   C   0.322  0.298  0.078  0.302
  G   0.161  0.339  0.375  0.125   |   G   0.248  0.246  0.298  0.208
  T   0.079  0.355  0.384  0.182   |   T   0.177  0.239  0.292  0.292
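With the values of Table 6.1, the log-likelihood score becomes a sum of per-transition log-odds. The sketch below makes two assumptions that the text does not state: logarithms are taken in base 2 (so the score is in bits), and the begin term contributes nothing, i.e. the first nucleotide is taken to be equally likely under both models. The names `PLUS`, `MINUS`, `LOG_ODDS` and `score` are illustrative.

```python
from math import log2

NUCLEOTIDES = "ACGT"

# Table 6.1: row = preceding nucleotide s, columns in the order A, C, G, T.
PLUS = {
    "A": (0.180, 0.274, 0.426, 0.120),
    "C": (0.171, 0.368, 0.274, 0.188),
    "G": (0.161, 0.339, 0.375, 0.125),
    "T": (0.079, 0.355, 0.384, 0.182),
}
MINUS = {
    "A": (0.300, 0.205, 0.285, 0.210),
    "C": (0.322, 0.298, 0.078, 0.302),
    "G": (0.248, 0.246, 0.298, 0.208),
    "T": (0.177, 0.239, 0.292, 0.292),
}

# log2(a+_st / a-_st) for every ordered pair (s, t).
LOG_ODDS = {
    (s, t): log2(PLUS[s][j] / MINUS[s][j])
    for s in NUCLEOTIDES
    for j, t in enumerate(NUCLEOTIDES)
}

def score(seq):
    """Sum of per-transition log-odds (assumption: the begin term is 0)."""
    return sum(LOG_ODDS[pair] for pair in zip(seq, seq[1:]))

# A CG-rich string scores positive, an AT-rich one negative.
print(round(score("CGCGCG"), 3))
print(round(score("ATATAT"), 3))
```

The largest single entry is the C to G transition, log2(0.274 / 0.078), which is what drives CG-rich sequences towards positive scores.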
Problem 6.2
Locating CpG islands in a DNA sequence.
INPUT: A long DNA sequence $X = (x_1, \ldots, x_L) \in \{A, C, G, T\}^*$.
QUESTION: Locate the CpG islands within X.
A naive approach to solving this problem is to slide a window
$X_k = (x_{k+1}, \ldots, x_{k+l})$ of a given length $l$ along the
sequence, where $l$ is usually several hundred nucleotides and much
smaller than the sequence length $L$ ($l \ll L$), for every offset
$0 \le k \le L - l$, and to calculate $\mathrm{Score}(X_k)$
for each one of the resulting sub-sequences. Sub-sequences that
receive positive scores are potential CpG islands.
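The window scan can be sketched as follows. Recomputing every window from scratch costs on the order of L times l operations, so this sketch uses prefix sums over the per-transition log-odds, which is an implementation choice and not something the text prescribes. The function name `window_scores` and the `toy` log-odds values are hypothetical (chosen for readability, not taken from Table 6.1), and the begin term of each window is assumed to contribute nothing.

```python
from itertools import accumulate

def window_scores(seq, window_len, log_odds):
    """Return the score of every window of window_len nucleotides.

    log_odds maps an ordered pair (s, t) to its log-odds contribution.
    Entry k of the result is the score of seq[k : k + window_len].
    """
    if window_len < 1 or window_len > len(seq):
        return []
    # steps[i] is the contribution of the transition seq[i] -> seq[i + 1].
    steps = [log_odds[pair] for pair in zip(seq, seq[1:])]
    # prefix[i] is the sum of steps[0 .. i - 1].
    prefix = [0.0, *accumulate(steps)]
    # A window of l nucleotides contains l - 1 transitions.
    n_trans = window_len - 1
    return [
        prefix[k + n_trans] - prefix[k]
        for k in range(len(seq) - window_len + 1)
    ]

# Hypothetical toy log-odds: CG is rewarded, every other pair is penalised.
toy = {
    (s, t): (1.0 if (s, t) == ("C", "G") else -0.25)
    for s in "ACGT"
    for t in "ACGT"
}

scores = window_scores("ATATCGCGCGATAT", 4, toy)
print([k for k, sc in enumerate(scores) if sc > 0])
# [2, 3, 4, 5, 6, 7, 8]
```

Each reported offset marks a window that overlaps the CG-rich stretch; overlapping positive windows would typically be merged into a single candidate region.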
The main disadvantage of this algorithm is that we have no
information about the lengths of the islands, while the algorithm
suggested above assumes that those islands are at least as long as
the window, i.e., $l$ nucleotides. Should we use a window length
$l$ which is too large, the CpG islands would be short sub-strings
of our windows, and the score we give those windows may not be high
enough. A better approach to such problems is described in the
following section.
Itshack Pe`er 1999-01-24