The polynomial solution

Fortunately, this problem has a polynomial-time solution, due to Pevzner [4]. Define another directed graph G=(V,E), whose vertices are this time (k-1)-mers. An edge e=(v₁,v₂) is introduced if the last k-2 characters of v₁ match the first k-2 characters of v₂, and the concatenation of the first character of v₁, the k-2 common characters and the last character of v₂ forms a k-mer that was reported present in the sequence. This graph is called the de Bruijn graph of the sequence.
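The construction can be sketched in a few lines of Python. This is only an illustration: the function name `de_bruijn_edges` and the toy spectrum are assumptions made for the example and do not come from the lecture. It relies on the observation that every reported k-mer contributes exactly one edge, from its first k-1 characters to its last k-1 characters, so the overlap condition holds automatically.

```python
from collections import Counter

def de_bruijn_edges(kmers):
    """Map each edge (v1, v2) between (k-1)-mers to its multiplicity.

    Hypothetical helper: each k-mer yields one edge from its prefix
    of length k-1 to its suffix of length k-1.
    """
    edges = Counter()
    for kmer in kmers:
        v1, v2 = kmer[:-1], kmer[1:]
        # By construction the last k-2 characters of v1 equal
        # the first k-2 characters of v2.
        edges[(v1, v2)] += 1
    return edges

# Toy spectrum with k = 3 (an assumption, not the spectrum of equation 12.1).
toy_spectrum = ["ACA", "CAA", "AAC"]
edges = de_bruijn_edges(toy_spectrum)
for (v1, v2), count in sorted(edges.items()):
    print(v1, "->", v2, "x", count)
```

Storing the edges in a `Counter` keeps room for repeated k-mers, which matters for the multiplicity issue discussed later in this section.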

**Figure 12.10:** The (k-1)-mer graph constructed given the k-spectrum (k=3) in equation 12.1.

For instance, if

(1)

then the corresponding graph will be the one in figure 12.10. The mathematical problem here is to find an Euler path, that is, a path that uses each edge exactly once. For example, the sequence

T=ACAAACGCACTTAA

is a solution to the instance whose graph is depicted in figure 12.10, corresponding to the Euler path

\begin{displaymath}AC \rightarrow CA \rightarrow AA \rightarrow AA \rightarrow AC \rightarrow CG \rightarrow GC \rightarrow CA \rightarrow AC \rightarrow CT \rightarrow TT \rightarrow TA \rightarrow AA\end{displaymath}
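One standard way to find an Euler path in time linear in the number of edges is Hierholzer's algorithm. The sketch below uses it as an assumption, since the lecture does not say which algorithm it has in mind. The helper names `kmers_of`, `euler_path` and `spell` are hypothetical, and the input is taken to be the list of k-mers with repeats included.

```python
from collections import Counter, defaultdict

def kmers_of(seq, k):
    """All k-mers of seq, in order, with repeats."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def euler_path(kmers):
    """Return a list of (k-1)-mers forming an Euler path, or raise ValueError."""
    out_edges = defaultdict(list)
    balance = defaultdict(int)  # out-degree minus in-degree
    for kmer in kmers:
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1
    # Start at the vertex with one surplus outgoing edge, if there is one.
    start = next((v for v, b in balance.items() if b == 1), kmers[0][:-1])
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # dead end: emit the vertex
    path.reverse()
    # Accept the walk only if it uses every edge exactly once.
    wanted = Counter((m[:-1], m[1:]) for m in kmers)
    if Counter(zip(path, path[1:])) != wanted:
        raise ValueError("the graph has no Euler path")
    return path

def spell(path):
    """Concatenate a vertex path back into a sequence."""
    return path[0] + "".join(v[-1] for v in path[1:])

T = "ACAAACGCACTTAA"
candidate = spell(euler_path(kmers_of(T, 3)))
print(candidate, sorted(kmers_of(candidate, 3)) == sorted(kmers_of(T, 3)))
```

The sequence returned has the same k-mers as T but need not be T itself, because the algorithm returns whichever Euler path it happens to trace first.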

Note that for this construction it is essential to know how many times each k-mer occurs in the target sequence. For instance, if ACA occurs twice in S, then there must be two edges from AC to CA. Otherwise, the reconstructed sequence will not be correct.
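A small example makes the effect of multiplicity concrete. The string `ACACAT` and the function name `edge_multiplicities` are assumptions chosen for illustration. ACA occurs twice in this string, so the edge from AC to CA must be counted twice.

```python
from collections import Counter

def edge_multiplicities(seq, k):
    """Count how often each (k-1)-mer -> (k-1)-mer edge occurs in seq."""
    return Counter(
        (seq[i:i + k - 1], seq[i + 1:i + k])
        for i in range(len(seq) - k + 1)
    )

seq = "ACACAT"  # hypothetical target; ACA occurs twice
k = 3
with_counts = edge_multiplicities(seq, k)
without_counts = set(with_counts)  # what we get if multiplicity is unknown

# An Euler path over E edges spells a string of length E + k - 1.
print(sum(with_counts.values()) + k - 1)  # length implied by the multigraph
print(len(without_counts) + k - 1)        # length implied by the plain edge set
```

With multiplicities the implied length matches `len(seq)`. Without them one edge is missing, so any path over the remaining edges spells a string that is one character too short.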

While this solution is mathematically elegant, there are several problems with using it in a real biological context:

1. For some graph configurations, there is more than one Euler path. In such cases we cannot uniquely reconstruct the sequence. For an example of such a graph, see figure 12.10.
2. As in all biological experiments, the spectrum we measure contains a large proportion of errors. This solution is not robust enough to handle them.
3. A related problem is that of edge multiplicity. We can consider ourselves lucky to know with certainty whether a certain k-mer occurs in our sequence. In most cases we have no way of knowing exactly how many times it occurs.
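The first problem can be observed directly by enumerating every Euler path of a small graph. The backtracking sketch below is an assumption made for illustration: it takes exponential time in general and is not part of the polynomial solution. The name `all_reconstructions` is hypothetical. It is applied to the k-mers (k=3) of the example sequence T given earlier.

```python
from collections import Counter

def all_reconstructions(kmers):
    """Return every sequence spelled by some Euler path of the k-mer multigraph."""
    edges = Counter((m[:-1], m[1:]) for m in kmers)
    total = sum(edges.values())
    results = set()

    def extend(path, used):
        if used == total:
            # Every edge has been used once: spell the sequence.
            results.add(path[0] + "".join(v[-1] for v in path[1:]))
            return
        for u, v in list(edges):
            if u == path[-1] and edges[(u, v)] > 0:
                edges[(u, v)] -= 1           # use one copy of the edge
                extend(path + [v], used + 1)
                edges[(u, v)] += 1           # backtrack

    for start in {u for u, _ in edges}:
        extend([start], 0)
    return results

T = "ACAAACGCACTTAA"
spectrum = [T[i:i + 3] for i in range(len(T) - 2)]
candidates = all_reconstructions(spectrum)
print(T in candidates, len(candidates) > 1)
```

Both printed values are `True` for this input: T is among the candidates, but so are other sequences with the same k-mers, such as ACGCAAACACTTAA.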

The only problem we shall address here is the first one: a graph that represents more than one possible sequence. Figure 12.11 shows the probability that a random string S cannot be uniquely reconstructed from its k-spectrum.

**Figure 12.11:** P(N,k) is the probability that, for a random string S of length N, there exists a different sequence S₀ whose k-spectrum equals that of S.

Some other chip designs achieve somewhat better results, but these designs are only theoretical. They are usually very difficult or impossible to manufacture, and cannot be used in a real biological context.

Sadly, then, sequencing based on hybridization is not a real alternative to standard sequencing.

Peer Itsik
2001-02-01