Fortunately, a polynomial-time solution to this problem exists, due to Pevzner [4]. Define another directed graph G=(V,E). This time the vertices are (k-1)-mers. An edge e=(v1,v2) is introduced if the last k-2 characters of v1 match the first k-2 characters of v2, and the concatenation of the first character of v1, the k-2 common characters, and the last character of v2 forms a k-mer that was reported present in the sequence. This graph is called the de Bruijn graph of the sequence. Each reported k-mer thus corresponds to an edge, and a sequence consistent with the spectrum corresponds to a path that traverses every edge exactly once (an Eulerian path), which can be found in time linear in the number of edges.
Note that for this construction it is essential to know whether a given k-mer occurs more than once in the target sequence. For instance, if ACA occurs twice in S, then there should be two parallel edges from AC to CA. Otherwise, our solution will not be correct.
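As a minimal sketch of the construction above, the following Python builds the multigraph from k-mer counts and then spells a sequence by walking every edge exactly once, using Hierholzer's algorithm for Eulerian paths. The function names and the dictionary-of-counts input format are assumptions made for illustration; they do not come from the original text.

```python
from collections import Counter


def count_kmers(s, k):
    """Count every k-mer of s, keeping multiplicities."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def de_bruijn_graph(kmer_counts):
    """Map each (k-1)-mer to the list of its successors, one entry per edge."""
    graph = {}
    for kmer, count in sorted(kmer_counts.items()):
        if count > 0:
            # A k-mer seen `count` times contributes `count` parallel edges.
            graph.setdefault(kmer[:-1], []).extend([kmer[1:]] * count)
    return graph


def reconstruct(kmer_counts):
    """Spell one sequence along an Eulerian path, or return None if none exists."""
    graph = de_bruijn_graph(kmer_counts)
    if not graph:
        return None
    balance = Counter()  # out-degree minus in-degree of each vertex
    for v, successors in graph.items():
        balance[v] += len(successors)
        for w in successors:
            balance[w] -= 1
    starts = [v for v, b in balance.items() if b == 1]
    if len(starts) > 1 or any(abs(b) > 1 for b in balance.values()):
        return None  # degree conditions for an Eulerian path are violated
    start = starts[0] if starts else min(graph)
    # Hierholzer's algorithm, iterative version.
    remaining = {v: list(ws) for v, ws in graph.items()}
    stack, path = [start], []
    while stack:
        v = stack[-1]
        if remaining.get(v):
            stack.append(remaining[v].pop())
        else:
            path.append(stack.pop())
    path.reverse()
    n_edges = sum(len(ws) for ws in graph.values())
    if len(path) != n_edges + 1:
        return None  # some edges were unreachable from the start vertex
    # The first vertex gives k-1 characters; each later vertex adds one.
    return path[0] + "".join(v[-1] for v in path[1:])


print(reconstruct(count_kmers("ACACA", 3)))  # ACACA
print(reconstruct({"ACA": 1, "CAC": 1}))     # ACAC
```

The second call drops the multiplicity of ACA, so the graph has a single edge from AC to CA and the walk spells a string that is one character too short. This is the failure the paragraph above warns about.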
While this solution is mathematically elegant, there are several problems with using it in a real biological context:
The only problem we shall address here is the first one: the graph may represent more than one possible sequence. Figure 12.11 shows the probability that a random string S cannot be uniquely reconstructed from its k-spectrum.
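To make the ambiguity concrete, here is a small brute-force sketch that lists every string whose k-mer multiset equals a given spectrum. It is a backtracking search, exponential in the worst case, and is meant only for toy inputs; it is not the polynomial method described above. The helper names are assumptions made for illustration.

```python
from collections import Counter


def spectrum(s, k):
    """Count every k-mer of s, keeping multiplicities."""
    return Counter(s[i:i + k] for i in range(len(s) - k + 1))


def all_reconstructions(kmer_counts):
    """Return the sorted list of distinct strings with exactly this k-mer multiset."""
    remaining = Counter({kmer: c for kmer, c in kmer_counts.items() if c > 0})
    if not remaining:
        return []
    total = sum(remaining.values())
    k = len(next(iter(remaining)))
    alphabet = sorted(set("".join(remaining)))
    results = set()

    def extend(seq, used):
        if used == total:
            results.add(seq)
            return
        suffix = seq[len(seq) - (k - 1):]
        for c in alphabet:
            kmer = suffix + c
            if remaining[kmer] > 0:
                remaining[kmer] -= 1      # use one occurrence of this k-mer
                extend(seq + c, used + 1)
                remaining[kmer] += 1      # backtrack

    for first in list(remaining):
        remaining[first] -= 1
        extend(first, 1)
        remaining[first] += 1
    return sorted(results)


print(all_reconstructions(spectrum("ATGGCGTGCA", 3)))
# ['ATGCGTGGCA', 'ATGGCGTGCA']
```

Both strings contain the same eight 3-mers, each exactly once, so the spectrum alone cannot tell them apart. In the graph, they correspond to two different Eulerian paths from AT to CA.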
Some other chip designs achieve somewhat better results, but these designs are only theoretical. They are usually very difficult or impossible to manufacture, and cannot be used in a real biological context.
Sadly, then, sequencing based on hybridization is not a real alternative to standard sequencing.