Next: BLAST - Basic Local Up: No Title Previous: Introduction - Sequence Alignment

FASTA

The FASTA algorithm is a heuristic method for string comparison. It was developed by Lipman and Pearson in 1985 [6] and further improved in 1988 [7].

FASTA compares a query string against a single text string. To search a whole database for matches to a given query, the query is compared, using the FASTA algorithm, to every string in the database.

When looking for an alignment, we might expect to find a few segments in which there is absolute identity between the two compared strings. The algorithm exploits this property and focuses on these identical regions.

The stages in the FASTA algorithm are as follows:

1.

We specify an integer parameter called ktup (short for k-tuple), and we look for ktup-length matching substrings of the two strings. The standard recommended ktup values are six for DNA sequence matching and two for protein sequence matching. The matching ktup-length substrings are referred to as hot spots. Consecutive hot spots are located along the dynamic programming matrix diagonals. This stage can be done efficiently by using a lookup table or a hash to store all the ktup-length substrings from one string, and then searching the table with the ktup-length substrings from the other string.
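The lookup-table idea of this stage can be sketched as follows. This is a minimal illustration, not the FASTA source code: the function names find_hot_spots and group_by_diagonal are hypothetical, and the diagonal of a hot spot at query position i and text position j is taken here to be j - i.

```python
from collections import defaultdict

def find_hot_spots(query, text, ktup):
    """Return (i, j) pairs such that query[i:i+ktup] == text[j:j+ktup]."""
    # Lookup table: each ktup-length substring of the query -> its start positions.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)

    hot_spots = []
    # Scan the text once and look up each of its ktup-length substrings.
    for j in range(len(text) - ktup + 1):
        for i in table.get(text[j:j + ktup], ()):
            hot_spots.append((i, j))
    return hot_spots

def group_by_diagonal(hot_spots):
    """Group hot spots by their diagonal index j - i (a convention assumed here)."""
    diagonals = defaultdict(list)
    for i, j in hot_spots:
        diagonals[j - i].append((i, j))
    return diagonals

spots = find_hot_spots("ACGTAC", "TTACGTAC", ktup=3)
by_diag = group_by_diagonal(spots)
```

In this small example, four overlapping hot spots fall on the same diagonal (index 2), which is the kind of cluster the next stage looks for.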

2.

In this stage we wish to find the 10 best diagonal runs of hot spots in the matrix. A diagonal run is a sequence of nearby hot spots on the same diagonal (not necessarily adjacent along the diagonal, i.e., spaces between these hot spots are allowed). A run need not contain all the hot spots on its diagonal, and a diagonal may contain more than one of the 10 best runs we find.

In order to evaluate the diagonal runs, FASTA gives each hot spot a positive score, and the space between consecutive hot spots in a run is given a negative score that becomes more negative as the distance between them grows. The score of a diagonal run is the sum of its hot spot scores and its interspot scores. FASTA finds the 10 highest scoring diagonal runs under this scoring scheme.
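One way to picture this scoring is the sketch below. It rests on several assumptions that are not taken from the text: every hot spot gets the same score (hot_score), the interspot penalty is linear in the number of positions between hot spots (gap_penalty), and a run is closed as soon as its running score drops to zero or below. The function name best_diagonal_runs is hypothetical, and the actual FASTA scoring values may differ.

```python
import heapq
from collections import defaultdict

def best_diagonal_runs(hot_spots, ktup, hot_score=5, gap_penalty=1, top=10):
    """Return up to `top` runs as (score, diagonal, start_i, end_i), best first.

    hot_spots are (i, j) start positions; end_i is exclusive.
    """
    diagonals = defaultdict(list)
    for i, j in hot_spots:
        diagonals[j - i].append(i)

    runs = []
    for diag, starts in diagonals.items():
        starts.sort()
        run_start = best_end = prev_end = None
        score = best = 0
        for i in starts:
            if run_start is not None:
                # Interspot penalty grows with the distance between hot spots;
                # overlapping hot spots are not penalized.
                score -= gap_penalty * max(0, i - prev_end)
            if run_start is None or score <= 0:
                # The current run is not worth extending: record it, start afresh.
                if run_start is not None:
                    runs.append((best, diag, run_start, best_end))
                run_start, score, best = i, 0, 0
            score += hot_score
            prev_end = i + ktup
            if score > best:
                best, best_end = score, prev_end
        if run_start is not None:
            runs.append((best, diag, run_start, best_end))
    return heapq.nlargest(top, runs)

runs = best_diagonal_runs([(3, 1), (0, 2), (1, 3), (2, 4), (3, 5)], ktup=3)
```

Because a run is closed whenever its score falls to zero, a single diagonal can contribute several runs, and a run need not include every hot spot on its diagonal.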

3.

A diagonal run specifies a pair of aligned substrings. The alignment is composed of matches (the hot spots) and mismatches (from the interspot regions), but it does not contain any indels because it is derived from a single diagonal. We next evaluate the runs using an amino acid (or nucleotide) substitution matrix, and pick the best scoring run. The single best subalignment found in this stage is called init₁. Apart from computing init₁, a filtration is performed: we discard the diagonal runs that achieve relatively low scores.

4.

Until now we essentially did not allow any indels in the subalignments. We now try to combine "good" diagonal runs from close diagonals, thus achieving a subalignment with indels allowed. We take "good" subalignments from the previous stage (subalignments whose score is above some specified cutoff) and attempt to combine them into a single larger high-scoring alignment that allows some spaces. This can be done in the following way:

We construct a directed weighted graph whose vertices are the subalignments found in the previous stage; the weight of each vertex is the score, found in the previous stage, of the subalignment it represents. Next, we extend an edge from vertex u to vertex v if the subalignment represented by v starts at a lower row than the row where the subalignment represented by u ends. We give the edge a negative weight which depends on the number of gaps that would be created by aligning according to subalignment u followed by subalignment v. Essentially, FASTA then finds a maximum weight path in this graph. The selected path specifies a single local alignment between the two strings. The best alignment found in this stage is marked init_n. As in the previous stage, we discard alignments with relatively low scores.
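The maximum weight path can be found by dynamic programming, since edges only go from a subalignment to one that starts in a lower row, so the graph is acyclic. The sketch below is illustrative only. The Sub record, the best_chain function, and the gap_cost parameter are hypothetical names; the edge weight is assumed to be proportional to the shift between the two diagonals; and, beyond the row condition stated above, the sketch also requires the second subalignment to start to the right of the first so that the chain is a valid alignment.

```python
from typing import NamedTuple

class Sub(NamedTuple):
    score: int      # score of the subalignment from the previous stage
    row_start: int  # first row (query position) covered
    row_end: int    # last row covered
    col_start: int  # first column (text position) covered

def best_chain(subs, gap_cost=2):
    """Return (total score, list of indices into subs) of a maximum weight path."""
    # Sorting by starting row gives a topological order of the graph.
    order = sorted(range(len(subs)), key=lambda k: subs[k].row_start)
    best, prev = {}, {}
    for pos, v in enumerate(order):
        sv = subs[v]
        best[v], prev[v] = sv.score, None      # a path may start at v
        for u in order[:pos]:
            su = subs[u]
            # A diagonal run covers as many columns as rows.
            col_end_u = su.col_start + (su.row_end - su.row_start)
            if sv.row_start > su.row_end and sv.col_start > col_end_u:
                # Assumed edge weight: proportional to the diagonal shift.
                shift = abs((sv.col_start - sv.row_start)
                            - (su.col_start - su.row_start))
                cand = best[u] - gap_cost * shift + sv.score
                if cand > best[v]:
                    best[v], prev[v] = cand, u
    if not best:
        return 0, []
    end = max(best, key=best.get)
    chain, k = [], end
    while k is not None:
        chain.append(k)
        k = prev[k]
    return best[end], chain[::-1]

subs = [Sub(20, 0, 9, 0), Sub(15, 12, 20, 14), Sub(4, 30, 33, 5)]
total, chain = best_chain(subs)
```

Here the first two subalignments are joined at the cost of a two-diagonal shift, while the third cannot follow either of them because it lies to their left.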

5.

In this step FASTA computes an alternative local alignment score, in addition to init_n. Recall that init₁ defines a diagonal segment in the dynamic programming matrix. We consider a narrow diagonal band in the matrix, centered along this segment. It is highly likely that the best alignment path between the init₁ substrings lies within the subtable defined by the band. We assume this is the case and compute the optimal local alignment in this band, using the ordinary dynamic programming algorithm. Assuming that the best local alignment is indeed within the defined band, the local alignment algorithm essentially merges diagonal runs found in the previous stages to achieve a local alignment which may contain indels. The band width depends on the choice of ktup. The best local alignment computed in this stage is called opt.
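A banded version of the local alignment recurrence might look like the sketch below. It is a simplification under stated assumptions: a fixed match/mismatch score stands in for a substitution matrix, gaps have a linear cost, and the names banded_local_score, diagonal, and half_width are hypothetical. Only cells (i, j) with j - i within half_width of the chosen diagonal are filled in.

```python
def banded_local_score(a, b, diagonal, half_width, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of a and b restricted to a diagonal band."""
    prev = {}   # scores of row i-1, keyed by column; cells outside the band are absent
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i + diagonal - half_width)
        hi = min(len(b), i + diagonal + half_width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A missing neighbour counts as 0, which is harmless because a
            # local alignment may always restart from score 0.
            h = max(0,
                    prev.get(j - 1, 0) + s,    # match or mismatch
                    prev.get(j, 0) + gap,      # gap in b
                    cur.get(j - 1, 0) + gap)   # gap in a
            cur[j] = h
            best = max(best, h)
        prev = cur
    return best

wide = banded_local_score("AAACCCGGG", "AAACCGGG", diagonal=0, half_width=1)
narrow = banded_local_score("AAACCCGGG", "AAACCGGG", diagonal=0, half_width=0)
```

With half_width=0 the band is a single diagonal and no indel fits; widening it to 1 lets the alignment absorb the extra C with one gap and score higher. The work is proportional to the sequence length times the band width rather than to the product of the two lengths.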

6.

In the last stage, the database sequences are ranked according to init_n scores or opt scores, and the full dynamic programming algorithm is used to align the query sequence against each of the highest ranking result sequences.

Although FASTA is a heuristic, and as such there are instances in which the alignments it finds are not optimal, it is claimed (and supported by experience) that the resulting alignment scores compare well to those of the optimal alignment, while the FASTA algorithm is much faster than the ordinary dynamic programming alignment algorithm.


Itshack Pe'er
1999-01-10