Next: BLAST - Basic Local
Up: No Title
Previous: Introduction - Sequence Alignment
FASTA
The FASTA algorithm is a heuristic method for string comparison. It was developed by Lipman and Pearson
in 1985 [6] and further improved in 1988 [7].
FASTA compares a query string against a single text string. When searching the whole database for matches to
a given query, we compare the query using the FASTA algorithm to every string in the database.
When looking for an alignment, we might expect to find a few segments in which the two compared strings are
exactly identical. The algorithm exploits this property and focuses on these identical regions.
The stages in the FASTA algorithm are as follows:
- 1.
- We specify an integer parameter called ktup (short for k-tuple), and we look for ktup-length matching substrings
of the two strings. The standard recommended ktup values are six for DNA sequence matching and two for protein sequence matching.
The matching ktup-length substrings are referred to as hot spots. Consecutive hot spots lie along the diagonals of
the dynamic programming matrix. This stage can be done efficiently by using a lookup table or a hash to store all the ktup-length
substrings from one string, and then searching the table with the ktup-length substrings from the other string.
- 2.
- In this stage we wish to find the 10 best diagonal runs of hot spots in the matrix. A diagonal run is
a sequence of nearby hot spots on the same diagonal (not necessarily adjacent along the diagonal, i.e.,
spaces between these hot spots are allowed). A run need not contain all the hot spots on its diagonal, and a diagonal may contain
more than one of the 10 best runs we find.
In order to evaluate the diagonal runs, FASTA gives each hot spot a positive score, and the space between
consecutive hot spots in a run is given a negative score that becomes more negative as the distance grows.
The score of a diagonal run is the sum of its hot-spot scores and its interspot scores. FASTA finds the 10 highest
scoring diagonal runs under this scoring scheme.
- 3.
- A diagonal run specifies a pair of aligned substrings. The alignment is composed of matches (the hot spots)
and mismatches (from the interspot regions), but it does not contain any indels because it is derived from a single
diagonal. We next evaluate the runs using an amino acid (or nucleotide) substitution matrix, and pick the best scoring run. The
single best subalignment found in this stage is called init1. Apart from computing init1, a filtration is
performed: we discard the diagonal runs that achieve relatively low scores.
- 4.
- Until now we essentially did not allow any indels in the subalignments. We now try to combine "good" diagonal
runs from close diagonals, thus achieving a subalignment with indels allowed. We take "good" subalignments from
the previous stage (subalignments whose score is above some specified cutoff) and attempt to combine them into
a single larger high-scoring alignment that allows some spaces. This can be done in the following way:
We construct a directed weighted graph whose vertices are the subalignments found in the previous stage, and the
weight of each vertex is the score, found in the previous stage, of the subalignment it represents. Next, we extend
an edge from vertex u to vertex v if the subalignment represented by v starts at a lower row than where the
subalignment represented by u ends. We give the edge a negative weight which depends on the number of gaps that would be created
by aligning according to subalignment u followed by subalignment v. Essentially, FASTA then finds a maximum weight
path in this graph. The selected alignment specifies a single local alignment between the two strings. The best
alignment found in this stage is marked initn. As in the previous stage, we discard alignments with relatively
low scores.
- 5.
- In this step FASTA computes an alternative local alignment score, in addition to initn. Recall that init1 defines
a diagonal segment in the dynamic programming matrix. We consider a narrow diagonal band in the matrix, centered
along this segment. It is highly likely that the best alignment path between the init1 substrings lies
within the subtable defined by the band. We assume this is the case and compute the optimal local alignment in this band,
using the ordinary dynamic programming algorithm. Assuming that the best local alignment is indeed within the defined band, the
local alignment algorithm essentially merges diagonal runs found in the previous stages to achieve a local alignment
which may contain indels. The band width depends on the
choice of ktup. The best local alignment computed in this stage is called opt.
- 6.
- In the last stage, the database sequences are ranked according to initn scores or opt scores, and
the full dynamic programming algorithm is used to align the query sequence against each of the
highest ranking result sequences.
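The first two stages can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not FASTA's actual implementation: the function names (find_hot_spots, best_diagonal_runs), the flat hot-spot score of 2, and the linear penalty of 1 per position between hot spots are all hypothetical choices made for the example. A hot spot (i, j) lies on diagonal j - i, so grouping hot spots by that difference gives the candidates for diagonal runs.

```python
from collections import defaultdict


def find_hot_spots(query, text, ktup):
    """Return all (i, j) with query[i:i+ktup] == text[j:j+ktup]."""
    # Lookup table: each ktup-length substring of the query -> its start positions.
    table = defaultdict(list)
    for i in range(len(query) - ktup + 1):
        table[query[i:i + ktup]].append(i)
    # Scan the text and look up each of its ktup-length substrings.
    hot_spots = []
    for j in range(len(text) - ktup + 1):
        for i in table.get(text[j:j + ktup], ()):
            hot_spots.append((i, j))
    return hot_spots


def best_diagonal_runs(hot_spots, ktup, top=10, spot_score=2, gap_penalty=1):
    """Return up to `top` runs as (score, diagonal, query_start, query_end).

    Assumed scoring (not FASTA's real one): every hot spot adds spot_score,
    every position between consecutive hot spots costs gap_penalty.
    """
    by_diagonal = defaultdict(list)
    for i, j in hot_spots:
        by_diagonal[j - i].append(i)

    runs = []
    for diagonal, starts in by_diagonal.items():
        starts.sort()
        score, run_start, best, prev_end = 0, None, None, None
        for i in starts:
            gap = 0 if prev_end is None else max(0, i - prev_end)
            if run_start is None or score - gap_penalty * gap <= 0:
                # Extending would not help: close the current run, start a new one.
                if best is not None:
                    runs.append(best)
                run_start, score, best = i, spot_score, None
            else:
                score += spot_score - gap_penalty * gap
            prev_end = i + ktup
            if best is None or score > best[0]:
                best = (score, diagonal, run_start, prev_end)
        if best is not None:
            runs.append(best)

    runs.sort(key=lambda run: run[0], reverse=True)
    return runs[:top]


spots = find_hot_spots("ACGTACGT", "TTACGTAA", ktup=3)
runs = best_diagonal_runs(spots, ktup=3)
```

For these two toy strings the sketch finds six hot spots on two diagonals, and each diagonal yields one run covering five matching characters. Because a run is closed as soon as extending it stops paying off, a single diagonal can contribute several runs, as the description of stage 2 allows.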
Although FASTA is a heuristic, and as such it is possible to show instances in which the alignments found by
the algorithm are not optimal, it is claimed (and supported by experience) that the resulting alignment scores
compare well to those of the optimal alignment, while the FASTA algorithm is much faster than the ordinary dynamic
programming alignment algorithm.
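Much of the speed gain comes from restricting dynamic programming to a narrow band, as in stage 5. The following Python sketch shows one way such a banded local alignment score could be computed. It is an illustration only: the function name banded_local_score, the simple match/mismatch/gap scores (standing in for a substitution matrix), and the explicit width parameter are assumptions of the example, whereas in FASTA the band width depends on ktup. Cell (i, j) is inside the band when j - i differs from the chosen diagonal by at most width; cells outside the band are treated as unreachable.

```python
def banded_local_score(a, b, diagonal, width, match=1, mismatch=-1, gap=-2):
    """Best local alignment score using only cells with |(j - i) - diagonal| <= width.

    Rows i index a and columns j index b (both 1-based inside the table).
    """
    unreachable = float("-inf")
    best = 0
    prev = {}  # column -> score, for in-band cells of the previous row
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i + diagonal - width)
        hi = min(len(b), i + diagonal + width)
        for j in range(lo, hi + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score = max(
                0,                                   # start a new local alignment
                prev.get(j - 1, 0) + sub,            # diagonal move (stays in the band)
                prev.get(j, unreachable) + gap,      # move down: gap in b
                cur.get(j - 1, unreachable) + gap,   # move right: gap in a
            )
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


# Width 0 follows a single diagonal; width 1 also admits a one-position indel.
no_indel = banded_local_score("ACGTTGCA", "ACGTGCA", diagonal=0, width=0)
with_indel = banded_local_score("ACGTTGCA", "ACGTGCA", diagonal=0, width=1)
```

The work is proportional to the sequence length times the band width rather than to the product of the two sequence lengths. In the example, widening the band from 0 to 1 lets the alignment absorb the extra T in the first string, raising the score from 4 to 5.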
Itshack Pe`er
1999-01-10