FastA scores

The FastA algorithm estimates the distribution of these scores by empirical simulation. Note that this random distribution is not Gaussian; under reasonable assumptions it is an extreme value distribution. The easy case is when the distribution of real query scores is very different from the distribution of random scores, so it is easy to say whether a score is significant.

**Figure 5.7:** Easy Case: Illustration of an easy case of estimating significance. The scores of truly related records are distributed away from those of random records and can thus be easily identified.
$\includegraphics[width=11cm]{lec05_picturs/slide1I.eps}$

A more complex case is presented in the next figure. Here the random and real distributions overlap over a large range of the score axis. This means that many random records have a score better than or equal to the query score, and it is hard to distinguish truly related records from random ones.

**Figure 5.8:** Complex Case: Illustration of a complex case of estimating significance. The dark area represents the number of random records (shuffled query sequence) that exceed the query score. In this case the common area between the random plot and the real plot is large, which makes it hard to distinguish between the real and random ones.
$\includegraphics[width=11cm]{lec05_picturs/slide2I.eps}$
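The random distribution in the figures can be approximated by shuffling the query and rescoring it. The following is a minimal sketch of that idea, not FastA's actual procedure: the scoring function `ungapped_local_score`, its match/mismatch values, and the helper `shuffled_scores` are all illustrative assumptions.

```python
import random


def ungapped_local_score(a, b, match=1, mismatch=-1):
    """Best ungapped local alignment score (toy scoring, an assumption)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            step = match if a[i - 1] == b[j - 1] else mismatch
            # Extend the diagonal run, or restart at zero if it went negative.
            cur[j] = max(0, prev[j - 1] + step)
            best = max(best, cur[j])
        prev = cur
    return best


def shuffled_scores(query, target, n_shuffles=200, seed=0):
    """Scores of randomly shuffled copies of the query against the target."""
    rng = random.Random(seed)
    letters = list(query)
    scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # same composition, random order
        scores.append(ungapped_local_score("".join(letters), target))
    return scores


query = "ACGTACGTTGCA"
target = "TTACGTACGTAA"
real_score = ungapped_local_score(query, target)
random_scores = shuffled_scores(query, target)
# Number of shuffled queries scoring at least as well as the real query.
n_exceeding = sum(1 for s in random_scores if s >= real_score)
```

A small `n_exceeding` corresponds to the easy case, a large one to the complex case.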

The widely accepted solution is to consider the E-value: the expected number of records in the random distribution whose score exceeds our query score. If the E-value is low enough, we consider the score to be significant.

For example, an E-value of $\sim 10^{-50}$ suggests, with extremely high confidence, that the query is evolutionarily related to the target matched in the database.
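As a rough illustration of the definition, the E-value can be estimated from a sample of random scores by scaling the observed tail fraction to the database size. This is only a sketch: the function name `empirical_e_value`, the sample scores, and the database size are made up for the example, and real search programs use a fitted distribution rather than a raw count.

```python
def empirical_e_value(query_score, random_scores, db_size):
    """Expected number of random records scoring >= query_score.

    Assumes random_scores is a representative sample of scores
    obtained under the random model (hypothetical input).
    """
    if not random_scores:
        raise ValueError("need at least one random score")
    exceeding = sum(1 for s in random_scores if s >= query_score)
    tail_fraction = exceeding / len(random_scores)
    return tail_fraction * db_size


# Made-up sample: 2 of the 10 random scores reach the query score of 24.
sample = [12, 15, 9, 22, 30, 18, 11, 14, 25, 16]
e = empirical_e_value(24, sample, db_size=1000)
```

A raw count like this cannot resolve very small E-values (it returns 0 once no sampled score reaches the query score), which is one reason a parametric tail model is used in practice.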

Another way of measuring the significance of a score considers its distance from the mean of the random score distribution. This distance is normalized by the standard deviation of that distribution to form the Z-score. Higher Z-scores are better: the further the real score lies above this mean (in standard deviation units), the more significant it is.

**Figure 5.9:** FastA Z-score: The arrows represent the distance, in standard deviation units, from the distribution mean. Z-score = (RawScore - mean) / Std.Dev
$\includegraphics[width=11cm]{lec05_picturs/slide3I.eps}$
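The Z-score formula from the caption can be sketched directly. The function name `z_score` and the sample values are illustrative, and the use of the population standard deviation (`statistics.pstdev`) rather than the sample one is an assumption.

```python
import statistics


def z_score(raw_score, random_scores):
    """(RawScore - mean) / Std.Dev over a sample of random scores."""
    mean = statistics.mean(random_scores)
    std_dev = statistics.pstdev(random_scores)  # population std. dev. (assumption)
    if std_dev == 0:
        raise ValueError("random scores have zero spread")
    return (raw_score - mean) / std_dev


# Made-up random scores with mean 14 and standard deviation sqrt(8).
z = z_score(28, [10, 12, 14, 16, 18])
```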

Under reasonable assumptions, the random score distribution for optimal ungapped local alignments can be proved to follow an extreme value distribution (which differs significantly from the normal distribution) [3]. In the current versions of the FastA and BLAST search programs, the evaluation of statistical significance is based upon the extreme value distribution. These evaluations take the form of E-values.
Once the E-value is known, one can count how many records were observed with that E-value or lower, and compare this number with its theoretical distribution under the random model. The probability of the observed number (the P-value) measures its significance: the smaller it is, the more unusual our result.
More formally, define the random variable $Y_E$ as the observed number of random records achieving E-value $E$ or better (smaller). $Y_E$ is Poisson distributed with parameter $E$:

$$P(Y_E = k) = e^{-E}\,\frac{E^k}{k!}, \qquad k = 0, 1, 2, \ldots$$

Note that this model assumes an i.i.d. trial for each database record.
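A small sketch of the Poisson model: given an E-value and an observed count $k$, it computes the probability of exactly $k$ random records and of at least $k$. Reading the P-value as the tail probability $P(Y_E \ge k)$ is an assumption made here; the function names are illustrative.

```python
import math


def poisson_pmf(k, e_value):
    """P(Y_E = k) for a Poisson variable with parameter e_value."""
    return math.exp(-e_value) * e_value ** k / math.factorial(k)


def poisson_tail(k, e_value):
    """P(Y_E >= k): chance of seeing at least k such records at random."""
    return 1.0 - sum(poisson_pmf(i, e_value) for i in range(k))


# With E = 0.01, seeing even one such record at random is unlikely.
p_one_or_more = poisson_tail(1, 0.01)
```

For $k = 1$ the tail reduces to $1 - e^{-E}$, which is approximately $E$ when $E$ is small.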
Under these random-model assumptions, the alignment score behaves like a random walk as the alignment is extended. The probability that the highest score in such a random walk is at least $s$ decreases exponentially with $s$ [1].
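The exponential decay can be checked with a toy simulation. The sketch below assumes a simple walk that steps +1 with probability 0.25 (a match) and -1 otherwise (a mismatch); these step values, the walk length, and the function names are illustrative choices, not the scoring scheme of any real program. For an unbounded walk of this kind the probability of ever reaching height $s$ is $(p/(1-p))^s$, here $(1/3)^s$.

```python
import random


def max_of_walk(rng, n_steps, p_up):
    """Highest point reached by a +1/-1 walk that starts at 0."""
    position = 0
    highest = 0
    for _ in range(n_steps):
        position += 1 if rng.random() < p_up else -1
        highest = max(highest, position)
    return highest


def tail_probabilities(max_s=6, n_walks=2000, n_steps=200, p_up=0.25, seed=1):
    """Estimated P(max of walk >= s) for s = 0..max_s."""
    rng = random.Random(seed)
    maxima = [max_of_walk(rng, n_steps, p_up) for _ in range(n_walks)]
    return [sum(1 for m in maxima if m >= s) / n_walks
            for s in range(max_s + 1)]


tails = tail_probabilities()
for s, p in enumerate(tails):
    # Estimated tail next to the theoretical value for an unbounded walk.
    print(s, p, (1 / 3) ** s)
```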