The FastA algorithm estimates the distribution of these scores by empirical simulation. Note that this random distribution is not Gaussian; under reasonable assumptions it is an extreme value distribution. The easy case is where the real query score distribution is very different from the random score distribution, so it is easy to say whether a score is significant.
A more complex case is presented in the next figure. Here the random and real distributions overlap over a large region of the score axis. This means that many random records have a score better than or equal to the query score, and it is hard to distinguish truly related records from random ones.
The widely accepted solution is to consider the E-value: the expected number of records in the random distribution whose score exceeds our query score. If the E-value is low enough, we consider the score to be significant.
I.e., a sufficiently small E-value suggests, with extremely high confidence, that the query is evolutionarily related to the target matched in the database.
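The definition of the E-value can be sketched directly from a sample of random scores. This is only an illustration, not the FastA implementation: the function name `empirical_e_value`, the parameter `database_size`, and the sample scores are all hypothetical, and ties are counted as "at least as good" as the query score.

```python
def empirical_e_value(query_score, random_scores, database_size):
    """Estimate the expected number of random records that score
    at least as well as query_score in a database of the given size.

    random_scores is a sample of scores obtained from unrelated
    (e.g. simulated or shuffled) records; it is assumed to be non-empty.
    """
    # Count the random scores that match or beat the query score.
    exceed = sum(1 for s in random_scores if s >= query_score)
    # Scale the tail fraction up to the size of the database.
    return database_size * exceed / len(random_scores)


if __name__ == "__main__":
    # Hypothetical sample of random scores, for illustration only.
    sample = [10, 12, 15, 20, 22, 30, 31, 35, 40, 55]
    print(empirical_e_value(35, sample, database_size=1000))
```

With a small sample the estimate is coarse in the far tail, which is one reason for fitting a theoretical distribution to the random scores rather than counting directly.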
Another way of measuring the significance of a score considers its distance from the mean of the random score distribution. This distance is normalized by the standard deviation of that distribution to form the Z-score. Higher Z-scores are better: the further the real score lies above this mean (in standard deviation units), the more significant it is.
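As a minimal sketch of the Z-score computation, assuming the random score distribution is represented by a sample of scores (the function name and the use of the sample standard deviation are illustrative choices, not taken from the source):

```python
import statistics


def z_score(query_score, random_scores):
    """Distance of query_score from the mean of the random scores,
    measured in standard deviations of the random scores.

    random_scores must contain at least two distinct values.
    """
    mean = statistics.mean(random_scores)
    sd = statistics.stdev(random_scores)  # sample standard deviation
    return (query_score - mean) / sd


if __name__ == "__main__":
    # Hypothetical random scores, for illustration only.
    print(z_score(11, [2, 4, 4, 4, 5, 5, 7, 9]))
```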
Under reasonable assumptions, the random score distribution for optimal ungapped local alignments can be proved to follow an extreme value distribution, which differs significantly from the normal distribution [3]. In the current versions of the FastA and BLAST search programs, the evaluation of statistical significance is based on the extreme value distribution. These evaluations take the form of E-values.
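The practical difference between the two distributions shows up in the upper tail. The sketch below assumes the type I (Gumbel) form of the extreme value distribution, rescaled to mean 0 and standard deviation 1 so that it can be compared with the standard normal at the same Z-score. The function names are illustrative, and this is not how FastA or BLAST fit their parameters.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def normal_tail(z):
    """P(X >= z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def gumbel_tail(z):
    """P(X >= z) for a Gumbel variable rescaled to mean 0, sd 1."""
    beta = math.sqrt(6.0) / math.pi   # scale that gives variance 1
    mu = -EULER_GAMMA * beta          # location that gives mean 0
    # 1 - exp(-exp(-(z - mu) / beta)), written with expm1 for accuracy
    return -math.expm1(-math.exp(-(z - mu) / beta))


if __name__ == "__main__":
    for z in (2.0, 3.0, 5.0):
        print(z, normal_tail(z), gumbel_tail(z))
```

At large Z-scores the extreme value tail is far heavier than the normal tail, so treating random scores as Gaussian would overstate the significance of high scores.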
Once the E-value is known, one can count how many records were observed with that E-value or lower, and compare this number with its theoretical distribution under the random model.
The probability of the observed number (the P-value) measures its significance:
The smaller it is, the more unusual our result is.
More formally:
Define the random variable:
Ye = the observed number of random records achieving E-value E or better (smaller).
Ye is Poisson distributed with parameter E: P(Ye = k) = e^(-E) * E^k / k!, for k = 0, 1, 2, ...
Note that this model assumes an i.i.d. trial for each database record.
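A short sketch of the Poisson model follows. It takes the P-value to be the probability of seeing at least the observed number of records, which is one common reading and an assumption here; the function names are illustrative.

```python
import math


def poisson_pmf(k, e_value):
    """P(Ye = k) when Ye is Poisson distributed with parameter e_value."""
    return math.exp(-e_value) * e_value ** k / math.factorial(k)


def poisson_p_value(observed, e_value):
    """P(Ye >= observed): the chance that the random model yields
    at least `observed` records with E-value e_value or better."""
    below = sum(poisson_pmf(i, e_value) for i in range(observed))
    # Guard against tiny negative values caused by rounding.
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    # Hypothetical case: 3 records observed at an E-value of 0.5.
    print(poisson_p_value(3, 0.5))
```

For a single observed record the P-value reduces to 1 - e^(-E), which is close to E itself when E is small.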
Under these assumptions, the alignment score behaves like a random walk as the alignment is extended. The probability that the highest score in such a random walk is at least s decreases exponentially with s [1].
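The exponential decay can be seen in a toy simulation. The source does not specify the walk, so the sketch below assumes the simplest possible one: steps of +1 with probability p and -1 otherwise, with p < 1/2 so that the walk drifts downward. For that walk the probability that the maximum ever reaches s is (p / (1 - p))^s, a standard gambler's-ruin result, which the simulation should approximate. All names and parameter values are illustrative.

```python
import random


def max_of_walk(rng, p_up, steps):
    """Highest position reached by a +1/-1 walk started at 0."""
    position = 0
    highest = 0
    for _ in range(steps):
        position += 1 if rng.random() < p_up else -1
        highest = max(highest, position)
    return highest


def tail_estimates(p_up=0.3, steps=100, trials=10000, max_s=4, seed=0):
    """Estimate P(max >= s) for s = 1..max_s by simulation.

    The walk is truncated after `steps` steps; with a strong downward
    drift the maximum is almost always reached well before that.
    """
    rng = random.Random(seed)
    maxima = [max_of_walk(rng, p_up, steps) for _ in range(trials)]
    return {s: sum(m >= s for m in maxima) / trials
            for s in range(1, max_s + 1)}


if __name__ == "__main__":
    p = 0.3
    for s, estimate in tail_estimates(p_up=p).items():
        print(s, estimate, (p / (1 - p)) ** s)
```

Each unit increase in s multiplies the probability by the same factor p / (1 - p), which is what exponential decay in s means.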