Given a database of sequences, we would like to partition it into
families of ``similar'' sequences. For this purpose we would like
to capture our knowledge of the common properties of the sequences
in a family profile (formal definition to follow).
Constructing a profile of a family enables us to identify its members
and test whether or not a new sequence belongs to the family. Moreover,
searching the database with a profile is more sensitive than searching using a
single sequence of the family: when searching with a single
sequence, we can only look for the database sequences that align best
with that one sequence, whereas with a profile we
can test for membership in the family.
Definition 5.15
For an alignment S' of length l, a profile is a
matrix with l columns and one row per symbol, whose columns are
probability vectors denoting the frequencies of each symbol in the
corresponding alignment column.
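To make the definition concrete, here is a minimal sketch of building a profile from an alignment. The function name `build_profile`, the DNA alphabet with a gap symbol `-`, and the toy alignment are assumptions made for illustration only; the notes do not fix an alphabet or say whether the gap symbol gets its own row.

```python
from collections import Counter

def build_profile(alignment, alphabet="ACGT-"):
    """Return one dict per alignment column, mapping symbol -> relative frequency.

    Assumption: the gap symbol '-' is counted like any other symbol.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all rows of an alignment must have the same length")
    profile = []
    for j in range(length):
        # Count how often each symbol occurs in column j.
        counts = Counter(row[j] for row in alignment)
        # Normalise the counts so the column is a probability vector.
        profile.append({s: counts[s] / len(alignment) for s in alphabet})
    return profile

# Hypothetical toy alignment of four sequences, length 4.
alignment = ["ACG-", "ACGT", "A-GT", "TCGT"]
profile = build_profile(alignment)
```

Each entry of `profile` is one column of the matrix, so `profile[0]["A"]` is the frequency of `A` in the first alignment column.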
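Any alignment between a sequence B and a profile P (i.e. both have the
same length l) can be evaluated by
$$\sum_{i=1}^{l} \sigma(b_i, p_i),$$
where $b_i$ is the $i$-th letter of B, $p_i$ is the $i$-th column of P, and
$\sigma$ scores a letter against a profile column.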
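As an illustration of evaluating an equal-length alignment, the sketch below adds up, position by position, the score of each letter against its profile column. It assumes the per-position score is simply the frequency of the letter in that column; the function name `score_ungapped` and the hand-written profile are hypothetical.

```python
def score_ungapped(sequence, profile):
    """Sum the per-position scores of a sequence against a profile of equal length.

    Assumption: the score of a letter against a column is the letter's
    frequency in that column (0.0 if the letter never occurs there).
    """
    if len(sequence) != len(profile):
        raise ValueError("sequence and profile must have the same length")
    return sum(column.get(letter, 0.0)
               for letter, column in zip(sequence, profile))

# Hypothetical profile with four columns (symbol -> frequency).
profile = [
    {"A": 0.75, "T": 0.25},
    {"C": 0.75, "-": 0.25},
    {"G": 1.0},
    {"T": 0.75, "-": 0.25},
]
score = score_ungapped("ACGT", profile)
```

A sequence that follows the most frequent symbol in every column receives the highest score under this scheme.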
Clearly, using dynamic programming, we can find the best alignment
of a sequence against a profile.
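A rough sketch of such a dynamic program follows. It is a global alignment recurrence in the style of ordinary pairwise alignment, with profile columns playing the role of the second sequence. The constant `gap_penalty`, the use of column frequencies as match scores, and the function name `align_to_profile` are all assumptions made for illustration; the notes leave the gap score open.

```python
def align_to_profile(sequence, profile, gap_penalty=-1.0):
    """Return the best global alignment score of a sequence against a profile.

    Assumptions: matching a letter to a column scores the letter's frequency
    in that column; skipping a letter or a column costs a constant gap_penalty.
    """
    n, m = len(sequence), len(profile)
    # best[i][j] = best score for aligning sequence[:i] with profile[:j].
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = best[i - 1][0] + gap_penalty
    for j in range(1, m + 1):
        best[0][j] = best[0][j - 1] + gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            letter, column = sequence[i - 1], profile[j - 1]
            match = best[i - 1][j - 1] + column.get(letter, 0.0)
            skip_letter = best[i - 1][j] + gap_penalty   # letter against a gap
            skip_column = best[i][j - 1] + gap_penalty   # column against a gap
            best[i][j] = max(match, skip_letter, skip_column)
    return best[n][m]

# Hypothetical profile with four columns (symbol -> frequency).
profile = [
    {"A": 0.75, "T": 0.25},
    {"C": 0.75, "-": 0.25},
    {"G": 1.0},
    {"T": 0.75, "-": 0.25},
]
full_score = align_to_profile("ACGT", profile)
short_score = align_to_profile("AGT", profile)
```

The table has (n+1)(m+1) cells and each is filled in constant time, so the running time is proportional to the product of the sequence length and the profile length. Only the score is returned here; recovering the alignment itself would need the usual traceback through the table.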
The key in pairwise alignment is scoring two positions x and y:
$\sigma(x, y)$.
For a letter x and a column y of a
profile, let $\sigma(x, y)$
be the probability of x being in column y. The
value for x depends on the frequency of its occurrences in
the column y. We also need to devise a score for a letter aligned with a gap,
$\sigma(x, -)$.
In order to find whether a given sequence is a member of a certain
family, we use standard pairwise dynamic programming alignment
to compare the given sequence to the family profile.
Itshack Pe`er 1999-03-16