
Multiple Alignment

**Definition.** A *multiple alignment* of strings $S_1, S_2, \dots, S_k$ ... extension of $S_j$, obtained by insertion of blanks.

**Figure 3.3:** A multiple alignment of *ACBCBD*, *CADDB* and *ACABCD*.

We are interested in finding a common alignment of several sequences because such multiple similarity suggests a common structure of the protein product, a common function, or a common evolutionary source. A multiple alignment carries more information than a pairwise one, since a protein can be matched against a whole family of proteins rather than against a single other protein.

The best multiple alignment of $r$ sequences is calculated using an $r$-dimensional hyper-cube $D$, defining $D(j_1,j_2,\dots,j_r)$ to be the best (minimal) cost of aligning the prefixes of lengths $j_1,j_2,\dots,j_r$ of the sequences $x_1,x_2,\dots,x_r$, respectively.
We define

$$D(0,0,\dots,0) = 0$$

and we calculate

$$D(j_1,j_2,\dots,j_r) = \min_{\epsilon \in \{0,1\}^r,\, \epsilon \neq 0} \{ D(j_1-\epsilon_1,\dots,j_r-\epsilon_r) + \rho(\epsilon_1 x_{j_1},\dots,\epsilon_r x_{j_r}) \}$$

where $\rho$ is the cost function and $\epsilon_i x_{j_i}$ stands for the character $x_{j_i}$ when $\epsilon_i = 1$ and for a blank when $\epsilon_i = 0$. The size of the hyper-cube is $O(\prod^{r}_{j=1}n_j)$, where $n_j$ is the length of $x_j$, and the computation of each entry considers $2^r - 1$ other entries.
If $n_1=n_2=\dots=n_r=n$, the space complexity is $O(n^r)$ and the time complexity is $O(2^r n^r)$.
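The recurrence can be sketched directly in Python. This is a minimal illustration rather than the notes' own implementation: the function names, the dictionary used to store the hyper-cube, and the choice of $\rho$ (a sum-of-pairs column cost with unit mismatch and gap costs, blank against blank costing nothing) are all assumptions made for the example.

```python
from itertools import product

GAP = "-"  # symbol used for a blank (an assumption of this sketch)

def column_cost(column, mismatch=1, gap=1):
    """Assumed rho: sum-of-pairs cost of a single alignment column."""
    cost = 0
    for a in range(len(column)):
        for b in range(a + 1, len(column)):
            x, y = column[a], column[b]
            if x == GAP and y == GAP:
                continue              # two blanks cost nothing
            if x == GAP or y == GAP:
                cost += gap
            elif x != y:
                cost += mismatch
    return cost

def multiple_alignment_cost(seqs):
    """Return D(n_1, ..., n_r) by filling the entire hyper-cube."""
    r = len(seqs)
    # All non-zero epsilon vectors in {0,1}^r: 2^r - 1 predecessors per cell.
    steps = [e for e in product((0, 1), repeat=r) if any(e)]
    D = {(0,) * r: 0}
    # product() enumerates cells in lexicographic order, so every
    # predecessor of a cell has been filled before the cell itself.
    for cell in product(*(range(len(s) + 1) for s in seqs)):
        if cell in D:
            continue
        best = None
        for e in steps:
            prev = tuple(j - d for j, d in zip(cell, e))
            if min(prev) < 0:
                continue              # would step outside the cube
            column = [s[j - 1] if d else GAP
                      for s, j, d in zip(seqs, cell, e)]
            cand = D[prev] + column_cost(column)
            if best is None or cand < best:
                best = cand
        D[cell] = best
    return D[tuple(len(s) for s in seqs)]
```

For instance, `multiple_alignment_cost(["ACG", "AG", "ACG"])` evaluates all 48 cells of a 4 x 3 x 4 cube. The exponential growth in $r$ is visible in both the number of cells and the length of `steps`.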

Several useful measures are known for the divergence of a set of aligned strings, that is, for the total distance between them:

- **Distance from Consensus**: The consensus of an alignment is the string consisting of the most common character in each column of the alignment. The total distance between the strings is defined as the number of characters that differ from the consensus character of their column.
- **Evolutionary Distance**: The weight of the lightest evolutionary tree that can be constructed from the sequences, where the weight of a tree is the number of changes between pairs of sequences that correspond to two adjacent nodes in the tree, summed over all such pairs.
- **Sum of Pairs**: The sum of the pairwise distances between all pairs of sequences.
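Two of these measures are simple enough to sketch in a few lines of Python. The helper names are hypothetical, the rows are assumed to be already aligned (equal length, with `-` as the blank), and the pairwise distance is taken here to be the number of columns in which two rows differ, which is one possible choice among several. The evolutionary distance is omitted because it requires constructing a tree.

```python
from collections import Counter
from itertools import combinations

def consensus(rows):
    """Most common character of each column.

    Ties are broken in favour of the character seen first in the column
    (an arbitrary choice made for this sketch).
    """
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*rows))

def distance_from_consensus(rows):
    """Number of characters that differ from their column's consensus."""
    cons = consensus(rows)
    return sum(ch != c for row in rows for ch, c in zip(row, cons))

def sum_of_pairs(rows):
    """Sum over all pairs of rows of the number of columns where they differ."""
    return sum(sum(a != b for a, b in zip(x, y))
               for x, y in combinations(rows, 2))

# A made-up alignment of three short strings.
rows = ["AC-G", "ACTG", "A-TG"]
print(consensus(rows))                # ACTG
print(distance_from_consensus(rows))  # 2
print(sum_of_pairs(rows))             # 4
```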

Carrillo and Lipman [3] found a heuristic method for accelerating the search for the best multiple alignment. The method is based on the property that if the strings are relatively similar, the alignment path stays close to the main diagonal, so not all the values in the multi-dimensional cube need to be calculated. We now detail this algorithm.

Given an upper bound on the cost of the best alignment, we can discard alignments that are known a priori to be more expensive than this bound.

Let $A$ be an alignment of strings $x_1,x_2, \dots , x_r$. Denote by $A_{i,j}$ the pair of rows in $A$ containing only $x_i$ and $x_j$, and by $c(A_{i,j})$ the cost of this pairwise alignment. Denote by $c(A)$ the total cost of $A$, and suppose we define $c(A)=\sum_{i<j}c(A_{i,j})$. Let $A^*$ be the optimal alignment (the one with the minimal cost), and suppose we know that $c(A^*) \leq c'$. Therefore,

$$c' \geq c(A^*) = \sum_{i<j}c(A^{*}_{i,j}) = c(A^{*}_{u,v}) + \sum_{i<j \, (i,j)\neq(u,v)} c(A^{*}_{i,j}) \geq c(A^{*}_{u,v}) + \sum_{i<j \, (i,j)\neq(u,v)} D(x_i,x_j)$$

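where $D(x,y)$ is the optimal (minimal) cost of aligning the strings $x$ and $y$; the last inequality holds because no pairwise projection of $A^*$ can cost less than the optimal pairwise alignment of the same two strings. It follows that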

$$c(A^{*}_{u,v}) \leq c' - \sum_{i<j \, (i,j)\neq(u,v)} D(x_i,x_j)$$

$A^*_{u,v}$ is the projection of $A^*$ on the $uv$-plane. By calculating $D(x_i,x_j)$ for each $i$ and $j$, we can find $B(u,v) = c' - \sum_{i<j \, (i,j)\neq(u,v)} D(x_i,x_j)$.
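As an illustration, the bounds $B(u,v)$ can be computed from the pairwise optimal costs as follows. This is a sketch under assumptions: the function names are hypothetical, and the pairwise cost uses unit mismatch and gap costs, which the text does not prescribe. The upper bound $c'$ is passed in as `upper_bound`; in practice it could be the cost of any known multiple alignment.

```python
from itertools import combinations

def pairwise_cost(x, y, mismatch=1, gap=1):
    """Optimal pairwise alignment cost D(x, y), keeping two DP rows."""
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            sub = prev[j - 1] + (0 if x[i - 1] == y[j - 1] else mismatch)
            cur.append(min(sub, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]

def projection_bounds(seqs, upper_bound):
    """B(u, v) = c' - sum of D(x_i, x_j) over all pairs other than (u, v).

    Sequences are indexed from 0 here, and keys are pairs (u, v) with u < v.
    """
    D = {(i, j): pairwise_cost(seqs[i], seqs[j])
         for i, j in combinations(range(len(seqs)), 2)}
    total = sum(D.values())
    return {pair: upper_bound - (total - d) for pair, d in D.items()}

bounds = projection_bounds(["ACG", "AG", "ACG"], upper_bound=3)
print(bounds)  # {(0, 1): 2, (0, 2): 1, (1, 2): 2}
```

Computing the total once and subtracting each pair's own cost avoids re-summing the other pairs for every $(u,v)$.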

Now, consider a cell $(i_1 ,i_2 ,\dots ,i_u=s ,\dots ,i_v=t ,\dots ,i_r)$ whose projection to the $uv$-plane is $(s,t)$. If the best alignment $A^*$ passes through this cell, then its projection $A^*_{u,v}$ passes through $(s,t)$, and its cost $c(A^*_{u,v})$ satisfies $best^{(u,v)}_{s,t} \leq c(A^*_{u,v}) \leq B(u,v)$, where $best^{(u,v)}_{s,t}$ is the optimal cost of an alignment of $x_u$ and $x_v$ through $(s,t)$ in the $uv$-plane, and hence a lower bound on the cost of any such alignment. We can compute it as:

$$best^{(u,v)}_{i,j} = D(x_{u,1} x_{u,2} \dots x_{u,i-1} , x_{v,1} x_{v,2} \dots x_{v,j-1}) + d(x_{u,i} , x_{v,j}) + D(x_{u,i+1}\dots x_{u,n_u} , x_{v,j+1} \dots x_{v,n_v})$$

where $x_{u,i}$ denotes the $i$-th character of $x_u$ and $d(\kappa_1,\kappa_2)$ is the cost of matching the characters $\kappa_1$ and $\kappa_2$.

Therefore, if $best^{(u,v)}_{s,t} > B(u,v)$, then the best alignment $A^*$ cannot pass through the cell $(i_1 ,i_2 ,\dots ,i_u=s ,\dots ,i_v=t ,\dots ,i_r)$ for any $i_1,i_2,\dots,i_{u-1},i_{u+1},\dots,i_{v-1},i_{v+1},\dots,i_r$, and these cells can be discarded from the computation.
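The pruning test for a single pair $(u,v)$ can be sketched as below. The code transcribes the formula for $best^{(u,v)}_{i,j}$ literally: a prefix cost, the cost of matching the two characters, and a suffix cost obtained by running the same dynamic program on the reversed strings. The function names and the unit mismatch and gap costs are assumptions of this example, and `bound` plays the role of $B(u,v)$.

```python
def prefix_costs(x, y, mismatch=1, gap=1):
    """F[i][j] = optimal cost of aligning x[:i] with y[:j]."""
    F = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i == 0 and j == 0:
                continue
            options = []
            if i and j:
                d = 0 if x[i - 1] == y[j - 1] else mismatch
                options.append(F[i - 1][j - 1] + d)
            if i:
                options.append(F[i - 1][j] + gap)
            if j:
                options.append(F[i][j - 1] + gap)
            F[i][j] = min(options)
    return F

def best_through(x, y, mismatch=1, gap=1):
    """best[i][j] for 1 <= i <= len(x), 1 <= j <= len(y); row and column 0 unused."""
    n, m = len(x), len(y)
    F = prefix_costs(x, y, mismatch, gap)
    # R[a][b] = optimal cost of aligning the last a characters of x
    # with the last b characters of y.
    R = prefix_costs(x[::-1], y[::-1], mismatch, gap)
    best = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = 0 if x[i - 1] == y[j - 1] else mismatch
            best[i][j] = F[i - 1][j - 1] + d + R[n - i][m - j]
    return best

def surviving_cells(x, y, bound):
    """Projected cells (s, t) that are not discarded, i.e. best <= bound."""
    best = best_through(x, y)
    return {(s, t)
            for s in range(1, len(x) + 1)
            for t in range(1, len(y) + 1)
            if best[s][t] <= bound}

print(sorted(surviving_cells("ACG", "AG", bound=1)))  # [(1, 1), (3, 2)]
```

A full implementation would intersect the surviving sets of all pairs $(u,v)$ and fill only those hyper-cube cells whose every projection survives.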


Itshack Pe`er
1999-01-10