
   
The Center Star Method for (SP) Alignment

In this section, we present an approximation algorithm for multiple alignment under the sum-of-pairs (SP) metric (see e.g. [3] [pp 348-350]). The algorithm achieves an approximation ratio of two. Given a multiple alignment ${{\cal M}}$, let $d(S_i,S_j)$ be the score of the pairwise alignment it induces on $S_i$ and $S_j$. Our target function is the SP value of ${{\cal M}}$, namely $d({{\cal M}}) \equiv \sum_{i < j} d( S_i, S_j)$, the sum over all pairs of the score of the induced alignment.
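As an illustration of the SP value, the following Python sketch computes $d({{\cal M}})$ for a multiple alignment given as equal-length rows. The names `sp_score`, `induced_score` and `column_cost` are hypothetical, and the unit-cost column scheme (0 for identical symbols, including two spaces, and 1 otherwise) is an assumption made only for this example; the notes do not fix a particular scoring scheme.

```python
from itertools import combinations

GAP = "-"  # assumed symbol for a space in an aligned row


def column_cost(a, b):
    """Assumed unit-cost scheme: 0 if the two symbols are identical
    (this includes two spaces facing each other), 1 otherwise."""
    return 0 if a == b else 1


def induced_score(row_i, row_j):
    """Score d(S_i, S_j) of the pairwise alignment induced on two rows."""
    return sum(column_cost(a, b) for a, b in zip(row_i, row_j))


def sp_score(rows):
    """SP value: sum of induced pairwise scores over all pairs i < j."""
    if len({len(r) for r in rows}) > 1:
        raise ValueError("all rows of an alignment must have equal length")
    return sum(induced_score(r, s) for r, s in combinations(rows, 2))
```

Under these assumed costs, `sp_score(["AC-T", "A-GT", "ACGT"])` adds the three induced pairwise scores 2, 1 and 1, giving 4.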

Problem 5.1   The SP alignment problem.
INPUT: A set of sequences $\S$.
QUESTION: Compute a global multiple alignment ${{\cal M}}$ with minimum sum-of-pairs score.

We denote by $D(S,Y)$ the score of the optimal pairwise alignment between sequences $S$ and $Y$.
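One possible way to compute $D(S,Y)$ is the standard dynamic program for global pairwise alignment. The sketch below is only an illustration: the function name `pairwise_distance` and the default costs (mismatch 1, space 1, match 0) are assumptions, not part of the notes.

```python
def pairwise_distance(s, y, mismatch=1, gap=1):
    """Cost D(s, y) of an optimal global alignment, using O(len(y)) memory.

    Assumed costs: 0 for a match, `mismatch` for a substitution,
    `gap` for a symbol aligned against a space.
    """
    # prev[j] holds the optimal cost of aligning s[:i-1] with y[:j].
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(s) + 1):
        curr = [i * gap]  # aligning s[:i] with the empty prefix of y
        for j in range(1, len(y) + 1):
            substitute = prev[j - 1] + (0 if s[i - 1] == y[j - 1] else mismatch)
            delete = prev[j] + gap      # s[i-1] faces a space
            insert = curr[j - 1] + gap  # y[j-1] faces a space
            curr.append(min(substitute, delete, insert))
        prev = curr
    return prev[-1]
```

The running time is proportional to the product of the two sequence lengths, which matches the $O(n^2)$ cost per pair used in the running time analysis of this section.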
  
Figure 5.1: A generic center star for six strings, where the center string ($S_c$) is $S_3$.

\fbox{\epsfig{figure=lec05_figs/lec05_centerstar.eps}}





Definition 5.2   We say that a tree $T$ having the elements of $\S$ as its nodes induces a multiple alignment ${{\cal M}}(T)$ over $\S$ if, for each edge $(S,Y) \in E(T)$, the value of the alignment induced by ${{\cal M}}(T)$ on $S$ and $Y$ is $D(S, Y)$ (optimal). In such a case we write ${{\cal M}}= {{\cal M}}(T)$.

Given $\S = \left( { S_1, \ldots, S_k } \right)$, let $S_c \in \S$ denote an element of $\S$ for which $\sum_{i \neq c} D(S_i, S_c)$ is minimal. We refer to $S_c$ as the center of the alignment ${{\cal M}}_c$ defined next.
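The choice of the center can be sketched as follows, assuming the optimal pairwise scores have already been collected in a symmetric matrix `D` with `D[i][j]` standing for $D(S_i, S_j)$. The function name `choose_center` and the matrix representation are hypothetical; ties are broken here by taking the smallest index, which is one arbitrary choice among several.

```python
def choose_center(D):
    """Return an index c minimizing sum over i != c of D[i][c].

    D is assumed to be a symmetric k x k matrix of optimal pairwise
    alignment scores. On ties, min() keeps the first (smallest) index.
    """
    k = len(D)
    return min(range(k), key=lambda c: sum(D[i][c] for i in range(k) if i != c))
```

For example, with `D = [[0, 3, 4], [3, 0, 2], [4, 2, 0]]` the three candidate sums are 7, 5 and 6, so the second sequence (index 1) is chosen.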

Definition 5.3   We define the center star, $T_c$, to be a star tree of $k$ nodes, with the center node labeled $S_c$ and with each of the $k-1$ remaining nodes labeled by a distinct sequence in $\S \setminus \left\{ { S_c } \right\}$ (see Figure 5.1). The multiple alignment ${{\cal M}}_c$ of $\S$ is the multiple alignment induced by the center star, i.e., for each $v \neq c$, the alignment ${{\cal M}}_c$ induces an optimal pairwise alignment between $S_c$ and $S_v$.

The Center Star Algorithm:
1. Find $S_c \in \S$ minimizing $\sum_{i \neq c} D(S_i, S_c)$ and let ${{\cal M}}= \left\{ {S_c} \right\}$.
2. Add the sequences in $\S \setminus \left\{ {S_c} \right\}$ to ${{\cal M}}$ one by one, so that the alignment of every newly added sequence with $S_c$ is optimal. Whenever the new pairwise alignment places a space in $S_c$, add a space at that position to all previously aligned sequences.
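The two steps above can be sketched in Python as follows. This is a minimal illustration under assumed unit costs (match 0, mismatch 1, space 1), not a definitive implementation; the names `align` and `center_star` are hypothetical. Spaces already present in the center row are never removed, and each space required by a new pairwise alignment is inserted as a fresh column in all earlier rows.

```python
GAP = "-"  # assumed symbol for a space


def align(s, y):
    """Optimal global alignment under assumed unit costs.
    Returns (cost, aligned_s, aligned_y)."""
    n, m = len(s), len(y)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        T[i][0] = i
    for j in range(1, m + 1):
        T[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = min(T[i - 1][j - 1] + (s[i - 1] != y[j - 1]),
                          T[i - 1][j] + 1,
                          T[i][j - 1] + 1)
    # Trace back one optimal path from the bottom-right corner.
    a, b, i, j = [], [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and T[i][j] == T[i - 1][j - 1] + (s[i - 1] != y[j - 1]):
            a.append(s[i - 1]); b.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and T[i][j] == T[i - 1][j] + 1:
            a.append(s[i - 1]); b.append(GAP); i -= 1
        else:
            a.append(GAP); b.append(y[j - 1]); j -= 1
    return T[n][m], "".join(reversed(a)), "".join(reversed(b))


def center_star(seqs):
    """Return (c, rows): the center index and the aligned rows in input order."""
    k = len(seqs)
    # Step 1: all pairwise optimal scores, then the center.
    D = [[align(a, b)[0] for b in seqs] for a in seqs]
    c = min(range(k), key=lambda t: sum(D[i][t] for i in range(k)))
    # Step 2: add the other sequences one by one; rows[0] is the center row.
    order, rows = [c], [list(seqs[c])]
    for v in range(k):
        if v == c:
            continue
        _, ca, ya = align(seqs[c], seqs[v])
        master, merged, new_row = rows[0], [[] for _ in rows], []
        p = q = 0
        while p < len(master) or q < len(ca):
            if p < len(master) and master[p] == GAP:
                # Existing space column of the center: keep it, pad the new row.
                for r, out in zip(rows, merged):
                    out.append(r[p])
                new_row.append(GAP); p += 1
            elif q < len(ca) and ca[q] == GAP:
                # New pairwise alignment puts a space in the center:
                # insert a space column into every previously aligned row.
                for out in merged:
                    out.append(GAP)
                new_row.append(ya[q]); q += 1
            else:
                # Same center character on both sides: copy the column.
                for r, out in zip(rows, merged):
                    out.append(r[p])
                new_row.append(ya[q]); p += 1; q += 1
        rows = merged + [new_row]
        order.append(v)
    result = [None] * k
    for idx, row in zip(order, rows):
        result[idx] = "".join(row)
    return c, result
```

By construction, the pairwise alignment that the result induces between the center and any other row has exactly the optimal pairwise cost returned by `align`, which is the property required of the center star alignment.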
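Running time analysis (assuming each sequence has length at most $n$):
1. ${k \choose 2} \cdot O(n^2) = O(k^2 \cdot n^2)$ for step 1, since an optimal pairwise alignment is computed for each of the ${k \choose 2}$ pairs.
2. $\sum_{i=1}^{k-1} O((i \cdot n) \cdot n) = O(k^2 \cdot n^2)$ for step 2, since the worst-case length of the center string with spaces, $S'_c$, after the addition of $i$ strings is $(i + 1) \cdot n$.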

Lemma 5.2   For $1 \leq i,j \leq k$, $i \ne j$, it holds that $d(S_i, S_j) \leq D(S_i, S_c) + D(S_c, S_j)$, where $d$ denotes the pairwise scores induced by ${{\cal M}}_c$.

Proof: Since the triangle inequality holds for every single column of the alignment by the definition of the scoring scheme, it also holds for entire strings by the definition of $d$. Therefore $d(S_i, S_j) \leq d(S_i, S_c) + d(S_c, S_j)$. But $(S_i, S_c)$ and $(S_c, S_j)$ are edges of $T_c$, thus $d(S_i, S_c) = D(S_i, S_c)$ and $d(S_c, S_j) = D(S_c, S_j)$. It follows that $d(S_i, S_j) \leq D(S_i, S_c) + D(S_c, S_j)$.

Let ${{\cal M}}^{*}$ denote the optimal alignment of $\S$. Let $d^{*}(S_i, S_j)$ denote the value of the alignment between $S_i$ and $S_j$ induced by ${{\cal M}}^{*}$.

Theorem 5.3  

\begin{displaymath}\frac{d({{\cal M}}_c)}{d({{\cal M}}^*)} \leq \frac{2(k-1)}{k} < 2.\end{displaymath}

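Proof: By Lemma 5.2 it follows that

\begin{displaymath}2d({{\cal M}}_c) = \sum_{i \ne j} d( S_i, S_j ) \leq \sum_{i \ne j} \left( { D(S_i, S_c) + D(S_c, S_j)} \right) = 2 ( k - 1) \sum_{i \ne c} D(S_c, S_i)
\end{displaymath} (5.1)

The last equality holds because each of the terms $D(S_i, S_c)$ and $D(S_c, S_j)$ appears $k-1$ times in the double sum, once for every choice of the other index, and $D(S_c, S_c) = 0$.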

We define $W$ to be $\sum_{i \ne c} D(S_c,S_i)$ and we get:

\begin{displaymath}2d({{\cal M}}_c) \leq 2 ( k - 1)W
\end{displaymath} (5.2)

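On the other hand, since the alignment that ${{\cal M}}^{*}$ induces on any pair scores at least the optimal pairwise score, and by the choice of $c$, it follows that: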

\begin{displaymath}2d({{\cal M}}^*) = \sum_{i \ne j} d^{*}(S_i, S_j) \geq \sum_{i \ne j} D(S_i, S_j) = \sum_{i} \sum_{j \ne i } D(S_i, S_j ) \geq \sum_{i} W = k W.
\end{displaymath} (5.3)

Finally, combining (5.2) and (5.3):

\begin{displaymath}\frac{d({{\cal M}}_c)}{d({{\cal M}}^{*})} \leq \frac{2(k-1)W}{kW} = \frac{2(k-1)}{k}.
\end{displaymath} (5.4)

Theorem 5.3 implies that the multiple alignment of the center star has a value which is at most $R_k = \frac{2(k-1)}{k}$ times the value of the optimal alignment. For example, $R_3 = \frac{4}{3}$ and $R_4 = \frac{3}{2}$.
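The bound $R_k$ can be tabulated exactly with a few lines of Python; this is only a convenience sketch, and the function name `ratio_bound` is hypothetical.

```python
from fractions import Fraction


def ratio_bound(k):
    """Exact value of R_k = 2(k-1)/k for k >= 1 sequences."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    return Fraction(2 * (k - 1), k)


# R_k increases with k but stays strictly below 2.
bounds = {k: ratio_bound(k) for k in (2, 3, 4, 10, 100)}
```

For two sequences the bound is 1, as expected, since the center star alignment is then simply an optimal pairwise alignment.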
Itshack Pe`er
1999-03-16