
   
Motivation

The concept of a gap in an alignment is important in many biological applications, because the insertion or deletion of an entire substring often occurs as a single mutational event. Moreover, such single mutational events can create gaps of widely varying sizes. At the protein level, two protein sequences might be relatively similar over several intervals but differ in intervals where one contains a protein subunit that the other does not.

One concrete illustration of the use of gaps in the alignment model comes from the problem of cDNA matching [2] (chapter 11). In this problem, one string is much longer than the other, and the alignment best reflecting their relationship should consist of a few regions of very high similarity interspersed with 'long' gaps in the shorter string. Note that the matching regions can have mismatches and spaces, but these should amount to only a small fraction of each region.

An RNA molecule is transcribed from the DNA of the gene. That RNA transcript is a complement of the DNA in the gene, in that each A in the gene is replaced by U in the RNA, each T by A, each C by G, and each G by C. Moreover, the RNA transcript covers the entire gene, introns as well as exons. Then, in a process that is not completely understood, each intron-exon boundary in the transcript is located, the RNA corresponding to the introns is spliced out, and the RNA regions corresponding to exons are concatenated. Additional processing occurs. The resulting RNA molecule is called messenger RNA (mRNA); it leaves the cell nucleus and is used to create the protein it encodes.
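The transcription and splicing steps described above can be sketched in a few lines of Python. This is only a toy illustration: the function names, the example gene, and the exon coordinates are hypothetical, strand orientation is ignored, and the exon boundaries are simply given as input, whereas the cell locates them by the process noted above as not completely understood.

```python
# Complement rule stated in the text: DNA base -> RNA base.
DNA_TO_RNA = {"A": "U", "T": "A", "C": "G", "G": "C"}


def transcribe(gene):
    """Return the RNA transcript of the whole gene (introns and exons)."""
    return "".join(DNA_TO_RNA[base] for base in gene)


def splice(transcript, exons):
    """Concatenate the exon regions of the transcript.

    `exons` is a list of (start, end) index pairs, end exclusive.
    Everything outside these pairs is treated as intron and dropped.
    """
    return "".join(transcript[start:end] for start, end in exons)


# Hypothetical toy gene: exon "ATGC", intron "GGTTAG", exon "CCAT".
gene = "ATGC" + "GGTTAG" + "CCAT"
exons = [(0, 4), (10, 14)]

transcript = transcribe(gene)      # covers the entire gene
mrna = splice(transcript, exons)   # introns removed, exons concatenated

print(transcript)  # UACGCCAAUCGGUA
print(mrna)        # UACGGGUA
```

The transcript has the same length as the gene, while the mRNA is shorter by exactly the length of the removed intron.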

Each cell (usually) contains a copy of all the chromosomes, and hence of all the genes, of the entire individual, yet in each specialized cell (a liver cell, for example) only a small fraction of the genes are expressed. That is, only a small fraction of the proteins encoded in the genome are actually produced in that specialized cell. A standard method to determine which proteins are expressed in the specialized cell line, and to hunt for the location of the encoding genes, involves capturing the mRNA in that cell after it leaves the cell nucleus. That mRNA is then used to create a DNA sequence complementary to it. This sequence is called cDNA (complementary DNA). Compared to the original gene, the cDNA sequence consists only of the concatenation of the exons in the gene. After the cDNA is obtained, the problem is to determine where the gene associated with that cDNA resides. This becomes a problem of aligning the cDNA sequence against the longer DNA sequence in a way that reveals the exons.
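To make the alignment problem concrete, here is a minimal Python sketch under strong simplifying assumptions that are not part of the text: exons must match the gene exactly (no mismatches or spaces inside a matching region), strand orientation is ignored, and every gap in the cDNA costs one unit regardless of its length. The names `make_cdna` and `align_cdna` and the toy sequences are hypothetical. The sketch picks the alignment with the fewest gaps, so long gaps line up with introns.

```python
INF = float("inf")

# Copying mRNA back into DNA: each RNA base is replaced by its complement.
RNA_TO_DNA = {"U": "A", "A": "T", "G": "C", "C": "G"}


def make_cdna(mrna):
    """Return the DNA sequence complementary to the mRNA."""
    return "".join(RNA_TO_DNA[base] for base in mrna)


def align_cdna(cdna, gene):
    """Embed cdna in gene using exact matches and as few gaps as possible.

    Returns (gap_count, gapped_cdna), where '-' marks gene positions
    skipped by the cDNA, or None if no exact embedding exists.
    """
    m, n = len(cdna), len(gene)
    # match[i][j]: fewest gaps for cdna[:i] vs gene[:j], ending in a match.
    # skip[i][j]:  same, but gene[j-1] lies inside a gap.
    match = [[INF] * (n + 1) for _ in range(m + 1)]
    skip = [[INF] * (n + 1) for _ in range(m + 1)]
    match[0][0] = 0
    for i in range(m + 1):
        for j in range(1, n + 1):
            # Extend the current gap for free, or open a new gap at cost 1.
            skip[i][j] = min(skip[i][j - 1], match[i][j - 1] + 1)
            if i > 0 and cdna[i - 1] == gene[j - 1]:
                match[i][j] = min(match[i - 1][j - 1], skip[i - 1][j - 1])
    best = min(match[m][n], skip[m][n])
    if best == INF:
        return None
    # Trace back from the end to recover the gapped cDNA row.
    row, i, j = [], m, n
    in_match = match[m][n] <= skip[m][n]
    while j > 0:
        if in_match:
            row.append(cdna[i - 1])
            in_match = match[i - 1][j - 1] <= skip[i - 1][j - 1]
            i, j = i - 1, j - 1
        else:
            row.append("-")
            in_match = skip[i][j - 1] > match[i][j - 1] + 1
            j -= 1
    return best, "".join(reversed(row))


# Hypothetical toy data: exon "ATGC", intron "GGTTAG", exon "CCAT".
gene = "ATGCGGTTAGCCAT"
cdna = make_cdna("UACGGGUA")

print(cdna)                    # ATGCCCAT
print(align_cdna(cdna, gene))  # (1, 'ATGC------CCAT')
```

The single long gap in the output marks the intron, and the two matched blocks are the exons. A realistic cDNA matcher would also have to tolerate the small fraction of mismatches and spaces inside matching regions that the text allows.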


Itshack Pe`er
1999-01-03