Gene Structure in Eukaryotes
The structure of genes and the mechanism of gene expression are far
more complicated in eukaryotes than in prokaryotes. In a typical
eukaryote, the region of the DNA coding for a protein is usually not
contiguous: it is composed of alternating stretches of exons and
introns. During transcription, both exons and introns are transcribed
into RNA, in their linear order. Thereafter, a process called
splicing takes place, in which the intron sequences are excised
and discarded from the RNA sequence. The remaining RNA segments,
the ones corresponding to the exons, are ligated to form the mature
RNA strand.
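The splicing step can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the exon coordinates are taken as given (finding them is the actual gene-finding problem), and the function name splice, the toy transcript and its coordinates are all hypothetical, not taken from real data.

```python
def splice(pre_mrna, exons):
    """Join the exon segments of a transcript, discarding the introns.

    exons is a list of (start, end) pairs, 0-based and half-open,
    sorted by position and non-overlapping (an assumed convention).
    """
    pieces = []
    previous_end = 0
    for start, end in exons:
        # Reject empty, overlapping, unsorted or out-of-range exons.
        if start < previous_end or start >= end or end > len(pre_mrna):
            raise ValueError(f"bad exon coordinates: {(start, end)}")
        pieces.append(pre_mrna[start:end])
        previous_end = end
    # Ligation: the retained segments are concatenated in linear order.
    return "".join(pieces)


# Toy transcript: exon, intron, exon (made-up sequence).
toy_transcript = "AUGGCC" + "GUAAGUUUUCAG" + "GAAUAA"
toy_exon_coords = [(0, 6), (18, 24)]
mature_rna = splice(toy_transcript, toy_exon_coords)
print(mature_rna)
```

On the toy input, the 12-base intron in the middle is dropped and the two 6-base exons are joined.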
A typical multi-exon gene has the following structure (as
illustrated in figure 7.5). It starts with the promoter region,
which is followed by a transcribed but non-coding region called
the 5' untranslated region (5' UTR). Then follows the initial exon,
which contains the start codon. Following the initial exon, there
is an alternating series of introns and internal exons, followed
by the terminal exon, which contains the stop codon. It is followed
by another non-coding region called the 3' UTR. The eukaryotic gene
ends with a polyadenylation (polyA) signal, which marks where a
tail of repeated adenine nucleotides is added to the transcript.
The exon-intron boundaries (i.e., the splice sites) are signalled
by specific short (2bp long) sequences. The 5' end of an intron
(equivalently, the 3' end of the preceding exon) is called the
donor site, and the 3' end of an intron (the 5' end of the
following exon) is called the acceptor site.
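To make the donor and acceptor definitions concrete, the following minimal sketch derives the introns from a list of exon coordinates and reads off the 2bp at each intron end. The helper names, the coordinate convention (0-based, half-open) and the toy sequence are assumptions made for illustration. The toy intron begins with GT and ends with AG, the dinucleotides most commonly reported at these positions; the text above does not name them, and real genes show exceptions.

```python
def introns_between(exons):
    """Return the intron intervals lying between consecutive exons."""
    return [(exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1)]


def splice_site_dinucleotides(dna, exons):
    """For each intron, return (donor, acceptor): its first and last 2bp."""
    sites = []
    for start, end in introns_between(exons):
        donor = dna[start:start + 2]   # 5' end of the intron
        acceptor = dna[end - 2:end]    # 3' end of the intron
        sites.append((donor, acceptor))
    return sites


# Toy gene: initial exon, one intron, terminal exon (made-up sequence).
toy_gene = "ATGGCC" + "GTAAGTTTTCAG" + "GAATAA"
toy_gene_exons = [(0, 6), (18, 24)]
toy_sites = splice_site_dinucleotides(toy_gene, toy_gene_exons)
print(toy_sites)
```

A gene finder cannot rely on such 2bp signals alone, since the same dinucleotides occur by chance throughout the sequence.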
The problem of gene identification in eukaryotes is complicated
by the vast variation found in gene structure. To appreciate
this, we shall consider some
statistics from the available genomic data. On average, a
vertebrate gene is around 30Kb long, out of which the coding
region is only about 1Kb long. The average coding region consists
of 6 exons, each about 150bp long. Huge deviations from the
average are observed. For example, the gene called
dystrophin is 2.4Mb long. The gene for blood coagulation factor
VIII has 26 exons whose sizes vary from 69bp to 3106bp; the gene
as a whole spans around 186Kb, and its introns are as long as
32.4Kb. Intron number 22 produces 2 transcripts
unrelated to this gene, one for each strand. An average 5' UTR is
750bp long, but it can be longer and span several exons (for
example, in the MAGE family). On average, the 3' UTR is about
450bp long, but examples exist where its length exceeds 5Kb (e.g.,
the gene for Kallmann syndrome).
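The averages quoted above can be checked against each other with a short calculation. The sketch below uses only the figures given in the text (a 30Kb gene with 6 exons of about 150bp each); the variable names are illustrative.

```python
# Average figures for a vertebrate gene, as quoted in the text.
AVG_GENE_LENGTH_BP = 30_000
AVG_EXON_COUNT = 6
AVG_EXON_LENGTH_BP = 150

# Total coding length implied by the exon count and exon size.
coding_bp = AVG_EXON_COUNT * AVG_EXON_LENGTH_BP

# Share of the gene that is coding, and the remainder
# (introns, UTRs and other non-coding sequence).
coding_fraction = coding_bp / AVG_GENE_LENGTH_BP
noncoding_bp = AVG_GENE_LENGTH_BP - coding_bp

print(f"coding: {coding_bp}bp ({coding_fraction:.1%} of the gene)")
print(f"non-coding: {noncoding_bp}bp")
```

Six exons of 150bp give 900bp, which matches the quoted coding length of about 1Kb and amounts to roughly 3% of an average gene. A gene finder therefore has to locate a few short coding segments within a much larger non-coding background.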
Itshack Pe`er
1999-02-03