DNA vs. Protein Searches



If we have a coding nucleotide sequence, we can translate it into a protein sequence. (The other direction is, of course, ambiguous, because the genetic code is degenerate: most amino acids are encoded by more than one codon.) So, if we have a nucleotide sequence, should we search only the DNA databases? Or should we translate it to protein and search the protein databases?
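As a minimal sketch of why translation is well defined while the reverse direction is ambiguous, the following Python snippet builds the standard genetic code and translates two made-up coding sequences. The names `translate` and `identity`, and the two example sequences, are illustrative assumptions and not part of the original notes.

```python
from itertools import product

# Standard genetic code, written in the conventional T, C, A, G codon order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence in frame 1, ignoring a trailing partial codon."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def identity(a, b):
    """Fraction of identical positions between two equal-length sequences."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two hypothetical coding sequences that differ at many positions...
seq1 = "ATGCTTTCTCGTGGT"
seq2 = "ATGTTAAGCAGAGGC"

# ...but translate to the same peptide, because the code is degenerate.
print(translate(seq1), translate(seq2))
print("DNA identity:", identity(seq1, seq2))

# Degeneracy: how many codons encode leucine (L)?
print("Leucine codons:", sum(aa == "L" for aa in CODON_TABLE.values()))
```

Here the DNA sequences agree at fewer than half of their positions, yet the translated peptides are identical. Going from protein back to DNA would require choosing among several codons at most positions.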
Usually, we should use proteins for database similarity searches when possible.
The reasons for this conclusion are:

Very different DNA sequences can code for similar protein sequences. We certainly do not want to miss those.
When comparing DNA sequences, we get significantly more random matches than we get with proteins. There are several reasons for that:
- DNA is composed of 4 characters: A, G, C, T. Hence, two unrelated DNA sequences are expected to show about 25% identity (assuming the four bases are equally frequent).
- In contrast, a protein sequence is composed of 20 characters (amino acids), so chance matches are far less likely and the sensitivity of the comparison is improved. It is generally accepted that convergence of proteins is rare, so high similarity between two proteins almost always indicates homology.
- The DNA databases are much larger, and grow faster, than the protein databases. Bigger databases mean more random hits!
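For DNA we usually use identity matrices, while for proteins we use more sensitive substitution matrices such as PAM and BLOSUM, which score each amino acid replacement according to how often it is observed in related proteins. This allows better search results.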
Protein sequences change more slowly during evolution than the DNA that encodes them. Because of this conservation, searching with proteins can reveal remote evolutionary relationships.
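The random-match argument above can be checked numerically. The sketch below computes the expected identity between two unrelated sequences under the simplifying assumption that residues are independent and uniformly distributed, and compares it with a small simulation. The function names and the uniform-composition assumption are illustrative; real databases have skewed compositions, so actual background levels are somewhat higher.

```python
import random

DNA_ALPHABET = "ACGT"
PROTEIN_ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def expected_identity(frequencies):
    """Chance that two independent residues match: sum of squared frequencies."""
    return sum(p * p for p in frequencies.values())


def uniform_frequencies(alphabet):
    """Assumed (not measured) composition: every residue equally likely."""
    return {symbol: 1.0 / len(alphabet) for symbol in alphabet}


def simulated_identity(alphabet, length, rng):
    """Fraction of matching positions between two random, ungapped sequences."""
    a = rng.choices(alphabet, k=length)
    b = rng.choices(alphabet, k=length)
    return sum(x == y for x, y in zip(a, b)) / length


rng = random.Random(0)  # fixed seed so the run is repeatable
for name, alphabet in (("DNA", DNA_ALPHABET), ("protein", PROTEIN_ALPHABET)):
    theory = expected_identity(uniform_frequencies(alphabet))
    observed = simulated_identity(alphabet, 100_000, rng)
    print(f"{name}: expected {theory:.3f}, simulated {observed:.3f}")
```

Under these assumptions the background identity is 1/4 for DNA and 1/20 for proteins, so a given level of identity is much less likely to arise by chance in a protein comparison.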


Itshack Pe'er
1999-01-17