
   
Clustering Real cDNA Data

The clustering was tested on real cDNA data. The input contained 2329 cDNAs originating from 18 genes. The true clustering, obtained by hybridization with long, unique sequences, is given in table 12.1.
 
Table 12.1: True clusters
 
Cluster   Cluster size   Gene name
T18       709            Ef1 alpha
T17       285            clone 190B1
T16       284            Cytochr c oxi
T15       213            tubulin beta
T14       187            40SRibo protS6
T13       146            40SRibo protS3
T12       108            40SRibo protS4
T11        91            GAPDH
T10        86            60SRibo protL4
T9         67            Ef1 beta
T8         43            Human calmodulin
T7         39            heat shock cogKD71
T6         32            heat shock cogKD90
T5         14            Human TNF recep
T4         12            Human AEBP1
T3         10            clone 244D14
T2          2            clone 241F17
T1          1            Human anion ch
 

The abundance of the genes is highly variable: the true cluster sizes range from a single cDNA to 709. The results of the test are summarized in figure 12.7. In 14 of the 17 clusters generated by the algorithm, over 92% of the entities belong to the same gene (true cluster). Such clusters are called almost pure. In real experiments the correct clusters are not known, and the main goal of cDNA clustering is to avoid repeated sequencing of cDNAs originating from the same gene. The following strategy can therefore be used: from each cluster, up to 10 cDNAs are picked at random and sequenced. If, for example, 9 out of 10 give the same sequence, the cluster is almost pure with high certainty, and no further sequencing of its members is needed. Otherwise, all the members of the cluster are sequenced. This strategy may save about 75% of the sequencing cost.
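The purity measure and the sampling strategy described above can be illustrated with a minimal Python sketch. It is not the code used in the experiment. The function names, the `sequence` callable (a stand-in for actually sequencing a cDNA) and the toy clusters are all assumptions made for illustration; only the sample size of 10 and the 9-out-of-10 rule come from the text.

```python
import random
from collections import Counter


def purity(true_genes):
    """Fraction of a cluster's members that belong to its most common gene."""
    counts = Counter(true_genes)
    return counts.most_common(1)[0][1] / len(true_genes)


def sequencing_cost(clusters, sequence, sample_size=10, min_agree=9, rng=None):
    """Count how many cDNAs are sequenced under the sampling strategy.

    clusters : list of lists of cDNA identifiers (output of a clustering).
    sequence : hypothetical callable, cDNA identifier -> sequence label.
    """
    rng = rng or random.Random()
    total = 0
    for members in clusters:
        k = min(sample_size, len(members))
        if k == len(members):
            # Small cluster: the "sample" is the whole cluster.
            total += k
            continue
        sample = rng.sample(members, k)
        agree = Counter(sequence(c) for c in sample).most_common(1)[0][1]
        if agree >= min_agree:
            # Judged almost pure: only the sample is sequenced.
            total += k
        else:
            # Judged mixed: every member is sequenced.
            total += len(members)
    return total


# Toy data (assumed, not from the experiment).
gene_of = {}
pure = [f"p{i}" for i in range(100)]
gene_of.update({c: "geneA" for c in pure})
mixed = [f"m{i}" for i in range(20)]
gene_of.update({c: f"gene{i % 5}" for i, c in enumerate(mixed)})
small = [f"s{i}" for i in range(5)]
gene_of.update({c: "geneB" for c in small})

clusters = [pure, mixed, small]
cost = sequencing_cost(clusters, gene_of.get)
naive = sum(len(c) for c in clusters)
print(f"sequenced {cost} of {naive} cDNAs")
```

In this toy input the large pure cluster costs only its 10-member sample, while the mixed cluster, in which no gene has more than 4 members, is always sequenced in full. The saving on real data depends on the cluster size distribution and on how many clusters pass the check.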
  
Figure 12.7: Clustering results on real cDNA data. A: The binarized similarity matrix. A black point appears at position (i,j) iff $S_{ij}\geq 110$. B: Reordering of A according to the true clustering. cDNAs from the same true cluster appear consecutively, and the black lines are the borders between the different clusters. C: Reordering of A according to the clustering produced by the HCC algorithm. Clusters appear in the order of detection. D: Comparison of the algorithmic solution and the true solution. Rows and columns are ordered as in B. Position (i,j) is black iff the algorithm put i and j in the same cluster.

\fbox{\epsfig{figure=lec12_fig/lec12_realdata.ps,width=15cm}}
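The four panels of the figure can be reproduced in outline with a few list operations. The following is a minimal Python sketch under assumptions: the similarity matrix is taken to be a square list of lists, cluster labels are taken to be sortable values (for panel C, the detection index), and all function names are hypothetical. Only the threshold of 110 comes from the caption.

```python
def binarize(similarity, threshold=110):
    """Panel A: entry (i,j) is 1 iff similarity[i][j] >= threshold."""
    return [[1 if s >= threshold else 0 for s in row] for row in similarity]


def order_by_clusters(labels):
    """Indices arranged so that items with the same label are consecutive."""
    return sorted(range(len(labels)), key=lambda i: labels[i])


def reorder(matrix, order):
    """Panels B and C: apply one permutation to both rows and columns."""
    return [[matrix[i][j] for j in order] for i in order]


def co_clustering(labels):
    """Panel D (before reordering): 1 iff i and j share a cluster label."""
    return [[1 if a == b else 0 for b in labels] for a in labels]


# Toy 3x3 similarity matrix (assumed values, not from the experiment).
S = [[200, 50, 150],
     [50, 200, 60],
     [150, 60, 200]]
true_labels = ["T2", "T1", "T2"]

A = binarize(S)
order = order_by_clusters(true_labels)
B = reorder(A, order)
for row in B:
    print(row)
```

Passing the true labels to `order_by_clusters` gives the ordering of panel B, and passing the algorithm's labels gives that of panel C. When the clustering is good, the reordered matrix is close to block diagonal.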





Itshack Pe`er
1999-03-16