In any living cell that undergoes a biological process, different subsets of its genes are expressed at different stages of the process. The particular genes expressed at a given stage and their relative abundance are crucial to the cell's proper function. Analysis of gene expression patterns can provide insight into gene/function relationships, the effects of experimental treatments for diseases, and many other molecular biological processes.
Clustering techniques applied to gene expression data partition genes into groups (clusters) based on their expression patterns. Genes in the same cluster should have similar expression patterns, while genes in different clusters should have distinct, well-separated expression patterns. We will see several examples of gene expression clustering in section 12.1.4.
Gene expression data can be represented by a real-valued expression matrix $I$, where $I_{i,j}$ is the measured expression level of gene $i$ in experiment $j$. Experiments can be different time points, different body tissues or different strains of the organism. The $i$-th row of the expression matrix is called the expression pattern or the transcriptor for gene $i$.
The similarity between the expression patterns of any two genes is represented by a similarity matrix $S$, where $S_{i,j}$ is the similarity level between the expression patterns of gene $i$ and gene $j$. The similarity matrix can be derived from the expression matrix by applying some similarity measure to the expression patterns of every pair of genes. Examples of such measures are Pearson correlation and Euclidean distance (the latter is a dissimilarity: smaller values indicate more similar patterns).
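The derivation of a similarity matrix from an expression matrix can be sketched as follows. This is a minimal illustration in plain Python, not a prescribed implementation: the function names (`pearson`, `euclidean`, `similarity_matrix`) and the toy data are hypothetical, and treating a constant expression pattern as uncorrelated is an assumption made here to avoid division by zero.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length expression patterns."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant pattern is treated as uncorrelated.
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def euclidean(x, y):
    """Euclidean distance; smaller values mean more similar patterns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity_matrix(expr, measure):
    """Apply `measure` to the rows (expression patterns) of every gene pair."""
    n = len(expr)
    return [[measure(expr[i], expr[j]) for j in range(n)] for i in range(n)]

# Toy expression matrix: 3 genes (rows) x 4 experiments (columns).
I = [
    [1.0, 2.0, 3.0, 4.0],  # gene 0: rising
    [2.0, 4.0, 6.0, 8.0],  # gene 1: rising, larger amplitude
    [4.0, 3.0, 2.0, 1.0],  # gene 2: falling
]

S = similarity_matrix(I, pearson)
D = similarity_matrix(I, euclidean)
```

With this toy data, genes 0 and 1 are perfectly correlated although their Euclidean distance is nonzero, which illustrates that the choice of measure determines whether amplitude or only the shape of the pattern matters.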
The similarity matrix can be further transformed into a similarity graph $G = (V, E)$, whose vertices are genes and $(i, j) \in E$ iff $S_{i,j}$ exceeds a chosen threshold.
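A minimal sketch of building a similarity graph from a similarity matrix, assuming that an edge joins genes $i$ and $j$ exactly when $S_{i,j}$ is strictly greater than a threshold. The function name `similarity_graph`, the adjacency-set representation, the threshold value and the example matrix are all hypothetical choices made for illustration.

```python
def similarity_graph(S, threshold):
    """Return an undirected graph as a dict: gene index -> set of neighbours.

    Assumption: genes i and j (i != j) are adjacent iff S[i][j] > threshold.
    """
    n = len(S)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if S[i][j] > threshold:
                adj[i].add(j)
                adj[j].add(i)
    return adj

# Hypothetical similarity matrix for 3 genes (e.g. correlation values).
S = [
    [1.0, 0.9, -0.8],
    [0.9, 1.0, -0.7],
    [-0.8, -0.7, 1.0],
]

G = similarity_graph(S, threshold=0.5)
```

Here genes 0 and 1 become adjacent, while gene 2 remains an isolated vertex. If a distance such as the Euclidean distance is used instead of a similarity, the comparison would have to be reversed so that small distances produce edges.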