Synthetic Data


BioClust is tested on members of $\Omega(H,p)$. The simulation procedure is as follows (see Figure 12.3 for a visualization):

1. Generate G from H by independently removing each edge in H with probability p and adding each edge not in H with probability p.
2. Randomly permute the order of the vertices in G and run BioClust with affinity threshold $\tau = 0.5$.
3. Compare BioClust's output to the original clique graph H.
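
The data-generation part of this procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the function names `clique_graph`, `contaminate` and `permute_vertices` are hypothetical, NumPy is assumed, and the clustering step itself is not shown. Removing each edge with probability p and adding each non-edge with probability p amounts to flipping each vertex pair independently with probability p.

```python
import numpy as np


def clique_graph(n, fractions):
    """Build the adjacency matrix of a clique graph H on n vertices.

    `fractions` lists relative cluster sizes, e.g. [0.4, 0.2, 0.1, 0.1, 0.1, 0.1].
    Returns (H, labels), where labels[i] is the cluster index of vertex i.
    """
    sizes = np.rint(np.asarray(fractions, dtype=float) * n).astype(int)
    sizes[0] += n - sizes.sum()  # absorb any rounding drift in the first cluster
    labels = np.repeat(np.arange(len(sizes)), sizes)
    H = (labels[:, None] == labels[None, :]).astype(int)
    np.fill_diagonal(H, 0)  # no self-loops
    return H, labels


def contaminate(H, p, rng):
    """Flip each unordered vertex pair independently with probability p."""
    n = H.shape[0]
    flips = np.triu(rng.random((n, n)) < p, k=1)  # one coin per pair i < j
    flips = flips | flips.T                       # keep the graph undirected
    return np.where(flips, 1 - H, H)


def permute_vertices(G, rng):
    """Apply one random permutation to both rows and columns of G."""
    perm = rng.permutation(G.shape[0])
    return G[np.ix_(perm, perm)], perm


rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
H, labels = clique_graph(500, [0.4, 0.2, 0.1, 0.1, 0.1, 0.1])
G = contaminate(H, p=0.2, rng=rng)
G_perm, perm = permute_vertices(G, rng)
# G_perm would be the input handed to the clustering algorithm (not shown).
```

Drawing one coin per unordered pair and mirroring it keeps G symmetric, which matches an undirected similarity graph.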

**Figure 12.3:** A visualization of the simulation procedure. A: The adjacency matrix of the original clique graph H before the introduction of errors. Position (i,j) is white if $(i,j) \in E(H)$, that is, if i and j belong to the same cluster. B: The same matrix after the introduction of errors. Note that the cluster structure is still visible for all but the smallest clusters. C: The same as B, but with the vertex order randomly permuted. This is the actual input to the algorithm. D: Matrix C reordered according to the solution produced by the algorithm. With the possible exception of the smallest clusters, the essential cluster structure is reconstructed.
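
The reordering in panel D can be illustrated with a short sketch. It assumes the algorithm's solution is available as a vector of cluster labels, one per vertex; that representation and the name `reorder_by_clusters` are assumptions made here for illustration.

```python
import numpy as np


def reorder_by_clusters(C, labels):
    """Reorder rows and columns of C so that same-cluster vertices are adjacent.

    `labels[i]` is the (assumed) cluster label assigned to vertex i.
    Returns the reordered matrix and the vertex order that was applied.
    """
    order = np.argsort(labels, kind="stable")  # group vertices by label
    return np.asarray(C)[np.ix_(order, order)], order


# Tiny example: vertices 0 and 2 form one cluster, vertices 1 and 3 another.
C = np.array([[0, 0, 1, 0],
              [0, 0, 0, 1],
              [1, 0, 0, 0],
              [0, 1, 0, 0]])
D, order = reorder_by_clusters(C, np.array([1, 0, 1, 0]))
```

After reordering, the ones in `D` sit in blocks along the diagonal, which is the visual effect described for panel D.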

Several criteria can be used to compare the algorithm's output to the original clique graph. Let $N_{ab}$ denote the number of entries that are $a$ in one adjacency matrix and $b$ in the other. The matching coefficient is defined by $\frac{N_{00}+N_{11}}{N_{00}+N_{01}+N_{10}+N_{11}}$, that is, the number of matching entries divided by the total number of entries. The Jaccard coefficient is defined by $\frac{N_{11}}{N_{01}+N_{10}+N_{11}}$, which is the matching coefficient with $N_{00}$, the number of entries that are zero in both matrices, removed from both numerator and denominator. In sparse graphs $N_{00}$ is the dominant term, so the Jaccard coefficient is the more sensitive measure for such graphs.
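
Both coefficients can be computed directly from two adjacency matrices, as in the sketch below. It assumes symmetric 0/1 matrices and counts each unordered vertex pair once (upper triangle only, diagonal excluded); whether the original experiments counted entries this way is not stated, and the function names are hypothetical.

```python
import numpy as np


def pair_counts(A, B):
    """Return (N00, N01, N10, N11) over unordered vertex pairs i < j."""
    A = np.asarray(A)
    B = np.asarray(B)
    iu = np.triu_indices(A.shape[0], k=1)  # each pair once, no diagonal
    a = A[iu].astype(bool)
    b = B[iu].astype(bool)
    n11 = int(np.sum(a & b))
    n10 = int(np.sum(a & ~b))
    n01 = int(np.sum(~a & b))
    n00 = int(np.sum(~a & ~b))
    return n00, n01, n10, n11


def matching_coefficient(A, B):
    n00, n01, n10, n11 = pair_counts(A, B)
    return (n00 + n11) / (n00 + n01 + n10 + n11)


def jaccard_coefficient(A, B):
    n00, n01, n10, n11 = pair_counts(A, B)
    denom = n01 + n10 + n11
    # Two edgeless graphs agree trivially; returning 1.0 here is a convention.
    return n11 / denom if denom else 1.0


# Original clusters {0,1},{2,3} versus a solution with clusters {0,1,2},{3}.
A = np.array([[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]])
B = np.array([[0, 1, 1, 0], [1, 0, 1, 0], [1, 1, 0, 0], [0, 0, 0, 0]])
print(matching_coefficient(A, B))  # 0.5
print(jaccard_coefficient(A, B))   # 0.25
```

In this small example the six vertex pairs split into $N_{00}=2$, $N_{01}=2$, $N_{10}=1$, $N_{11}=1$, so the matching coefficient is 3/6 and the Jaccard coefficient is 1/4.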

Table 12.1: Performance of the BioClust algorithm for different values of p. Mean values of the matching coefficient and the Jaccard coefficient are given.

| Cluster structure | n | p | Matching coeff. | Jaccard coeff. |
|---|---|---|---|---|
| $\{0.4,0.2,0.1,0.1,0.1,0.1\}$ | 500 | 0.2 | 1.0 | 1.0 |
| $\{0.4,0.2,0.1,0.1,0.1,0.1\}$ | 500 | 0.3 | 0.999 | 0.995 |
| $\{0.4,0.2,0.1,0.1,0.1,0.1\}$ | 500 | 0.4 | 0.939 | 0.775 |

Table 12.1 presents simulation results for different values of the contamination error p, in terms of the mean matching coefficient and the mean Jaccard coefficient. The Jaccard coefficient is the more sensitive of the two: as p grows from 0.2 to 0.4, it drops from 1.0 to 0.775, whereas the matching coefficient drops only to 0.939.
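
To see why the Jaccard coefficient reacts more strongly, consider a purely illustrative set of counts (hypothetical numbers, not taken from the experiments): $N_{00}=900$, $N_{11}=60$, $N_{01}=N_{10}=20$. Then

$$\frac{N_{00}+N_{11}}{N_{00}+N_{01}+N_{10}+N_{11}} = \frac{960}{1000} = 0.96, \qquad \frac{N_{11}}{N_{01}+N_{10}+N_{11}} = \frac{60}{100} = 0.6.$$

The same 40 mismatched entries barely move the matching coefficient, because the 900 entries that are zero in both matrices dominate it, but they reduce the Jaccard coefficient to 0.6.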

**Figure 12.4:** Simulation results for H with cluster structure $\{\frac {1}{2}, \frac {1}{4}, \frac {1}{8}, \frac {1}{16}, \frac {1}{16}\}$. The x-axis is the number of clustered entities n and the y-axis is the mean value of the Jaccard coefficient. Each curve corresponds to a specific contamination error probability $\alpha$ (denoted p in the text).

Figure 12.4 presents simulation results for different values of n and p. The practical implementation preserves the properties of the theoretical algorithm. Performance improves as the number of clustered entities increases.


Peer Itsik
2001-02-01