Clustering: A brief overview
In general, a cluster is defined as a set of similar objects (Hartigan, p. 9). What counts as "similarity" varies with the data, because clustering is used in fields as different as numerical taxonomy, morphometrics, and systematics. Thus, a clustering algorithm whose optimization criterion suits one data set may not suit another (for example, the result can depend on the units of measurement). There are many algorithms for solving a clustering problem. The algorithms used in our applet concentrate on "joining", "splitting", and "switching" search methods (also called bottom-up, top-down, and interchange, respectively). They are represented here by the minimum-cost spanning tree algorithm, the maximum-cut algorithm, and the k-means algorithm.
Our clustering applet uses a three-dimensional model for demonstration purposes. However, the algorithms can compute clusters in n dimensions. We are planning to add an algorithm based on principal component analysis, projecting the n-dimensional data onto a 3-dimensional space.
1. Minimum-Cost Spanning Tree Clustering:
In this algorithm, we create a minimum-cost spanning tree from the given vertices and remove the k-1 largest edges, where k is the number of clusters. Each edge removed creates one more component, and the resulting components are the k clusters. Each cluster is spanned by a tree that connects all of its vertices at minimum total cost.
The minimum-cost spanning tree (MCST) of an undirected graph is a connected subgraph containing all vertices such that the sum of the costs of its edges is minimized. The MCST algorithm used in this applet is Prim's algorithm, whose worst-case running time is O((|V|+|E|)log|V|) (Manber, pp. 208-212).
Single-linkage methods, such as the minimum-cost spanning tree method, are known to produce long, sausage-shaped clusters in which objects far apart are linked together by a chain of close objects. A group of nodes may be close to one cluster, but because of a few "bridge" nodes it may end up connected to another cluster at a more distant location (Hartigan, p. 200). This resulting shape is characteristic of this algorithm.
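The procedure above can be sketched in a few lines. The following is an illustrative sketch, not the applet's code: it uses a simple O(|V|^2) variant of Prim's algorithm on the complete Euclidean graph of the points (the applet's heap-based version achieves the bound quoted above), then deletes the k-1 costliest tree edges and reads off the connected components.

```python
import math
from collections import defaultdict

def mst_clusters(points, k):
    """Cluster points by building a minimum-cost spanning tree
    (Prim's algorithm) and deleting the k-1 costliest edges."""
    n = len(points)
    dist = lambda a, b: math.dist(points[a], points[b])
    # best known connection of each outside vertex to the growing tree
    best = {v: (dist(0, v), 0) for v in range(1, n)}
    edges = []                              # MST edges as (cost, u, v)
    while best:
        v = min(best, key=lambda x: best[x][0])
        cost, u = best.pop(v)
        edges.append((cost, u, v))
        for w in best:                      # relax edges out of the new vertex
            if dist(v, w) < best[w][0]:
                best[w] = (dist(v, w), v)
    edges.sort()
    kept = edges[:len(edges) - (k - 1)]     # drop the k-1 largest edges
    # collect connected components of the remaining forest
    adj = defaultdict(list)
    for _, u, v in kept:
        adj[u].append(v)
        adj[v].append(u)
    seen, clusters = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        seen.add(s)
        while stack:
            u = stack.pop()
            comp.append(u)
            for w in adj[u]:
                if w not in seen:
                    seen.add(w)
                    stack.append(w)
        clusters.append(sorted(comp))
    return clusters

# Two tight groups far apart split into two clusters
pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(mst_clusters(pts, 2))  # [[0, 1, 2], [3, 4, 5]]
```

Note how the single bridge edge between the two groups is the largest tree edge, so it is the one removed.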
2. MacQueen's k-means Clustering:
The k-means clustering algorithm uses an interchange (or switching) method to partition a graph into clusters. An initial partition is given, and new partitions are obtained by switching an object from one cluster to another. MacQueen's method initially picks k points at random, where each point stands for a cluster to be made. A set of points is taken from the graph, and each point is added to the closest cluster. The "closeness" to a cluster is determined by calculating the distance between a point and the centroid of the cluster (the mean of the points that belong to it). Then each point is visited to recalculate its distance to the updated clusters. If the closest cluster is not the one to which the point currently belongs, the point switches to the new cluster. When a switch occurs, the centroids of both modified clusters have to be recalculated. This procedure is repeated until no more switching takes place (in our case, we have set an upper bound of five repetitions to speed up the Java applet).
Although the initial points may not generate an optimal solution, MacQueen's method converges the sum of squared distances within the clusters (the within-cluster variance) to a local optimum (Hartigan, p. 106).
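The switching loop described above can be sketched as follows. This is an illustrative sketch rather than the applet's code; like the applet, it caps the number of passes (max_rounds) and, as described above, recomputes the two affected centroids after every switch. The guard against emptying a cluster is an assumption added for robustness.

```python
import math
import random

def macqueen_kmeans(points, k, max_rounds=5, seed=0):
    """MacQueen-style k-means sketch: seed k clusters with random
    points, assign every point to its nearest centroid, then keep
    switching points to closer clusters (recomputing the two
    affected centroids after each switch) until nothing moves."""
    rng = random.Random(seed)

    def centroid(c):
        members = [points[i] for i in range(len(points)) if assign[i] == c]
        return [sum(x) / len(members) for x in zip(*members)]

    centroids = [list(p) for p in rng.sample(points, k)]
    nearest = lambda p: min(range(k), key=lambda c: math.dist(p, centroids[c]))
    assign = [nearest(p) for p in points]
    centroids = [centroid(c) for c in range(k)]
    for _ in range(max_rounds):
        switched = False
        for i, p in enumerate(points):
            c = nearest(p)
            # switch, unless it would empty the point's current cluster
            if c != assign[i] and assign.count(assign[i]) > 1:
                old = assign[i]
                assign[i] = c
                centroids[old], centroids[c] = centroid(old), centroid(c)
                switched = True
        if not switched:
            break
    return assign

pts = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = macqueen_kmeans(pts, 2)
print(labels[:3], labels[3:])  # each tight group ends up sharing one label
```

Even when both random seeds fall inside the same group, the first pass of switches pulls the far group into its own cluster, illustrating the convergence to a local optimum noted above.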
3. Maximum-Cut Clustering:
Another method of clustering is to split a given graph into k clusters. However, such splitting algorithms often have difficulty settling on a good splitting rule. The algorithm we use is based on maximum cut, which is an NP-hard (intractable) problem (Garey and Johnson). The nodes are partitioned successively into k subsets, each time using the maximum-cut procedure.
For the maximum-cut procedure, we use an approximation algorithm. This algorithm is also visualized by an applet in RIOT (along with the minimum cut). Please refer to the Maximum/Minimum Cut page for more details.
To build clusters with maximum cut, we proceed as follows. Each time a subset of the graph is partitioned, we choose the next component to split by calculating each component's average edge weight relative to its number of vertices, and taking the component with the larger value. We made this choice because, most of the time, the desired clusters are the ones with minimum mean distance to each centroid.
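The recursive splitting can be sketched as below. The applet's approximation algorithm is the one on the RIOT Maximum/Minimum Cut page; as a stand-in, this sketch uses a generic local-search heuristic (move any vertex whose switch increases the cut weight), which is a well-known 1/2-approximation for maximum cut. The density function is one interpretation of "average edge weight relative to the number of vertices"; both it and the helper names are assumptions, not the applet's code.

```python
import math
from itertools import combinations

def max_cut(vertices, weight):
    """Local-search 1/2-approximation for maximum cut: start from an
    arbitrary split and move any vertex whose switch increases the
    total weight crossing the cut, until no move helps."""
    side_a, side_b = set(vertices[::2]), set(vertices[1::2])
    improved = True
    while improved:
        improved = False
        for v in vertices:
            here, there = (side_a, side_b) if v in side_a else (side_b, side_a)
            # gain of moving v: same-side edges become cut, cut edges are lost
            gain = (sum(weight(v, u) for u in here if u != v)
                    - sum(weight(v, u) for u in there))
            if gain > 0:
                here.remove(v)
                there.add(v)
                improved = True
    return side_a, side_b

def maxcut_clusters(points, k):
    """Split the point set into k clusters by repeatedly applying
    max cut to the component with the largest average edge weight
    per vertex (one reading of the rule described above)."""
    weight = lambda u, v: math.dist(points[u], points[v])

    def density(comp):
        if len(comp) < 2:
            return -1.0
        return sum(weight(u, v) for u, v in combinations(comp, 2)) / len(comp)

    clusters = [list(range(len(points)))]
    while len(clusters) < k:
        comp = max(clusters, key=density)   # next component to split
        clusters.remove(comp)
        a, b = max_cut(comp, weight)
        clusters += [sorted(a), sorted(b)]
    return clusters

pts = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
print(sorted(maxcut_clusters(pts, 2)))  # [[0, 1, 2], [3, 4, 5]]
```

On a metric graph like this one, the heaviest cut is the one whose crossing edges span the two distant groups, which is why maximum cut doubles as a splitting rule for clustering.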
Garey, Michael R. and Johnson, David S. Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco, CA, 1979.
Hartigan, John A. Clustering Algorithms, John Wiley & Sons, New York, NY, 1975.
Manber, Udi. Introduction to Algorithms: A Creative Approach, Addison-Wesley, Reading, MA, 1989.
This project is made possible by Professor Dorit S. Hochbaum's ONR research grant N00014-91-J-1241.
Questions or comments? Send mail to Professor Hochbaum.
Copyright ©1997 Professor Dorit Hochbaum