Non-hierarchical cluster analysis (often known as K-means Clustering Method) forms a grouping of a set of units, into a pre-determined number of groups, using an iterative algorithm that optimizes a chosen criterion. Starting from an initial classification, units are transferred from one group to another or swapped with units from other groups, until no further improvement can be made to the criterion value. There is no guarantee that the solution thus obtained will be globally optimal - by starting from a different initial classification it is sometimes possible to obtain a better classification. However, starting from a good initial classification much increases the chances of producing an optimal or near-optimal solution.
The algorithm is called k-means, where k is the number of clusters you want; since a case is assigned to the cluster for which its distance to the cluster mean is the smallest. The action in the algorithm centers on finding the k-means. You start out with an initial set of means and classify cases based on their distances to the centers. Next, you compute the cluster means again, using the cases that are assigned to the cluster; then, you reclassify all cases based on the new set of means. You keep repeating this step until cluster means don’t change much between successive steps. Finally, you calculate the means of the clusters once again and assign the cases to their permanent clusters.
Steps in Non-Hierarchical Cluster Analyisis
In this method, the desired number of clusters is specified in advance and the ’best’ solution is chosen. The steps in such a method are as follows:
1. Choose initial cluster centers (essentially this is a set of observations that are far apart — each subject forms a cluster of one and its center is the value of the variables for that subject).
2. Assign each subject to its “nearest” cluster, defined in terms of the distance to the centroid.
3. Find the centroids of the clusters that have been formed
4. Re-calculate the distance from each subject to each centroid and move observations that are not in the cluster that they are closest to.
5. Continue until the centroids remain relatively stable.
Non-hierarchical cluster analysis tends to be used when large data sets are involved. It is sometimes preferred because it allows subjects to move from one cluster to another (this is not possible in hierarchical cluster analysis where a subject, once assigned, cannot move to a different cluster). Two disadvantages of non-hierarchical cluster analysis are: (1) it is often difficult to know how many clusters you are likely to have and therefore the analysis may have to be repeated several times and (2) it can be very sensitive to the choice of initial cluster centers. Again, it may be worth trying different ones to see what impact this has.
One possible strategy to adopt is to use a hierarchical approach initially to determine how many clusters there are in the data and then to use the cluster centers obtained from this as initial cluster centers in the non-hierarchical method.
Determining the Number of Clusters in a Data Set
A quantity often labeled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.
The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own...