(Notes from: Tan, Steinbach, Kumar + Ghosh)

K-Means Algorithm

• K = # of clusters (given); one “mean” per cluster
• Interval data
• Initialize means (e.g. by picking K samples at random)
• Iterate:
  (1) assign each point to nearest mean
  (2) move “mean” to center of its cluster
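The two-step iteration above can be sketched in plain Python. This is a minimal illustration, not the slides' own code: the function names, the `init` and `seed` parameters, the stopping rule (stop when no label changes), and the toy data are all assumptions made for the example.

```python
import random

def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Return (means, labels). `init` optionally fixes the starting means."""
    rng = random.Random(seed)
    # Initialize means, by default by picking k samples at random.
    means = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # (1) assign each point to its nearest mean
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, means[c]))
            for p in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # (2) move each mean to the center of its cluster
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old mean
                means[c] = centroid(members)
    return means, labels

# Hypothetical toy data: two small groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
means, labels = kmeans(points, k=2)
print(means, labels)
```

Each pass costs one distance computation per (point, mean) pair, which is where the k · n factor in the per-iteration cost comes from.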

(C) Vipin Kumar, Parallel Issues in Data Mining, VECPAR 2002


Assignment Step; Means Update


Convergence after another iteration

Complexity: O(k · n · # of iterations)


K-means

– J. MacQueen, "Some methods for classification and analysis of multivariate observations," Proc. of the Fifth Berkeley Symp. on Math. Stat. and Prob., vol. 1, pp. 281-296, 1967.
– E. Forgy, "Cluster analysis of multivariate data: efficiency vs. interpretability of classification," Biometrics, vol. 21, pp. 768, 1965.
– D. J. Hall and G. B. Ball, "ISODATA: A novel method of data analysis and pattern classification," Technical Report, Stanford Research Institute, Menlo Park, CA, 1965.

The history of k-means type of algorithms (LBG Algorithm, 1980):
R.M. Gray and D.L. Neuhoff, "Quantization," IEEE Transactions on Information Theory, Vol. 44, pp. 2325-2384, October 1998. (Commemorative Issue, 1948-1998)

ICDM: Top Ten Data Mining Algorithms

K-means

December, 2006


K-means Clustering – Details

• Complexity is O( n * K * I * d )
  – n = number of points, K = number of clusters, I = number of iterations, d = number of attributes
• Easily parallelized
• Use kd-trees or other efficient spatial data structures for some situations
  – Pelleg and Moore (X-means)
• Sensitivity to initial conditions
• A good clustering with smaller K can have a lower SSE than a poor clustering with higher K
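The last two points can be made concrete with a small sketch. It assumes SSE means the sum of squared Euclidean distances from each point to the mean of its cluster. The 1-D data, the two hand-picked labelings, and the helper names (`sse`, `lloyd`, `best_of_restarts`) are hypothetical. The poor K = 4 labeling is a fixed point of the k-means iteration on this data, so a run with an unlucky initialization could stop there.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def sse(points, labels):
    """Sum of squared distances from each point to its cluster's mean."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for members in clusters.values():
        c = centroid(members)
        total += sum(sq_dist(p, c) for p in members)
    return total

def lloyd(points, k, rng, max_iter=100):
    """One k-means run from a random initialization; returns labels."""
    means = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new = [min(range(k), key=lambda c: sq_dist(p, means[c])) for p in points]
        if new == labels:
            break
        labels = new
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: empty clusters keep their old mean
                means[c] = centroid(members)
    return labels

def best_of_restarts(points, k, restarts=10, seed=0):
    """Run k-means several times and keep the labeling with the lowest SSE."""
    rng = random.Random(seed)
    best_labels, best_sse = None, float("inf")
    for _ in range(restarts):
        labels = lloyd(points, k, rng)
        score = sse(points, labels)
        if score < best_sse:
            best_labels, best_sse = labels, score
    return best_labels, best_sse

# Three tight 1-D groups.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,),
          (20.0,), (21.0,), (22.0,)]
good_k3 = [0, 0, 0, 1, 1, 1, 2, 2, 2]  # one cluster per group
poor_k4 = [0, 1, 2, 3, 3, 3, 3, 3, 3]  # first group split, other two merged
print(sse(points, good_k3), sse(points, poor_k4))  # prints 6.0 154.0

labels, score = best_of_restarts(points, k=3)
print(labels, score)
```

Because SSE depends so strongly on where a run happens to stop, comparing SSE values across different K only makes sense between good clusterings for each K.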


Limitations of K-means

• K-means has problems when clusters are of differing
  – Sizes
  – Densities
  – Non-globular shapes
• Problems with outliers
• Empty clusters


Limitations of K-means: Differing Density

Original Points

K-means (3 Clusters)


Limitations of K-means: Non-globular Shapes

Original Points

K-means (2 Clusters)


Overcoming K-means Limitations

Original Points

K-means Clusters


Solutions to Initial Centroids Problem

• Multiple runs
• Cluster a sample first
• ….
• Bisecting K-means
  – Not as susceptible to initialization issues
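A rough sketch of bisecting K-means follows. The slides do not spell out the procedure, so the details here are assumptions: start with one cluster holding all points, repeatedly pick the cluster with the largest SSE, split it with 2-means, and keep the best of a few trial splits. All names, the `trials` parameter, and the toy data are hypothetical.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def centroid(members):
    return tuple(sum(axis) / len(members) for axis in zip(*members))

def cluster_sse(members):
    """Sum of squared distances from the members to their own mean."""
    c = centroid(members)
    return sum(sq_dist(p, c) for p in members)

def two_means(points, rng, max_iter=50):
    """Split `points` into two lists with a plain 2-means run."""
    means = rng.sample(points, 2)
    labels = None
    for _ in range(max_iter):
        new = [0 if sq_dist(p, means[0]) <= sq_dist(p, means[1]) else 1
               for p in points]
        if new == labels:
            break
        labels = new
        for c in (0, 1):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                means[c] = centroid(members)
    left = [p for p, lab in zip(points, labels) if lab == 0]
    right = [p for p, lab in zip(points, labels) if lab == 1]
    return left, right

def bisecting_kmeans(points, k, trials=5, seed=0):
    """Return a list of clusters (lists of points), at most k of them."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) >= 2]
        if not splittable:
            break
        # Assumed selection rule: split the cluster with the largest SSE.
        target = max(splittable, key=cluster_sse)
        best = None
        for _ in range(trials):
            left, right = two_means(target, rng)
            if not left or not right:
                continue  # degenerate split, e.g. all points identical
            score = cluster_sse(left) + cluster_sse(right)
            if best is None or score < best[0]:
                best = (score, left, right)
        if best is None:
            break  # nothing could be split further
        clusters.remove(target)
        clusters.extend(best[1:])
    return clusters

# Hypothetical toy data: three small groups of 2-D points.
points = [(0.0, 0.0), (0.5, 0.2), (0.1, 0.6),
          (10.0, 0.0), (10.4, 0.3), (9.8, 0.5),
          (5.0, 9.0), (5.3, 9.4), (4.8, 8.7)]
clusters = bisecting_kmeans(points, k=3)
for members in clusters:
    print(members)
```

Each split is only a 2-means problem on a subset of the data, which is one plausible reading of why the slide calls the method less susceptible to initialization issues. Individual splits can still land in a poor local optimum, hence the trial splits.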


Bisecting K-means Example


Generalizing K-means

– Model based k-means
  • “means” are probabilistic models
  • (unified framework, Zhong & Ghosh, JMLR 03)

– Kernel k-means
  • Map data to higher dimensional space
  • Perform k-means clustering
  • Has a relationship to spectral clustering
  • Inderjit S. Dhillon, Yuqiang Guan, Brian Kulis: Kernel k-means: spectral clustering and normalized cuts. KDD 2004: 551-556
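Kernel k-means never needs the higher dimensional coordinates explicitly. The squared distance from a mapped point to a cluster mean can be written using kernel values only: K(x, x) minus twice the average of K(x, y) over cluster members y, plus the average of K(y, z) over all member pairs. The sketch below uses that identity. The function names, the RBF and linear kernels, the explicit `init_labels` argument, and the toy data are assumptions for illustration. This is the plain unweighted variant, not the weighted formulation of Dhillon et al.

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel; gamma is an assumed bandwidth parameter."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def linear_kernel(x, y):
    """Plain dot product; with it, kernel k-means reduces to k-means."""
    return sum(a * b for a, b in zip(x, y))

def kernel_kmeans(points, init_labels, k, kernel, max_iter=100):
    """Return cluster labels, starting from the given initial labels."""
    n = len(points)
    gram = [[kernel(points[i], points[j]) for j in range(n)] for i in range(n)]
    labels = list(init_labels)
    for _ in range(max_iter):
        members = [[i for i in range(n) if labels[i] == c] for c in range(k)]
        # Average kernel value over all member pairs, one number per cluster.
        within = [
            sum(gram[j][l] for j in m for l in m) / len(m) ** 2 if m else 0.0
            for m in members
        ]
        new_labels = []
        for i in range(n):
            best_c, best_d = labels[i], float("inf")
            for c in range(k):
                m = members[c]
                if not m:
                    continue  # skip empty clusters
                # Squared feature-space distance from point i to mean of c.
                d = (gram[i][i]
                     - 2.0 * sum(gram[i][j] for j in m) / len(m)
                     + within[c])
                if d < best_d:
                    best_c, best_d = c, d
            new_labels.append(best_c)
        if new_labels == labels:
            break
        labels = new_labels
    return labels

# Hypothetical 1-D data with a deliberately scrambled starting labeling.
points = [(0.0,), (1.0,), (2.0,), (10.0,), (11.0,), (12.0,)]
start = [0, 1, 0, 1, 0, 1]
print(kernel_kmeans(points, start, k=2, kernel=linear_kernel))
print(kernel_kmeans(points, start, k=2,
                    kernel=lambda x, y: rbf_kernel(x, y, gamma=0.05)))
```

Building the full Gram matrix costs O(n²) kernel evaluations and memory, which is the main price paid for working in the implicit feature space.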

Clustering with Bregman Divergences

Banerjee, Merugu, Dhillon, Ghosh, SDM 2004; JMLR 2005
– Hard Clustering: KMeans-type algo possible for any Bregman Divergence
– Bijection: convex function ↔ Bregman divergence ↔ exp. family
– Soft Clustering: efficient algo for learning mixtures of any exponential family
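The hard clustering claim can be illustrated with a short sketch. The structure is the k-means loop with the distance swapped for a Bregman divergence d(x, mu). The update step stays the arithmetic mean, since the mean minimizes the total divergence to the cluster members for any Bregman divergence. The function names, the two example divergences, and the toy data are assumptions for illustration. The generalized I-divergence used here requires strictly positive coordinates.

```python
import math

def squared_euclidean(x, mu):
    """Bregman divergence of phi(x) = ||x||^2; gives ordinary k-means."""
    return sum((a - b) ** 2 for a, b in zip(x, mu))

def generalized_i_divergence(x, mu):
    """Bregman divergence of phi(x) = sum x log x (inputs must be > 0)."""
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, mu))

def bregman_hard_clustering(points, init_means, divergence, max_iter=100):
    """Return (means, labels) for a k-means-type loop with `divergence`."""
    means = [tuple(m) for m in init_means]
    k = len(means)
    labels = None
    for _ in range(max_iter):
        # Assignment step: nearest representative under the divergence.
        # The data point is the first argument, the mean the second.
        new_labels = [
            min(range(k), key=lambda c: divergence(p, means[c]))
            for p in points
        ]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: the arithmetic mean, whatever the divergence is.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: empty clusters keep their old mean
                means[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return means, labels

# Hypothetical strictly positive 2-D data in two groups.
points = [(1.0, 9.0), (2.0, 8.0), (1.5, 8.5),
          (9.0, 1.0), (8.0, 2.0), (8.5, 1.5)]
init = [(1.0, 9.0), (9.0, 1.0)]
for div in (squared_euclidean, generalized_i_divergence):
    means, labels = bregman_hard_clustering(points, init, div)
    print(div.__name__, means, labels)
```

Only the assignment step changes between divergences, so supporting another one means adding a single function.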


Bregman Hard Clustering


Algorithm Properties
