Determining the number of clusters in a data set

Determining the number of clusters in a data set, a quantity often labelled k as in the k-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.

For a certain class of clustering algorithms (in particular k-means, k-medoids and the expectation–maximization algorithm), a parameter commonly referred to as k specifies the number of clusters to detect. Other algorithms, such as DBSCAN and OPTICS, do not require this parameter; hierarchical clustering avoids the problem altogether.

The correct choice of k is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the clustering resolution desired by the user. In addition, increasing k without penalty will always reduce the amount of error in the resulting clustering, down to the extreme case of zero error when each data point is its own cluster (i.e., when k equals the number of data points, n). Intuitively then, the optimal choice of k strikes a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster. If an appropriate value of k is not apparent from prior knowledge of the properties of the data set, it must be chosen by some other means. There are several categories of methods for making this decision.
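The trade-off described above can be illustrated with a minimal Python sketch. It assumes the clustering error is measured as the within-cluster sum of squared distances; the function name `sse`, the four toy points and the hand-written label lists are hypothetical, and no clustering algorithm is actually run.

```python
def sse(points, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        mean = [sum(m[j] for m in members) / len(members) for j in range(dim)]
        total += sum((m[j] - mean[j]) ** 2 for m in members for j in range(dim))
    return total


# Toy data: two tight pairs of points that are far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = sse(points, [0, 0, 0, 0])   # k = 1: maximum compression
two_clusters = sse(points, [0, 0, 1, 1])  # k = 2: one cluster per pair
singletons = sse(points, [0, 1, 2, 3])    # k = n: every point is its own cluster

print(one_cluster, two_clusters, singletons)
```

The error falls from 104 to 4 to 0 as k goes from 1 to 2 to n, so the error alone cannot be used to pick k: it always favours k = n.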

Information–theoretic approach

Rate distortion theory has been applied to choosing k in what is called the "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by information-theoretic standards.^[7] The strategy of the algorithm is to generate a distortion curve for the input data by running a standard clustering algorithm such as k-means for all values of k between 1 and n, and computing the distortion (described below) of each resulting clustering. The distortion curve is then transformed by a negative power chosen based on the dimensionality of the data. Jumps in the resulting values then signify reasonable choices for k, with the largest jump representing the best choice.

The distortion of a clustering of some input data is formally defined as follows: Let the data set be modeled as a p-dimensional random variable, X, consisting of a mixture distribution of G components with common covariance, Γ. If we let $c_{1}\ldots c_{K}$ be a set of K cluster centers, with $c_{X}$ the closest center to a given sample of X, then the minimum average distortion per dimension when fitting the K centers to the data is:

$$d_{K}={\frac {1}{p}}\min _{c_{1}\ldots c_{K}}{E[(X-c_{X})^{T}\Gamma ^{-1}(X-c_{X})]}$$

This is also the average Mahalanobis distance per dimension between X and the closest cluster center $c_{X}$. Because the minimization over all possible sets of cluster centers is prohibitively complex, the distortion is computed in practice by generating a set of cluster centers using a standard clustering algorithm and computing the distortion using the result. The pseudo-code for the jump method with an input set of p-dimensional data points X is:

JumpMethod(X):
    Let Y = (p/2)
    Init a list D, of size n+1
    Let D[0] = 0
    For k = 1 ... n:
        Cluster X with k clusters (e.g., with k-means)
        Let d = Distortion of the resulting clustering
        D[k] = d^(-Y)
    Define J(i) = D[i] - D[i-1]
    Return the k between 1 and n that maximizes J(k)
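
As a rough Python counterpart to the pseudo-code above, the following sketch computes the empirical distortion and picks the largest jump. It rests on several assumptions not taken from the source: the common covariance Γ is the identity (so the Mahalanobis form reduces to squared Euclidean distance), the function names `distortion` and `jump_method` are hypothetical, and the centers for each k are written by hand as stand-ins for the output of a clustering algorithm such as k-means.

```python
def distortion(points, centers):
    """Average squared distance per dimension from each point to its closest center.

    Assumption: identity covariance, so no Gamma^{-1} weighting is applied.
    """
    p = len(points[0])
    total = 0.0
    for x in points:
        total += min(sum((xi - ci) ** 2 for xi, ci in zip(x, c)) for c in centers)
    return total / (len(points) * p)


def jump_method(distortions, p):
    """Return (best_k, jumps) given distortions d_1..d_K of p-dimensional data.

    Every distortion must be positive, so K has to stay below the number of
    distinct data points (a zero distortion cannot be raised to a negative power).
    """
    y = p / 2.0
    transformed = [0.0] + [d ** (-y) for d in distortions]  # D[0] = 0
    jumps = [transformed[k] - transformed[k - 1] for k in range(1, len(transformed))]
    best_k = max(range(1, len(jumps) + 1), key=lambda k: jumps[k - 1])
    return best_k, jumps


# Toy data: three well-separated groups of four points each, in p = 2 dimensions.
offsets = [(-1.0, -1.0), (-1.0, 1.0), (1.0, -1.0), (1.0, 1.0)]
true_means = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
points = [(mx + dx, my + dy) for mx, my in true_means for dx, dy in offsets]

# Hand-written centers for k = 1..4, standing in for a clustering algorithm's output.
centers_by_k = [
    [(10.0 / 3, 10.0 / 3)],
    [(0.0, 5.0), (10.0, 0.0)],
    [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)],
    [(-1.0, 0.0), (1.0, 0.0), (10.0, 0.0), (0.0, 10.0)],
]
distortions = [distortion(points, centers) for centers in centers_by_k]
best_k, jumps = jump_method(distortions, p=2)
print(best_k, jumps)
```

On this toy data the largest jump occurs at k = 3, the number of groups used to build the points. With real data the centers would come from an actual clustering run, and the result would depend on how well that run converges.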

The choice of the transform power $Y=(p/2)$ is motivated by asymptotic reasoning using results from rate distortion theory. Let the data X have a single, arbitrary p-dimensional Gaussian distribution, and fix $K=\lfloor \alpha ^{p}\rfloor$ for some α greater than zero. Then the distortion of a clustering of K clusters in the limit as p goes to infinity is $\alpha ^{-2}$. It can be seen that asymptotically, the distortion of a clustering to the power $(-p/2)$ is proportional to $\alpha ^{p}$, which by definition is approximately the number of clusters K. In other words, for a single Gaussian distribution, increasing K beyond the true number of clusters, which should be one, causes a linear growth in the transformed distortion. This behavior is important in the general case of a mixture of multiple distribution components.
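
The step from the limiting distortion to the cluster count can be written out as a short, informal calculation, treating the large-p limit as an approximation rather than a rigorous statement:

$$d_{K}\approx \alpha ^{-2}\quad \Longrightarrow \quad d_{K}^{-p/2}\approx \left(\alpha ^{-2}\right)^{-p/2}=\alpha ^{p}\approx \lfloor \alpha ^{p}\rfloor =K$$

As an illustrative check with arbitrarily chosen values, α = 1.5 and p = 10 give $\alpha ^{p}\approx 57.67$, so $K=57$, while the transformed distortion is $(1.5^{-2})^{-5}=1.5^{10}\approx 57.67$, within one unit of K.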

Let X be a mixture of G p-dimensional Gaussian distributions with common covariance. Then for any fixed K less than G, the distortion of a clustering as p goes to infinity is infinite. Intuitively, this means that a clustering with fewer than the correct number of clusters is unable to describe asymptotically high-dimensional data, causing the distortion to increase without limit. If, as described above, K is made an increasing function of p, namely, $K=\lfloor \alpha ^{p}\rfloor$, the same result as above is achieved, with the value of the distortion in the limit as p goes to infinity being equal to $\alpha ^{-2}$. Correspondingly, there is the same proportional relationship between the transformed distortion and the number of clusters, K.

Putting the results above together, it can be seen that for sufficiently high values of p, the transformed distortion $d_{K}^{-p/2}$ is approximately zero for K < G, then jumps suddenly and begins increasing linearly for K ≥ G. The jump algorithm for choosing K makes use of these behaviors to identify the most likely value for the true number of clusters.

Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been empirically verified to work well in a variety of data sets with reasonable dimensionality. In addition to the localized jump method described above, there exists a second algorithm for choosing K using the same transformed distortion values, known as the broken line method. The broken line method identifies the jump point in the graph of the transformed distortion by fitting two line segments by least squares, which in theory will fall along the x-axis for K < G, and along the linearly increasing phase of the transformed distortion plot for K ≥ G. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully non-parametric and has been shown to be viable for general mixture distributions.
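
One plausible reading of the broken line method can be sketched in Python as follows. This is an assumption-laden illustration rather than the published procedure: it tries every breakpoint, fits an ordinary least-squares line to each side, and keeps the breakpoint with the smallest total squared error. The names `fit_line` and `broken_line` are hypothetical, the input values are made up to match the idealized shape described above, and each segment is required to contain at least two points, so the sketch cannot report a breakpoint at K = 1 or K = 2.

```python
def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept, squared_error)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = 0.0 if sxx == 0 else sxy / sxx
    intercept = mean_y - slope * mean_x
    err = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    return slope, intercept, err


def broken_line(transformed):
    """Return the K at which the rising segment starts, or None if too few points.

    transformed[i] is the transformed distortion for K = i + 1.
    """
    ks = list(range(1, len(transformed) + 1))
    best_k, best_err = None, float("inf")
    # Both segments need at least two points to define a line.
    for split in range(2, len(ks) - 1):
        left_err = fit_line(ks[:split], transformed[:split])[2]
        right_err = fit_line(ks[split:], transformed[split:])[2]
        if left_err + right_err < best_err:
            best_k, best_err = ks[split], left_err + right_err
    return best_k


# Idealized curve: flat at zero for K < 4, then increasing linearly for K >= 4.
transformed = [0.0, 0.0, 0.0, 3.0, 4.0, 5.0, 6.0]
estimated_g = broken_line(transformed)
print(estimated_g)
```

Because every point contributes to the total error, a single noisy value shifts the chosen breakpoint less than it would shift the largest single jump, which is the sense in which the decision is global rather than local.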

Finding number of clusters in text databases

When clustering text databases with the cover coefficient on a document collection defined by a document-by-term matrix D (of size m×n, where m is the number of documents and n is the number of terms), the number of clusters can be roughly estimated by the formula ${\tfrac {mn}{t}}$, where t is the number of non-zero entries in D. Note that each row and each column of D must contain at least one non-zero element.^[11]
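
The estimate is simple enough to compute directly. The sketch below assumes D is given as a list of rows of numbers; the function name `estimate_cluster_count` and the example matrix are hypothetical.

```python
def estimate_cluster_count(d_matrix):
    """Estimate the number of clusters as m * n / t for a document-by-term matrix."""
    m = len(d_matrix)
    n = len(d_matrix[0])
    # The formula presumes every row and every column has a non-zero entry.
    if any(not any(row) for row in d_matrix):
        raise ValueError("every document (row) needs at least one non-zero entry")
    if any(not any(row[j] for row in d_matrix) for j in range(n)):
        raise ValueError("every term (column) needs at least one non-zero entry")
    t = sum(1 for row in d_matrix for value in row if value != 0)
    return m * n / t


# Four documents over four terms: two pairs of documents with disjoint vocabularies.
D = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
estimate = estimate_cluster_count(D)
print(estimate)
```

Here m = 4, n = 4 and t = 8, so the estimate is 2, matching the two groups of documents that share no terms. A denser matrix (larger t) lowers the estimate, and a sparser one raises it.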

This article uses material from the Wikipedia article Determining_the_number_of_clusters_in_a_data_set, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

[1] See, e.g., David J. Ketchen Jr; Christopher L. Shook (1996). "The application of cluster analysis in Strategic Management Research: An analysis and critique". Strategic Management Journal. 17 (6): 441–458. doi:10.1002/(SICI)1097-0266(199606)17:6<441::AID-SMJ819>3.0.CO;2-G.

[2] Schubert, Erich (2023-06-22). "Stop using the elbow criterion for k-means and how to choose the number of clusters instead". ACM SIGKDD Explorations Newsletter. 25 (1): 36–42. arXiv:2212.12189. doi:10.1145/3606274.3606278. ISSN 1931-0145.

[3] See, e.g., Figure 6 in Cyril Goutte, Peter Toft, Egill Rostrup, Finn Årup Nielsen, Lars Kai Hansen (March 1999). "On Clustering fMRI Time Series". NeuroImage. 9 (3): 298–310. doi:10.1006/nimg.1998.0391. PMID 10075900. S2CID 14147564.

[4] Robert L. Thorndike (December 1953). "Who Belongs in the Family?". Psychometrika. 18 (4): 267–276. doi:10.1007/BF02289263. S2CID 120467216.

[5] D. Pelleg; AW Moore. X-means: Extending K-means with Efficient Estimation of the Number of Clusters (PDF). Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000). Retrieved 2016-08-16.

[6] Cyril Goutte, Lars Kai Hansen, Matthew G. Liptrot & Egill Rostrup (2001). "Feature-Space Clustering for fMRI Meta-Analysis". Human Brain Mapping. 13 (3): 165–183. doi:10.1002/hbm.1031. PMC 6871985. PMID 11376501. See especially Figure 14 and appendix.

[7] Catherine A. Sugar; Gareth M. James (2003). "Finding the number of clusters in a data set: An information-theoretic approach". Journal of the American Statistical Association. 98 (January): 750–763. doi:10.1198/016214503000000666. S2CID 120113332.

[8] Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.

[9] R. Lleti; M.C. Ortiz; L.A. Sarabia; M.S. Sánchez (2004). "Selecting Variables for k-Means Cluster Analysis by Using a Genetic Algorithm that Optimises the Silhouettes". Analytica Chimica Acta. 515: 87–100. doi:10.1016/j.aca.2003.12.020.

[10] R.C. de Amorim & C. Hennig (2015). "Recovering the number of clusters in data sets with noise features using feature rescaling factors". Information Sciences. 324: 126–145. arXiv:1602.06989. doi:10.1016/j.ins.2015.06.039. S2CID 315803.

[11] Can, F.; Ozkarahan, E. A. (1990). "Concepts and effectiveness of the cover-coefficient-based clustering methodology for text databases". ACM Transactions on Database Systems. 15 (4): 483. doi:10.1145/99935.99938. hdl:2374.MIA/246. S2CID 14309214. See especially Section 2.7.

[12] Honarkhah, M; Caers, J (2010). "Stochastic Simulation of Patterns Using Distance-Based Pattern Modeling". Mathematical Geosciences. 42 (5): 487–517. doi:10.1007/s11004-010-9276-7. S2CID 73657847.

[13] Robert Tibshirani; Guenther Walther; Trevor Hastie (2001). "Estimating the number of clusters in a data set via the gap statistic". Journal of the Royal Statistical Society, Series B. 63 (2): 411–423. doi:10.1111/1467-9868.00293. S2CID 59738652.

[14] Brodinová, Šárka; Filzmoser, Peter; Ortner, Thomas; Breiteneder, Christian; Rohm, Maia (2019-03-19). "Robust and sparse k-means clustering for high-dimensional data". Advances in Data Analysis and Classification. arXiv:1709.10012. doi:10.1007/s11634-019-00356-9. ISSN 1862-5347.

[15] "cluster R package". 28 March 2022.
