Canopy clustering algorithm

The canopy clustering algorithm is an unsupervised pre-clustering algorithm introduced by Andrew McCallum, Kamal Nigam and Lyle Ungar in 2000.^[1] It is often used as a preprocessing step for the K-means algorithm or for hierarchical clustering. It is intended to speed up clustering on large data sets, where applying another algorithm directly may be impractical because of the size of the data set.

Description

The algorithm proceeds as follows, using two thresholds $T_{1}$ (the loose distance) and $T_{2}$ (the tight distance), where $T_{1} > T_{2}$.^[1]^[2]

1. Begin with the set of data points to be clustered.
2. Remove a point from the set, beginning a new 'canopy' containing this point.
3. For each point left in the set, assign it to the new canopy if its distance to the first point of the canopy is less than the loose distance $T_{1}$.
4. If the distance of the point is additionally less than the tight distance $T_{2}$, remove it from the original set.
5. Repeat from step 2 until there are no more data points in the set to cluster.
6. These relatively cheaply clustered canopies can be sub-clustered using a more expensive but accurate algorithm.

Note that an individual data point may belong to several canopies: a point that lies within $T_{1}$ but not within $T_{2}$ of a canopy's first point stays in the set and can join later canopies. As an additional speed-up, a fast approximate distance metric can be used for step 3, while a slower, more accurate distance metric can be used for step 4.
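The steps above can be sketched in Python as follows. This is a minimal illustration and not a reference implementation: the function name `canopy_clustering`, its parameters, and the `euclidean` helper are hypothetical, and always taking the first remaining point as the new canopy's first point is an assumption (the description only says to remove a point from the set). The optional `loose_dist` and `tight_dist` arguments allow a cheaper metric to be used for step 3 and a more accurate one for step 4.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def canopy_clustering(points, t1, t2, loose_dist=euclidean, tight_dist=euclidean):
    """Return a list of (center_index, member_indices) canopies.

    t1 is the loose distance and t2 the tight distance, with t1 > t2.
    """
    if t1 <= t2:
        raise ValueError("t1 (loose) must be greater than t2 (tight)")

    remaining = list(range(len(points)))  # step 1: indices still in the set
    canopies = []

    while remaining:  # step 5: repeat until the set is empty
        # Step 2: remove a point and start a new canopy around it.
        # Assumption: the first remaining point is chosen.
        center = remaining.pop(0)
        members = [center]
        kept = []

        for i in remaining:
            # Step 3: within the loose distance -> member of this canopy.
            in_canopy = loose_dist(points[i], points[center]) < t1
            if in_canopy:
                members.append(i)
            # Step 4: additionally within the tight distance -> drop from the set.
            if in_canopy and tight_dist(points[i], points[center]) < t2:
                continue
            kept.append(i)

        remaining = kept
        canopies.append((center, members))

    return canopies


points = [(0.0, 0.0), (0.5, 0.0), (2.0, 0.0), (10.0, 10.0)]
print(canopy_clustering(points, t1=3.0, t2=1.0))
# [(0, [0, 1, 2]), (2, [2]), (3, [3])]
```

In this example, point 2 lies within the loose distance of point 0 but outside the tight distance, so it joins the first canopy, remains in the set, and later starts a canopy of its own. It therefore appears in two canopies.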

This article uses material from the Wikipedia article Canopy_clustering_algorithm, and is written by contributors. Text is available under a CC BY-SA 4.0 International License; additional terms may apply. Images, videos and audio are available under their respective licenses.

References

[1] McCallum, A.; Nigam, K.; Ungar, L. H. (2000). "Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching". Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 169-178. doi:10.1145/347090.347123

[2] "The Canopies Algorithm". courses.cs.washington.edu. Retrieved 2014-09-06.

[3] Mahout description of Canopy-Clustering. Retrieved 2022-07-02.
