Clustering is a technique in [[unsupervised learning]] for discovering related groups of elements.
An initial evaluation can be used to assess clustering tendency, i.e., whether the data contain meaningful clusters at all, which is necessary for clustering techniques to work. The higher the cluster cohesion and separation, the easier the clusters are to discover. Many methods require the number of clusters to be defined a priori; the silhouette coefficient can help estimate it.
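A minimal sketch of using the silhouette coefficient to pick a cluster count, using scikit-learn. The toy data and the candidate range of $k$ are illustrative assumptions, not part of the note:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data: two well-separated blobs (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])

# Fit a clustering for several candidate k and score each with the
# silhouette coefficient (cohesion vs. separation, higher is better).
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
```

For two clean blobs the silhouette peaks at $k=2$; on real data the curve is flatter and the peak is only a guide.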
## clustering methods
Types of clustering methods include
- [[#partitioning methods]] (e.g., k-means)
- [[#hierarchical methods]] (e.g., dendrogram)
- [[#grid-based methods]]
- [[#density-based methods]]
- [[#probabilistic methods]]
- [[#graph methods]]
### partitioning methods
Partitioning methods partition $n$ objects into $k$ clusters. Partitioning methods include
- [[k-means clustering]]
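A minimal partitioning sketch with scikit-learn's k-means; the toy data and choice of $k=2$ are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Two toy groups of 2-D points (illustrative only).
X = np.vstack([rng.normal(-3, 0.5, (40, 2)), rng.normal(3, 0.5, (40, 2))])

# Partition the n objects into k clusters; k must be chosen a priori.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_            # cluster assignment per object
centers = km.cluster_centers_  # one centroid per cluster
```

Each object gets exactly one label, which is the defining trait of partitioning methods (contrast with the soft assignments of the probabilistic methods below).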
### hierarchical methods
Hierarchical methods attempt to discover levels of clustering. A [[dendrogram]] is a common way to represent a hierarchical clustering. The hierarchy can be built agglomeratively (bottom-up merging) or divisively (top-down splitting).
Hierarchical methods avoid the need to define the number of clusters a priori, but you must have a clear sense of when elements should be grouped or split (the cluster distance threshold).
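A minimal agglomerative sketch with SciPy: build the dendrogram bottom-up, then cut it at a distance threshold instead of fixing the number of clusters. The toy data, the `ward` linkage criterion, and the threshold value are illustrative assumptions:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Two tight toy groups of 2-D points (illustrative only).
X = np.vstack([rng.normal(0, 0.2, (20, 2)), rng.normal(4, 0.2, (20, 2))])

# Agglomerative (bottom-up) merging; ward is one common linkage criterion.
Z = linkage(X, method="ward")

# Cut the dendrogram at a chosen merge-distance threshold: merges cheaper
# than t are kept, more expensive ones are split apart.
labels = fcluster(Z, t=5.0, criterion="distance")
```

The threshold `t` plays the role of the "clear sense of when elements should be grouped or split" noted above.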
### grid-based methods
Grid-based clustering methods apply a grid structure to discover clusters. The complexity of the approach depends on the number of cells specified, but it is easy to parallelize the computation.
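A bare-bones sketch of the grid idea in plain Python: bin points into cells, then treat sufficiently occupied cells as cluster seeds. The data, cell size, and density threshold are illustrative assumptions; a full method would also merge adjacent dense cells:

```python
from collections import Counter

# Toy 2-D points (illustrative): a dense patch near (0, 0), one near (5, 5),
# and two stray points.
points = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
points += [(5 + 0.1 * i, 5 + 0.1 * j) for i in range(5) for j in range(5)]
points += [(2.5, 2.5), (7.5, 1.0)]

cell_size = 1.0        # grid resolution: complexity scales with cell count
density_threshold = 5  # minimum points for a cell to count as dense

# Map each point to its grid cell and count occupancy.
counts = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)

# Dense cells become cluster seeds; merging adjacent dense cells is the
# natural next step and parallelizes easily, since cells are independent.
dense_cells = {cell for cell, c in counts.items() if c >= density_threshold}
```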
### density-based methods
Density-based methods focus on the density of clusters, for example growing a local neighborhood so long as its density does not drop below some threshold.
- [[DBSCAN]]: connected dense neighborhoods
- [[DENCLUE]]: sum of local influence functions
Density-based methods are useful for discovering arbitrarily shaped clusters and are noise tolerant. Some density approaches can be computed in a single pass over the data, i.e., $O(n)$.
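A minimal DBSCAN sketch with scikit-learn; the toy data and the `eps`/`min_samples` values are illustrative assumptions:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
# Two dense toy blobs plus two far-away noise points (illustrative only).
X = np.vstack([
    rng.normal(0, 0.2, (30, 2)),
    rng.normal(5, 0.2, (30, 2)),
    [[10.0, 10.0], [-10.0, -10.0]],
])

# eps: neighborhood radius; min_samples: density threshold for a core point.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN labels noise points -1 rather than forcing them into a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
```

Note the noise tolerance mentioned above: the two stray points are labeled `-1` instead of distorting either cluster.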
### probabilistic methods
In some cases, instead of a single label for a cluster we want to identify a probability distribution over all cluster labels for each object. Under the assumption that the data $X$ were generated by a mixture of probabilistic models $C$, we can model the probability of assignment to each model as
$P(X | C) = \prod_{i=1}^n P(x_i | C) = \prod_{i=1}^n \sum_{j=1}^k w_j f_j(x_i)$
A common probabilistic method is [[expectation maximization]].
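A minimal sketch of EM-fitted soft assignments via scikit-learn's Gaussian mixture, where each component $f_j$ is a Gaussian with weight $w_j$; the toy data and $k=2$ are illustrative assumptions:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
# One-dimensional toy data drawn from two toy groups (illustrative only).
X = np.vstack([rng.normal(0, 0.5, (60, 1)), rng.normal(4, 0.5, (60, 1))])

# Fit a k=2 Gaussian mixture with expectation maximization.
gm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Soft assignment: P(component j | x_i) for every object; each row is a
# probability distribution over the k cluster labels.
probs = gm.predict_proba(X)
```

Unlike the hard labels of a partitioning method, every object gets a full distribution over clusters, which is exactly the probabilistic view described above.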
### graph methods
Graph methods for clustering are used for graph data (e.g., community detection). Graph clusters are well-connected substructures in the graph.
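A minimal community-detection sketch with NetworkX's greedy modularity maximization; the two-clique toy graph is an illustrative assumption:

```python
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities

# Two well-connected cliques joined by a single bridge edge (illustrative).
G = nx.Graph()
G.add_edges_from((i, j) for i in range(5) for j in range(i + 1, 5))
G.add_edges_from((i, j) for i in range(5, 10) for j in range(i + 1, 10))
G.add_edge(4, 5)  # bridge between the two communities

# Communities are the well-connected substructures of the graph.
communities = [set(c) for c in greedy_modularity_communities(G)]
```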
## high-dimensional data
Clustering can suffer in high-dimensional spaces due to the [[curse of dimensionality]]. Dimensionality reduction and subspace clustering are useful. The general idea is to project the data into a lower dimensional space to identify clusters. Projecting these clusters back to high-dimensional space gives you the final clusters. Alternatively, bi-clustering clusters both objects and attributes.
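A minimal sketch of the project-then-cluster idea using PCA followed by k-means in scikit-learn; the toy high-dimensional data, the number of components, and $k$ are illustrative assumptions:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(5)
# High-dimensional toy data: 50 dimensions, of which only 5 separate the
# two groups; the rest are pure noise (illustrative only).
X = np.vstack([rng.normal(0, 1, (50, 50)), rng.normal(0, 1, (50, 50))])
X[:50, :5] += 10  # shift the first group along the informative dimensions

# Project into a low-dimensional space, then cluster there; the labels
# found in the projection carry back to the original objects.
X_low = PCA(n_components=2, random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_low)
```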
## constraint based clustering
Constraints can be introduced to clustering algorithms to focus on specific clustering outcomes, improve computational efficiency, or reflect domain knowledge.