K-means clustering is one of the partitioning methods to cluster $n$ objects into $k$ clusters using the cluster centroid. The algorithm for k-means clustering is quite simple, and runs in $O(nkt)$ time for $t$ iterations.
1. Pick $k$ initial centroids at random (or based on some domain knowledge)
2. Assign each object to the nearest centroid
3. Update each centroid based on the objects assigned to its cluster (compute the new mean position)
4. Continue until clusters stabilize
You must specify $k$ a priori and the method for finding a centroid should be well defined (e.g., mean of numerical data). K-means will not work for non-convex shapes (e.g., a donut). K-means is also sensitive to noise and outliers.
**K-medoids clustering** is a related method that uses the "central" object of the cluster (the object closes to the centroid) to reduce sensitivity to noise and outliers. This method is slightly more computationally expensive as the central object must be identified on each pass.
A basic implementation in [[Python]] is provided below.
```python
import numpy as np
def k_means_clustering(centroids, dataset, max_iter=10):
"""
General K-Means clustering implementation.
Parameters:
centroids (list of list): Initial centroids, e.g., [[x1, y1], [x2, y2], ...]
dataset (list of list): Data points, e.g., [[x, y], [x, y], ...]
max_iter (int): Number of iterations
Returns:
results (list of dict): One entry per iteration with cluster assignments and centroids
"""
K = len(centroids)
dataset = np.array(dataset)
results = []
for iteration in range(max_iter):
clusters = {k: [] for k in range(K)}
# Assignment step
for point in dataset:
distances = [np.linalg.norm(point - c) for c in centroids]
cluster_idx = np.argmin(distances)
clusters[cluster_idx].append(point.tolist())
# Update step
new_centroids = []
for k in range(K):
points = np.array(clusters[k])
if len(points) > 0:
new_centroid = np.mean(points, axis=0).tolist()
else:
# If no points assigned to this cluster, keep old centroid
new_centroid = centroids[k]
new_centroids.append(new_centroid)
results.append({
'iteration': iteration + 1,
'clusters': clusters,
'centroids': new_centroids
})
centroids = new_centroids
return results
```