- Partition-based clustering that produces sphere-like clusters
- Divides data into non-overlapping subsets using a distance metric
- Examples in the same cluster share a similar pattern
(Can be found by minimizing intra-cluster distances)
- Examples in different clusters are dissimilar
(Can be found by maximizing inter-cluster distances)
- see : https:
Process:
1. Randomly select k points/centroids (e.g. k = 3) for clustering
2. Calculate the distance of each datapoint from every centroid; assign each
datapoint to its closest centroid
3. Calculate the SSE (sum of squared errors). The goal is to reduce the SSE
4. Update each centroid to the mean of the datapoints in its cluster
5. Repeat from step 2 until the centroids converge (centroids no longer move
much)
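The steps above can be sketched in NumPy. This is a minimal illustration, not a production implementation (function and variable names are my own); an empty cluster simply keeps its old centroid here:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal k-means sketch. X is an (n_samples, n_features) array."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k random datapoints as the initial centroids
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 2: assign each datapoint to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: move each centroid to the mean of its cluster
        # (keep the old centroid if a cluster ends up empty)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centroids[j] = members.mean(axis=0)
        # Step 5: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Step 3: SSE = sum of squared distances to the assigned centroids
    sse = float(((X - centroids[labels]) ** 2).sum())
    return centroids, labels, sse
```

Usage is e.g. `centroids, labels, sse = kmeans(X, k=3)`.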
Issue: The result may or may not be the best outcome (k-means may converge to a
local optimum)
Resolve: Run the algorithm multiple times with different random starting points
and keep the run with the lowest SSE
Evaluation metrics:
1. Compare with ground truth (if available; normally no ground truth exists)
2. Cluster error: (how tightly the data form a cluster)
a. Average pairwise distance between datapoints in a cluster
b. Average distance between the centroid and its datapoints
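Both cluster-error metrics are straightforward to compute; a small sketch (function names are my own, lower values mean tighter clusters):

```python
import numpy as np

def avg_centroid_distance(X, labels, centroids):
    # metric (b): mean distance from each datapoint to its assigned centroid
    return float(np.linalg.norm(X - centroids[labels], axis=1).mean())

def avg_pairwise_distance(X, labels):
    # metric (a): mean pairwise distance between datapoints in the same
    # cluster, averaged over clusters
    per_cluster = []
    for j in np.unique(labels):
        pts = X[labels == j]
        if len(pts) < 2:
            continue  # a singleton cluster has no pairwise distances
        d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=2)
        # average over the distinct ordered pairs (the diagonal is zero)
        per_cluster.append(d.sum() / (len(pts) * (len(pts) - 1)))
    return float(np.mean(per_cluster))
```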
Choosing the best K: (the elbow method)
- Run the clustering algorithm for a range of k values and plot the SSE
against k. Choose the k at the "elbow" of the curve, where increasing k
further yields only a small decrease in SSE.
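A sketch of computing the elbow curve, reusing the same minimal k-means as above (names are my own). The key point is that SSE always shrinks as k grows, so you look for the bend, not the minimum:

```python
import numpy as np

def sse_for_k(X, k, n_iter=100, seed=0):
    # one k-means run; returns the final SSE (the quantity the elbow plot shows)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        labels = np.linalg.norm(
            X[:, None, :] - centroids[None, :, :], axis=2).argmin(axis=1)
        new = centroids.copy()
        for j in range(k):
            if (labels == j).any():
                new[j] = X[labels == j].mean(axis=0)
        if np.allclose(new, centroids):
            break
        centroids = new
    return float(((X - centroids[labels]) ** 2).sum())

# Plot sse_for_k(X, k) for k = 1..8 and pick the k where the curve bends:
# sse_curve = {k: sse_for_k(X, k) for k in range(1, 9)}
```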