Re-ticketed from @nicobao's Discord message:
Hi there, in one of our conversations using reddwarf, we have a cluster with just 1 participant in it. Our users find it odd. Is there a parameter to make sure we don't have 1-participant clusters?
Thanks! You're totally right -- this is an edge-case of upstream Polis that has not yet been implemented in reddwarf.
Why this happens
A "singleton cluster" occurs because it is a perfectly valid result for k-means. The silhouette scoring that chooses the "best k" for us still scores it, and sometimes that score is high enough for that k to be selected.
There are two obvious ways to avoid singleton clusters:
1. advanced & rigorous: choose a modified KMeans algorithm that allows setting arbitrary min/max bounds on cluster size
2. simple & quick: modify the silhouette scoring so that any "k" whose clustering meets a specific condition (like containing a singleton cluster, or a cluster outside a min/max size threshold) is invalidated by giving it a score of zero
The upstream Polis platform takes approach (2):
https://github.com/compdemocracy/polis/blob/edge/math/src/polismath/math/clusters.clj#L350-L354
A few packages support approach (1):
- https://github.com/joshlk/k-means-constrained
- #todo add more options from this research spreadsheet
Probably the quickest and simplest option is to hack sklearn's `silhouette_score()` function to behave more like the Polis alteration:
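A minimal sketch of what such a wrapper might look like. The function name `silhouette_score_min_size` and the `min_cluster_size` parameter are hypothetical (they are not part of sklearn or reddwarf), and this is not a port of the Polis Clojure code, only the same idea: return zero whenever any cluster is too small.

```python
import numpy as np
from sklearn.metrics import silhouette_score


def silhouette_score_min_size(X, labels, min_cluster_size=2, **kwargs):
    """Silhouette score that is forced to 0.0 if any cluster is too small.

    Hypothetical helper: the name and `min_cluster_size` are assumptions.
    Note that 0.0 is not the worst possible silhouette score (the range is
    -1 to 1), so this only demotes a "k", it does not strictly exclude it.
    """
    labels = np.asarray(labels)
    _, counts = np.unique(labels, return_counts=True)
    if counts.min() < min_cluster_size:
        return 0.0
    return float(silhouette_score(X, labels, **kwargs))


# Two tight groups of three points, plus one far-away outlier.
X = np.array(
    [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10], [30, -30]],
    dtype=float,
)
with_singleton = [0, 0, 0, 1, 1, 1, 2]     # outlier gets its own cluster
without_singleton = [0, 0, 0, 1, 1, 1, 1]  # outlier joins the second group

print(silhouette_score_min_size(X, with_singleton))     # 0.0 (vetoed)
print(silhouette_score_min_size(X, without_singleton))  # plain silhouette score
```

With the default `min_cluster_size=2` this only rules out singletons. Raising it would give a crude version of the min-size threshold mentioned above.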
We'd then simply substitute it wherever we currently use the original scoring function:
`scoring=scoring_function,`
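For illustration, here is roughly how that substitution could look end to end. This assumes the `scoring=` argument belongs to something like sklearn's `GridSearchCV` searching over `n_clusters`, which is a guess rather than a description of the actual reddwarf code. The body of `scoring_function`, the `MIN_CLUSTER_SIZE` constant, and the toy data are all made up for the example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV

MIN_CLUSTER_SIZE = 2  # assumed threshold: 2 rules out singleton clusters


def scoring_function(estimator, X, y=None):
    """Scorer with the (estimator, X, y) signature GridSearchCV expects.

    Returns 0.0 when the fitted clustering contains a too-small cluster,
    otherwise the ordinary silhouette score.
    """
    labels = estimator.predict(X)
    _, counts = np.unique(labels, return_counts=True)
    if len(counts) < 2 or counts.min() < MIN_CLUSTER_SIZE:
        return 0.0
    return float(silhouette_score(X, labels))


# Two groups of four points, plus one point sitting apart from the second group.
X = np.array(
    [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 0], [10, 1], [11, 0], [11, 1],
     [10.5, 5]],
    dtype=float,
)

# Fit and score on the full dataset: a single "fold" where train == test.
all_rows = np.arange(len(X))
search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3]},
    scoring=scoring_function,
    cv=[(all_rows, all_rows)],
)
search.fit(X)

best_k = search.best_params_["n_clusters"]
scores = dict(zip([2, 3], search.cv_results_["mean_test_score"]))
print(best_k, scores)
```

On this toy data, k=3 isolates the stray point as a singleton, so it should be scored 0.0 and lose to k=2.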