Prevent singleton clusters from being displayed #97

@patcon

Re-ticketed from @nicobao's Discord message:

Hi there, in one of our conversations using reddwarf, we have a cluster with just 1 participant in it. Our users find it odd. Is there a parameter to make sure we don't have 1-participant clusters?

Thanks! You're totally right -- upstream Polis handles this edge case, but that handling has not yet been implemented in reddwarf.

Why this happens

This happens because a singleton cluster is a perfectly valid kmeans result. The silhouette scoring algorithm that chooses the "best k" for us scores it like any other clustering, and sometimes that score is high enough for the k to be selected.
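
To make this concrete, here is a minimal sketch on made-up toy data (two tight groups plus one lone point; none of it comes from a real conversation). It only shows that sklearn's KMeans and silhouette_score will happily produce and score a one-member cluster:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (made up for illustration): two tight groups of five points
# plus a single point sitting far away from both.
offsets = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([offsets, offsets + [10, 0], [[10.05, 8.0]]])

# With k=3, the lone point ends up in a cluster of its own.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
cluster_sizes = np.bincount(labels, minlength=3)

# silhouette_score does not penalise the clustering as a whole for this;
# it just averages per-sample scores.
score = silhouette_score(X, labels)

print("cluster sizes:", sorted(cluster_sizes.tolist()))
print("silhouette score:", round(float(score), 3))
```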

There are two obvious ways to avoid singleton clusters:

  1. advanced & rigorous: switching to a modified KMeans algorithm that allows setting arbitrary min/max bounds on cluster size
  2. simple & quick: modifying the silhouette scoring algorithm so that any "k" whose clustering meets a disqualifying condition (like a singleton cluster, or a cluster outside a min/max size threshold) is invalidated with a score of zero

The upstream Polis platform takes approach (2):

https://github.com/compdemocracy/polis/blob/edge/math/src/polismath/math/clusters.clj#L350-L354

A few packages support approach (1):

Probably the quickest and simplest option is to hack the sklearn silhouette_score() function to be more like the Polis alteration:

https://github.com/scikit-learn/scikit-learn/blob/c5497b7f7/sklearn/metrics/cluster/_unsupervised.py#L51-L138
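
A minimal sketch of what such an altered function could look like, written as a thin wrapper rather than a patch to sklearn itself. The name guarded_silhouette_score and the min_cluster_size parameter are hypothetical, and this follows the zero-score idea described above rather than porting the Clojure code line for line:

```python
import numpy as np
from sklearn.metrics import silhouette_score


def guarded_silhouette_score(X, labels, min_cluster_size=2, **kwargs):
    """Silhouette score, forced to 0.0 if any cluster is too small.

    Hypothetical helper: `min_cluster_size=2` rejects singleton clusters.
    Extra keyword arguments are passed through to sklearn's silhouette_score.
    """
    _, counts = np.unique(labels, return_counts=True)
    if counts.min() < min_cluster_size:
        return 0.0
    return float(silhouette_score(X, labels, **kwargs))


# Made-up example: the last point forms a cluster by itself.
X_demo = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0], [20.0, 20.0]])
with_singleton = np.array([0, 0, 1, 1, 2])
without_singleton = np.array([0, 0, 1, 1, 1])

rejected = guarded_silhouette_score(X_demo, with_singleton)
accepted = guarded_silhouette_score(X_demo, without_singleton)
```

A zero is not the lowest possible silhouette value (scores range from -1 to 1), so a zeroed k only loses to candidates that score above zero.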

We'd then simply substitute it wherever we currently pass the original scoring function:

scoring=scoring_function,
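
As a rough illustration of that substitution, here is a sketch that assumes the scoring function is handed to sklearn's GridSearchCV as a callable with the scorer(estimator, X, y) signature. The toy data, the single split that trains and scores on all rows, and the body of scoring_function are assumptions for illustration, not reddwarf's actual code:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.model_selection import GridSearchCV


def scoring_function(estimator, X, y=None):
    """Silhouette scorer that gives 0.0 to any k producing a singleton cluster."""
    labels = estimator.predict(X)
    # bincount yields 0 for an unused label and 1 for a singleton cluster.
    if np.bincount(labels).min() < 2:
        return 0.0
    return silhouette_score(X, labels)


# Toy data (made up): two tight groups of five points plus one lone point.
offsets = np.array([[0, 0], [0.1, 0], [0, 0.1], [0.1, 0.1], [0.05, 0.05]])
X = np.vstack([offsets, offsets + [10, 0], [[10.05, 8.0]]])

# Assumption: fit and score on the full dataset via a single "split".
all_rows = np.arange(len(X))
search = GridSearchCV(
    KMeans(n_init=10, random_state=0),
    param_grid={"n_clusters": [2, 3, 4]},
    scoring=scoring_function,
    cv=[(all_rows, all_rows)],
)
search.fit(X)

best_k = search.best_params_["n_clusters"]
print("selected k:", best_k)
```

On this toy data, every k that isolates the lone point is zeroed out, so the search can only select a k whose clusters all have at least two members.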
