feat: rank all the statements by representativeness for a given cluster

## Existing data:

Currently `run_clustering` returns `repness`, which is useful:

- https://github.com/polis-community/red-dwarf/blob/main/reddwarf/implementations/base.py#L57
- https://github.com/polis-community/red-dwarf/blob/main/reddwarf/types/polis.py#L56
- https://github.com/polis-community/red-dwarf/blob/main/reddwarf/types/polis.py#L42
- https://github.com/polis-community/red-dwarf/blob/1d6ed6bfe793d43bcdfe3a0f7e13e0744d864d09/reddwarf/utils/stats.py#L531-L616

As with Polis, it returns only the 5 most representative statements for each cluster.

## Expected changes

We want to keep `repness`. 

But on top of that, I'd like users to be able to visualize the list of all the statements, *ranked by representativeness for a given cluster*.

We'd call this new data `repness_all` or something similar. It would be a `dict` keyed by `cluster_id`, where each value is another `dict` keyed by `tid`, whose values are the probability of representativeness, similar to `group-aware-consensus`.

Example for 11 statements (`tid` 0 to 10) and 5 clusters:
```js
repness_all = {
    "0": {
        "0": 0.51,
        "1": 0.34,
        "2": 0.68,
        "3": 0.27,
        "4": 0.91,
        "5": 0.16,
        "6": 0.43,
        "7": 0.78,
        "8": 0.03,
        "9": 0.89,
        "10": 0.25
    },
    "1": {
        "0": 0.47,
        "1": 0.22,
        "2": 0.38,
        "3": 0.76,
        "4": 0.64,
        "5": 0.10,
        "6": 0.93,
        "7": 0.50,
        "8": 0.07,
        "9": 0.84,
        "10": 0.19
    },
    "2": {
        "0": 0.12,
        "1": 0.95,
        "2": 0.41,
        "3": 0.66,
        "4": 0.33,
        "5": 0.74,
        "6": 0.27,
        "7": 0.58,
        "8": 0.89,
        "9": 0.20,
        "10": 0.08
    },
    "3": {
        "0": 0.36,
        "1": 0.87,
        "2": 0.59,
        "3": 0.11,
        "4": 0.72,
        "5": 0.03,
        "6": 0.94,
        "7": 0.67,
        "8": 0.49,
        "9": 0.25,
        "10": 0.81
    },
    "4": {
        "0": 0.02,
        "1": 0.60,
        "2": 0.45,
        "3": 0.91,
        "4": 0.18,
        "5": 0.35,
        "6": 0.77,
        "7": 0.23,
        "8": 0.98,
        "9": 0.13,
        "10": 0.55
    }
}
```
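
As a rough illustration of how a consumer could read this structure, here is a minimal Python sketch. The `RepnessAll` alias and the `ranked_statements` helper are hypothetical names, not part of red-dwarf, and the string keys simply mirror the JSON-style example above (the real implementation may well use integer ids).

```python
# Hypothetical type alias: cluster_id -> {tid -> representativeness score}.
RepnessAll = dict[str, dict[str, float]]


def ranked_statements(repness_all: RepnessAll, cluster_id: str) -> list[tuple[str, float]]:
    """Return (tid, score) pairs for one cluster, most representative first."""
    scores = repness_all[cluster_id]
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# A shortened version of the example above (3 statements, 2 clusters).
example: RepnessAll = {
    "0": {"0": 0.51, "1": 0.34, "2": 0.68},
    "1": {"0": 0.47, "1": 0.22, "2": 0.38},
}

for tid, score in ranked_statements(example, "0"):
    print(tid, score)
```

Keeping the scores in a plain nested mapping leaves the sorting to the caller, so a visualization can rank per cluster or compare one statement across clusters.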

## Implementation details

We'd want to base the ranking on this algorithm as a starting point. That is, we keep the algorithm as is for `repness`, and create a new function for `repness_all` based on part of this code:
- https://github.com/polis-community/red-dwarf/blob/1d6ed6bfe793d43bcdfe3a0f7e13e0744d864d09/reddwarf/utils/stats.py#L531-L616

Some early hints for an implementation solution:
- `pick_max` would be irrelevant.
- `mod_out_statement_ids`: this is just moderation, so it stays.
- `confidence` should no longer be passed as a parameter, because we want to rewrite this algorithm to produce a ranking of representativeness instead of applying a given threshold: the highest confidence found gets the highest rank.

	def select_representative_statements(
	    grouped_stats_df: pd.DataFrame,
	    mod_out_statement_ids: list[int] = [],
	    pick_max: int = 5,
	    confidence: float = 0.90,
	) -> PolisRepness:
	    """
	    Selects statistically representative statements from each group cluster.

	    This is expected to match the Polis outputs when all defaults are set.

	    Args:
	        grouped_stats_df (pd.DataFrame): MultiIndex Dataframe of statement statistics, indexed by group and statement.
	        mod_out_statement_ids (list[int]): A list of statements to ignore from selection algorithm
	        pick_max (int): Max number of statements selected per group
	        confidence (float): Percent confidence interval (in decimal), within which selected statements are deemed significant

	    Returns:
	        PolisRepness: A dict object with lists of statements keyed to groups, matching Polis format.
	    """
	    repness = {}
	    # TODO: Should this be done elsewhere? A column in MultiIndex dataframe?
	    mod_out_mask = grouped_stats_df.index.get_level_values("statement_id").isin(
	        mod_out_statement_ids
	    )
	    grouped_stats_df = grouped_stats_df[~mod_out_mask] # type: ignore
	    for gid, group_df in grouped_stats_df.groupby(level="group_id"):
	        # Bring statement_id into regular column.
	        group_df = group_df.reset_index()

	        best_agree = None
	        # Track the best-agree, to bring to top if exists.
	        for _, row in group_df.iterrows():
	            if beats_best_of_agrees(row, best_agree, confidence):
	                best_agree = row

	        sig_filter = lambda row: is_statement_significant(row, confidence)
	        sufficient_statements_row_mask = group_df.apply(sig_filter, axis="columns")
	        sufficient_statements = group_df[sufficient_statements_row_mask]

	        # Track the best, even if doesn't meet sufficient minimum, to have at least one.
	        best_overall = None
	        if len(sufficient_statements) == 0:
	            for _, row in group_df.iterrows():
	                if beats_best_by_repness_test(row, best_overall):
	                    best_overall = row
	        else:
	            # Finalize statements into output format.
	            # TODO: Figure out how to finalize only at end in output. Change repness_metric?
	            sufficient_statements = (
	                pd.DataFrame(
	                    [
	                        format_comment_stats(row)
	                        for _, row in sufficient_statements.iterrows()
	                    ]
	                )
	                # Create a column to sort repnress, then remove.
	                .assign(repness_metric=repness_metric)
	                .sort_values(by="repness_metric", ascending=False)
	                .drop(columns="repness_metric")
	            )

	        if best_agree is not None:
	            best_agree = format_comment_stats(best_agree)
	            best_agree.update({"n-agree": best_agree["n-success"], "best-agree": True})
	            best_head = [best_agree]
	        elif best_overall is not None:
	            best_overall = format_comment_stats(best_overall)
	            best_head = [best_overall]
	        else:
	            best_head = []

	        selected = best_head
	        selected = selected + [
	            row.to_dict()
	            for _, row in sufficient_statements.iterrows()
	            if best_head
	            # Skip any statements already in best_head
	            and best_head[0]["tid"] != row["tid"]
	        ]
	        selected = selected[:pick_max]
	        # Does the work of agrees-before-disagrees sort in polismath, since "a" before "d".
	        selected = sorted(selected, key=lambda row: row["repful-for"])
	        repness[gid] = selected

	    return repness # type:ignore
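
Following the hints above, a new function could look roughly like the sketch below. This is only a sketch under assumptions: `rank_all_statements` and the `score` column are hypothetical names, and how the score is actually derived from the existing statistics is left open. It keeps the moderation filter, drops `pick_max` and `confidence`, and returns every remaining statement per group, highest score first.

```python
from __future__ import annotations

import pandas as pd


def rank_all_statements(
    grouped_stats_df: pd.DataFrame,
    mod_out_statement_ids: list[int] | None = None,
    score_column: str = "score",  # hypothetical column holding a precomputed score
) -> dict[int, dict[int, float]]:
    """Rank all statements per group by score, without a threshold or a cap."""
    mod_out = mod_out_statement_ids or []
    # Drop moderated-out statements, mirroring the existing function.
    keep_mask = ~grouped_stats_df.index.get_level_values("statement_id").isin(mod_out)
    kept_df = grouped_stats_df[keep_mask]

    repness_all: dict[int, dict[int, float]] = {}
    for gid, group_df in kept_df.groupby(level="group_id"):
        # Keep only statement_id in the index, then sort highest score first.
        ranked = (
            group_df[score_column]
            .droplevel("group_id")
            .sort_values(ascending=False)
        )
        # Dicts preserve insertion order, so iteration follows the ranking.
        repness_all[int(gid)] = {int(tid): float(score) for tid, score in ranked.items()}
    return repness_all


# Toy input: 2 groups x 3 statements, indexed like grouped_stats_df.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)],
    names=["group_id", "statement_id"],
)
toy_df = pd.DataFrame({"score": [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]}, index=index)
print(rank_all_statements(toy_df, mod_out_statement_ids=[2]))
```

The keys here are integers. If the result is serialized to JSON they become strings, as in the example under "Expected changes".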
