Description
Existing data:
Currently run_clustering returns repness, which is useful:
- https://github.com/polis-community/red-dwarf/blob/main/reddwarf/implementations/base.py#L57
- https://github.com/polis-community/red-dwarf/blob/main/reddwarf/types/polis.py#L56
- https://github.com/polis-community/red-dwarf/blob/main/reddwarf/types/polis.py#L42
red-dwarf/reddwarf/utils/stats.py
Lines 531 to 616 in 1d6ed6b
```python
def select_representative_statements(
    grouped_stats_df: pd.DataFrame,
    mod_out_statement_ids: list[int] = [],
    pick_max: int = 5,
    confidence: float = 0.90,
) -> PolisRepness:
    """
    Selects statistically representative statements from each group cluster.

    This is expected to match the Polis outputs when all defaults are set.

    Args:
        grouped_stats_df (pd.DataFrame): MultiIndex Dataframe of statement statistics, indexed by group and statement.
        mod_out_statement_ids (list[int]): A list of statements to ignore from selection algorithm
        pick_max (int): Max number of statements selected per group
        confidence (float): Percent confidence interval (in decimal), within which selected statements are deemed significant

    Returns:
        PolisRepness: A dict object with lists of statements keyed to groups, matching Polis format.
    """
    repness = {}
    # TODO: Should this be done elsewhere? A column in MultiIndex dataframe?
    mod_out_mask = grouped_stats_df.index.get_level_values("statement_id").isin(
        mod_out_statement_ids
    )
    grouped_stats_df = grouped_stats_df[~mod_out_mask]  # type: ignore
    for gid, group_df in grouped_stats_df.groupby(level="group_id"):
        # Bring statement_id into regular column.
        group_df = group_df.reset_index()

        best_agree = None
        # Track the best-agree, to bring to top if exists.
        for _, row in group_df.iterrows():
            if beats_best_of_agrees(row, best_agree, confidence):
                best_agree = row

        sig_filter = lambda row: is_statement_significant(row, confidence)
        sufficient_statements_row_mask = group_df.apply(sig_filter, axis="columns")
        sufficient_statements = group_df[sufficient_statements_row_mask]

        # Track the best, even if doesn't meet sufficient minimum, to have at least one.
        best_overall = None
        if len(sufficient_statements) == 0:
            for _, row in group_df.iterrows():
                if beats_best_by_repness_test(row, best_overall):
                    best_overall = row
        else:
            # Finalize statements into output format.
            # TODO: Figure out how to finalize only at end in output. Change repness_metric?
            sufficient_statements = (
                pd.DataFrame(
                    [
                        format_comment_stats(row)
                        for _, row in sufficient_statements.iterrows()
                    ]
                )
                # Create a column to sort repnress, then remove.
                .assign(repness_metric=repness_metric)
                .sort_values(by="repness_metric", ascending=False)
                .drop(columns="repness_metric")
            )

        if best_agree is not None:
            best_agree = format_comment_stats(best_agree)
            best_agree.update({"n-agree": best_agree["n-success"], "best-agree": True})
            best_head = [best_agree]
        elif best_overall is not None:
            best_overall = format_comment_stats(best_overall)
            best_head = [best_overall]
        else:
            best_head = []

        selected = best_head
        selected = selected + [
            row.to_dict()
            for _, row in sufficient_statements.iterrows()
            if best_head
            # Skip any statements already in best_head
            and best_head[0]["tid"] != row["tid"]
        ]
        selected = selected[:pick_max]
        # Does the work of agrees-before-disagrees sort in polismath, since "a" before "d".
        selected = sorted(selected, key=lambda row: row["repful-for"])
        repness[gid] = selected

    return repness  # type:ignore
```
As in Polis, it returns only the top 5 most representative statements for each cluster (the pick_max default).
Expected changes
We want to keep repness.
But on top of that, I'd like users to be able to visualize the list of all the statements, ranked by representativeness for a given cluster.
We'd call this new data repness_all or something. It would be a dict keyed by cluster_id, where each value is another dict keyed by tid, with the probability of representativeness as the value (similar to group-aware-consensus).
Example for 11 statements (tids 0 to 10) and 5 clusters:
```python
repness_all = {
    "0": {
        "0": 0.51,
        "1": 0.34,
        "2": 0.68,
        "3": 0.27,
        "4": 0.91,
        "5": 0.16,
        "6": 0.43,
        "7": 0.78,
        "8": 0.03,
        "9": 0.89,
        "10": 0.25
    },
    "1": {
        "0": 0.47,
        "1": 0.22,
        "2": 0.38,
        "3": 0.76,
        "4": 0.64,
        "5": 0.10,
        "6": 0.93,
        "7": 0.50,
        "8": 0.07,
        "9": 0.84,
        "10": 0.19
    },
    "2": {
        "0": 0.12,
        "1": 0.95,
        "2": 0.41,
        "3": 0.66,
        "4": 0.33,
        "5": 0.74,
        "6": 0.27,
        "7": 0.58,
        "8": 0.89,
        "9": 0.20,
        "10": 0.08
    },
    "3": {
        "0": 0.36,
        "1": 0.87,
        "2": 0.59,
        "3": 0.11,
        "4": 0.72,
        "5": 0.03,
        "6": 0.94,
        "7": 0.67,
        "8": 0.49,
        "9": 0.25,
        "10": 0.81
    },
    "4": {
        "0": 0.02,
        "1": 0.60,
        "2": 0.45,
        "3": 0.91,
        "4": 0.18,
        "5": 0.35,
        "6": 0.77,
        "7": 0.23,
        "8": 0.98,
        "9": 0.13,
        "10": 0.55
    }
}
```
Implementation details
We'd want to base the ranking on this algorithm as a starting point. That is, we keep the algorithm as is for repness, and create a new function based on part of this code for repness_all:
red-dwarf/reddwarf/utils/stats.py
Lines 531 to 616 in 1d6ed6b
Some early hints for an implementation solution:
- pick_max would be irrelevant.
- mod_out_statement_ids: well, that's just moderation, so it stays.
- Then for confidence, it should no longer be passed as a parameter, because we want to rewrite this algorithm to produce a ranking of representativeness instead of applying a given threshold: highest confidence found == highest rank.
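To make the hints above concrete, here is a minimal sketch of the output side only. It assumes a per-(group, statement) representativeness score has already been computed by whatever metric the rewritten algorithm settles on. The function name rank_all_statements and the scores input shape are hypothetical and not part of red-dwarf. The sketch drops pick_max and confidence, keeps the moderation filter, and orders every remaining statement from most to least representative.

```python
from collections import defaultdict
from typing import Iterable, Mapping


# Hypothetical helper, NOT part of red-dwarf. `scores` is an assumed input:
# it maps (group_id, statement_id) to an already-computed score.
def rank_all_statements(
    scores: Mapping[tuple[int, int], float],
    mod_out_statement_ids: Iterable[int] = (),
) -> dict[int, dict[int, float]]:
    moderated = set(mod_out_statement_ids)

    # Collect (tid, score) pairs per group, skipping moderated-out statements.
    per_group: dict[int, list[tuple[int, float]]] = defaultdict(list)
    for (gid, tid), score in scores.items():
        if tid in moderated:
            continue
        per_group[gid].append((tid, score))

    repness_all: dict[int, dict[int, float]] = {}
    for gid in sorted(per_group):
        # Highest score first; ties broken by tid so the order is reproducible.
        ranked = sorted(per_group[gid], key=lambda pair: (-pair[1], pair[0]))
        # Python dicts preserve insertion order, so this dict is itself ranked.
        repness_all[gid] = dict(ranked)
    return repness_all


# Illustrative scores only, reusing a few values from the example above.
scores = {
    (0, 0): 0.51, (0, 1): 0.34, (0, 2): 0.68,
    (1, 0): 0.47, (1, 1): 0.22, (1, 2): 0.38,
}
repness_all = rank_all_statements(scores, mod_out_statement_ids=[1])
ranked_tids_group_0 = list(repness_all[0])
```

The sketch uses integer keys. If the result is serialized to JSON, the keys become strings, as in the repness_all example above. Keeping the moderation filter as a plain argument mirrors how select_representative_statements handles mod_out_statement_ids today.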