feat: rank all the statements by representativeness for a given cluster #73

@nicobao

Existing data:

Currently run_clustering returns repness, which is useful:

  • https://github.com/polis-community/red-dwarf/blob/main/reddwarf/implementations/base.py#L57
  • https://github.com/polis-community/red-dwarf/blob/main/reddwarf/types/polis.py#L56
  • https://github.com/polis-community/red-dwarf/blob/main/reddwarf/types/polis.py#L42

def select_representative_statements(
    grouped_stats_df: pd.DataFrame,
    mod_out_statement_ids: list[int] = [],
    pick_max: int = 5,
    confidence: float = 0.90,
) -> PolisRepness:
    """
    Selects statistically representative statements from each group cluster.

    This is expected to match the Polis outputs when all defaults are set.

    Args:
        grouped_stats_df (pd.DataFrame): MultiIndex Dataframe of statement statistics, indexed by group and statement.
        mod_out_statement_ids (list[int]): A list of statements to ignore from selection algorithm
        pick_max (int): Max number of statements selected per group
        confidence (float): Percent confidence interval (in decimal), within which selected statements are deemed significant

    Returns:
        PolisRepness: A dict object with lists of statements keyed to groups, matching Polis format.
    """
    repness = {}
    # TODO: Should this be done elsewhere? A column in MultiIndex dataframe?
    mod_out_mask = grouped_stats_df.index.get_level_values("statement_id").isin(
        mod_out_statement_ids
    )
    grouped_stats_df = grouped_stats_df[~mod_out_mask]  # type: ignore

    for gid, group_df in grouped_stats_df.groupby(level="group_id"):
        # Bring statement_id into regular column.
        group_df = group_df.reset_index()

        best_agree = None
        # Track the best-agree, to bring to top if exists.
        for _, row in group_df.iterrows():
            if beats_best_of_agrees(row, best_agree, confidence):
                best_agree = row

        sig_filter = lambda row: is_statement_significant(row, confidence)
        sufficient_statements_row_mask = group_df.apply(sig_filter, axis="columns")
        sufficient_statements = group_df[sufficient_statements_row_mask]

        # Track the best, even if doesn't meet sufficient minimum, to have at least one.
        best_overall = None
        if len(sufficient_statements) == 0:
            for _, row in group_df.iterrows():
                if beats_best_by_repness_test(row, best_overall):
                    best_overall = row
        else:
            # Finalize statements into output format.
            # TODO: Figure out how to finalize only at end in output. Change repness_metric?
            sufficient_statements = (
                pd.DataFrame(
                    [
                        format_comment_stats(row)
                        for _, row in sufficient_statements.iterrows()
                    ]
                )
                # Create a column to sort repnress, then remove.
                .assign(repness_metric=repness_metric)
                .sort_values(by="repness_metric", ascending=False)
                .drop(columns="repness_metric")
            )

        if best_agree is not None:
            best_agree = format_comment_stats(best_agree)
            best_agree.update({"n-agree": best_agree["n-success"], "best-agree": True})
            best_head = [best_agree]
        elif best_overall is not None:
            best_overall = format_comment_stats(best_overall)
            best_head = [best_overall]
        else:
            best_head = []

        selected = best_head
        selected = selected + [
            row.to_dict()
            for _, row in sufficient_statements.iterrows()
            if best_head
            # Skip any statements already in best_head
            and best_head[0]["tid"] != row["tid"]
        ]
        selected = selected[:pick_max]
        # Does the work of agrees-before-disagrees sort in polismath, since "a" before "d".
        selected = sorted(selected, key=lambda row: row["repful-for"])
        repness[gid] = selected

    return repness  # type:ignore

As in Polis, it returns only the 5 most representative statements for each cluster (pick_max defaults to 5).

Expected changes

We want to keep repness.

But on top of that, I'd like users to be able to visualize the list of all the statements, ranked by representativeness for a given cluster.

We'd call this new data repness_all or something similar. It would be a dict keyed by cluster_id, where each value is another dict mapping tid to a probability of representativeness, similar to group-aware-consensus.

Example for 11 statements (tid 0 to 10) and 5 clusters:

repness_all = {
    "0": {
        "0": 0.51,
        "1": 0.34,
        "2": 0.68,
        "3": 0.27,
        "4": 0.91,
        "5": 0.16,
        "6": 0.43,
        "7": 0.78,
        "8": 0.03,
        "9": 0.89,
        "10": 0.25
    },
    "1": {
        "0": 0.47,
        "1": 0.22,
        "2": 0.38,
        "3": 0.76,
        "4": 0.64,
        "5": 0.10,
        "6": 0.93,
        "7": 0.50,
        "8": 0.07,
        "9": 0.84,
        "10": 0.19
    },
    "2": {
        "0": 0.12,
        "1": 0.95,
        "2": 0.41,
        "3": 0.66,
        "4": 0.33,
        "5": 0.74,
        "6": 0.27,
        "7": 0.58,
        "8": 0.89,
        "9": 0.20,
        "10": 0.08
    },
    "3": {
        "0": 0.36,
        "1": 0.87,
        "2": 0.59,
        "3": 0.11,
        "4": 0.72,
        "5": 0.03,
        "6": 0.94,
        "7": 0.67,
        "8": 0.49,
        "9": 0.25,
        "10": 0.81
    },
    "4": {
        "0": 0.02,
        "1": 0.60,
        "2": 0.45,
        "3": 0.91,
        "4": 0.18,
        "5": 0.35,
        "6": 0.77,
        "7": 0.23,
        "8": 0.98,
        "9": 0.13,
        "10": 0.55
    }
}
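
A minimal sketch of how a consumer might read this structure, assuming repness_all has the shape shown above (string cluster ids mapping to dicts of tid to score). The helper name rank_statements is hypothetical and not part of red-dwarf; the sample values are copied from the first five entries of cluster "0".

```python
def rank_statements(
    repness_all: dict[str, dict[str, float]],
    cluster_id: str,
) -> list[tuple[str, float]]:
    """Return (tid, score) pairs for one cluster, most representative first."""
    scores = repness_all[cluster_id]
    # Sort by score, highest first; ties keep their original insertion order.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical sample: a subset of cluster "0" from the example above.
sample = {"0": {"0": 0.51, "1": 0.34, "2": 0.68, "3": 0.27, "4": 0.91}}
ranked = rank_statements(sample, "0")
ranked_tids = [tid for tid, _ in ranked]
print(ranked_tids)
```

Because every statement gets a score, a UI could page through the full ranking instead of stopping at the top 5.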

Implementation details

As a starting point, we'd base the ranking on the select_representative_statements algorithm quoted under "Existing data". That function should stay as is for repness; repness_all would come from a new function built on part of its code.

Some early hints for an implementation solution:

  • pick_max would be irrelevant, since every statement is returned.
  • mod_out_statement_ids: that's just moderation, so it stays; moderated-out statements are still excluded.
  • confidence should no longer be passed as a parameter: instead of filtering against a fixed threshold, we want to rewrite the algorithm to rank statements by representativeness, so that the highest confidence found gets the highest rank.
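
The hints above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name rank_all_statements and the score_column parameter are hypothetical, and the sketch assumes a per-row representativeness score already exists as a column of grouped_stats_df. How that score is derived (for example from the existing repness metric or from a confidence level) is left open. Only the "group_id" and "statement_id" index level names come from the existing function.

```python
from typing import Optional

import pandas as pd


def rank_all_statements(
    grouped_stats_df: pd.DataFrame,
    mod_out_statement_ids: Optional[list[int]] = None,
    score_column: str = "score",  # hypothetical column holding the ranking score
) -> dict:
    """Rank every non-moderated statement per group, highest score first."""
    mod_out = mod_out_statement_ids or []
    # Same moderation filter as select_representative_statements.
    mod_out_mask = grouped_stats_df.index.get_level_values("statement_id").isin(mod_out)
    kept = grouped_stats_df[~mod_out_mask]

    repness_all = {}
    for gid, group_df in kept.groupby(level="group_id"):
        ranked = group_df.reset_index().sort_values(by=score_column, ascending=False)
        # No pick_max and no confidence threshold: every statement is kept.
        repness_all[gid] = {
            int(tid): float(score)
            for tid, score in zip(ranked["statement_id"], ranked[score_column])
        }
    return repness_all


# Hypothetical toy data: 2 groups, 3 statements, statement 2 moderated out.
index = pd.MultiIndex.from_tuples(
    [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)],
    names=["group_id", "statement_id"],
)
stats = pd.DataFrame({"score": [0.2, 0.9, 0.5, 0.7, 0.1, 0.4]}, index=index)
result = rank_all_statements(stats, mod_out_statement_ids=[2])
```

The inner dicts rely on insertion order to carry the ranking. Keys here are integers; serializing to JSON would turn them into strings, as in the example format above.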
