Significant BLEU Score Gap Between evaluate and pycocoevalcap in Comma-Separated vs. Period-Separated Lists #693

@Yingshu-Li

Description

Hi,

I encountered a significant discrepancy between BLEU scores computed with evaluate and with pycocoevalcap, even though both are given exactly the same predictions and references.

I ran two parallel experiments in which the predictions and references share the same structure, but one pair uses comma-separated lists while the other uses period-separated short sentences. Within each pair, the prediction and reference differ by a single term (atelectasis vs. pneumonia).

The BLEU scores from evaluate remain high in both cases, while the scores from pycocoevalcap drop significantly in the comma-separated case.
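As far as I can tell, the two libraries also tokenize differently: pycocoevalcap's BLEU splits captions on whitespace only, while evaluate's bleu metric applies SacreBLEU-style tokenizer_13a by default, which separates punctuation into standalone tokens. Here is a minimal sketch of that difference; the regex is only my rough stand-in for tokenizer_13a, not either library's actual code:

import re

sent = "opacity, consolidation, pleural effusion, and atelectasis are present."

# Whitespace splitting (what pycocoevalcap does): commas stay attached,
# so "effusion," and "effusion" are different tokens.
print(sent.split())
# ['opacity,', 'consolidation,', 'pleural', 'effusion,', 'and',
#  'atelectasis', 'are', 'present.']  -> 8 tokens

# Punctuation-separating split (approximating tokenizer_13a):
print(re.findall(r"\w+|[^\w\s]", sent))
# ['opacity', ',', 'consolidation', ',', 'pleural', 'effusion', ',',
#  'and', 'atelectasis', 'are', 'present', '.']  -> 12 tokens

With only 8 tokens per sentence, the single differing word destroys a much larger fraction of the n-grams than it does at 12 tokens.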

Here is the test case:

import evaluate
from pycocoevalcap.bleu.bleu import Bleu

# Load metrics
bleu_eval = evaluate.load("bleu")
bleu_pycoco = Bleu()


# ---------------------------
# Experiment 1: Comma-based
# ---------------------------
preds = ["opacity, consolidation, pleural effusion, and atelectasis are present."]
refs = ["opacity, consolidation, pleural effusion, and pneumonia are present."]
print("evaluate BLEU-4 (comma):", bleu_eval.compute(predictions=preds, references=refs, max_order=4)["bleu"])


gt_dict = {'1': [refs[0]]}
hypo_dict = {'1': [preds[0]]}
# compute_score returns ([BLEU-1, BLEU-2, BLEU-3, BLEU-4], per-sentence scores)
bleu_scores_1 = bleu_pycoco.compute_score(gt_dict, hypo_dict)
print("pycocoevalcap BLEU-4 (comma):", bleu_scores_1[0][3])


# ---------------------------
# Experiment 2: Period-based
# ---------------------------
preds = ["opacity . consolidation . pleural effusion . atelectasis are present ."]
refs = ["opacity . consolidation . pleural effusion . pneumonia are present ."]
print("evaluate BLEU-4 (period):", bleu_eval.compute(predictions=preds, references=refs, max_order=4)["bleu"])


gt_dict = {'1': [refs[0]]}
hypo_dict = {'1': [preds[0]]}
bleu_scores_2 = bleu_pycoco.compute_score(gt_dict, hypo_dict)
print("pycocoevalcap BLEU-4 (period):", bleu_scores_2[0][3])

The output is:

pycocoevalcap BLEU-4 (comma): 0.5946035573327129

evaluate BLEU-4 (period): 0.7016879391277372
pycocoevalcap BLEU-4 (period): 0.7016879389890388
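The comma-case number is exactly what whitespace tokenization predicts: with 8 tokens per side and the differing word at position 6, the hypothesis loses 1 of 8 unigrams, 2 of 7 bigrams, 3 of 6 trigrams, and 3 of 5 4-grams, and the lengths match so there is no brevity penalty. A quick hand check:

# Geometric mean of the clipped n-gram precisions from the 8-token split:
print((7/8 * 5/7 * 3/6 * 2/5) ** 0.25)  # 0.5946035575013605

This agrees with pycocoevalcap's 0.5946035573327129 to nine decimal places (the tiny residual is presumably pycocoevalcap's internal epsilon smoothing). The period-separated pair whitespace-tokenizes to 11 tokens, giving (10/11 * 8/10 * 6/9 * 4/8) ** 0.25 ≈ 0.7017 in both libraries, since tokenizer_13a changes nothing when the punctuation is already space-separated. By the same arithmetic, the 12-token tokenizer_13a split of the comma pair should score (11/12 * 9/11 * 7/10 * 5/9) ** 0.25 ≈ 0.735, which is why evaluate stays high there too.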
