Significant BLEU Score Gap Between evaluate and pycocoevalcap in Comma-Separated vs. Period-Separated Lists #693

@Yingshu-Li

Description

Hi,

I encountered a significant discrepancy between BLEU scores computed with evaluate and with pycocoevalcap, even though both are given exactly the same predictions and references.

I ran two parallel experiments in which the predictions and references share the same structure, but one pair uses comma-separated lists while the other uses period-separated short sentences. Within each pair, the prediction and reference differ by a single term (atelectasis vs. pneumonia).

The BLEU scores from evaluate remain high in both cases, while the scores from pycocoevalcap drop significantly in the comma-separated case.
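As far as I can tell, the two libraries also tokenize differently: pycocoevalcap's BLEU splits captions on whitespace only, while evaluate's bleu metric applies SacreBLEU-style tokenizer_13a by default, which separates punctuation into standalone tokens. Here is a minimal sketch of that difference; the regex is only my rough stand-in for tokenizer_13a, not either library's actual code:

import re

sent = "opacity, consolidation, pleural effusion, and atelectasis are present."

# Whitespace splitting (what pycocoevalcap does): commas stay attached,
# so "effusion," and "effusion" are different tokens.
print(sent.split())
# ['opacity,', 'consolidation,', 'pleural', 'effusion,', 'and',
#  'atelectasis', 'are', 'present.']  -> 8 tokens

# Punctuation-separating split (approximating tokenizer_13a):
print(re.findall(r"\w+|[^\w\s]", sent))
# ['opacity', ',', 'consolidation', ',', 'pleural', 'effusion', ',',
#  'and', 'atelectasis', 'are', 'present', '.']  -> 12 tokens

With only 8 tokens per sentence, the single differing word destroys a much larger fraction of the n-grams than it does at 12 tokens.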

Here is the test case:

import evaluate
from pycocoevalcap.bleu.bleu import Bleu

# Load metrics
bleu_eval = evaluate.load("bleu")
bleu_pycoco = Bleu()


# ---------------------------
# Experiment 1: Comma-based
# ---------------------------
preds = ["opacity, consolidation, pleural effusion, and atelectasis are present."]
refs = ["opacity, consolidation, pleural effusion, and pneumonia are present."]
print("evaluate BLEU-4 (comma):", bleu_eval.compute(predictions=preds, references=refs, max_order=4)["bleu"])


gt_dict = {'1': [refs[0]]}
hypo_dict = {'1': [preds[0]]}
# compute_score returns ([BLEU-1, BLEU-2, BLEU-3, BLEU-4], per-sentence scores)
bleu_scores_1 = bleu_pycoco.compute_score(gt_dict, hypo_dict)
print("pycocoevalcap BLEU-4 (comma):", bleu_scores_1[0][3])


# ---------------------------
# Experiment 2: Period-based
# ---------------------------
preds = ["opacity . consolidation . pleural effusion . atelectasis are present ."]
refs = ["opacity . consolidation . pleural effusion . pneumonia are present ."]
print("evaluate BLEU-4 (period):", bleu_eval.compute(predictions=preds, references=refs, max_order=4)["bleu"])


gt_dict = {'1': [refs[0]]}
hypo_dict = {'1': [preds[0]]}
bleu_scores_2 = bleu_pycoco.compute_score(gt_dict, hypo_dict)
print("pycocoevalcap BLEU-4 (period):", bleu_scores_2[0][3])

The output is:

pycocoevalcap BLEU-4 (comma): 0.5946035573327129

evaluate BLEU-4 (period): 0.7016879391277372
pycocoevalcap BLEU-4 (period): 0.7016879389890388
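The comma-case number is exactly what whitespace tokenization predicts: with 8 tokens per side and the differing word at position 6, the hypothesis loses 1 of 8 unigrams, 2 of 7 bigrams, 3 of 6 trigrams, and 3 of 5 4-grams, and the lengths match so there is no brevity penalty. A quick hand check:

# Geometric mean of the clipped n-gram precisions from the 8-token split:
print((7/8 * 5/7 * 3/6 * 2/5) ** 0.25)  # 0.5946035575013605

This agrees with pycocoevalcap's 0.5946035573327129 to nine decimal places (the tiny residual is presumably pycocoevalcap's internal epsilon smoothing). The period-separated pair whitespace-tokenizes to 11 tokens, giving (10/11 * 8/10 * 6/9 * 4/8) ** 0.25 ≈ 0.7017 in both libraries, since tokenizer_13a changes nothing when the punctuation is already space-separated. By the same arithmetic, the 12-token tokenizer_13a split of the comma pair should score (11/12 * 9/11 * 7/10 * 5/9) ** 0.25 ≈ 0.735, which is why evaluate stays high there too.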
