Description
Hi,
I encountered a significant discrepancy between BLEU scores computed with evaluate and with pycocoevalcap on the same predictions and references.
I ran two parallel experiments with the same content and structure: one formats the findings as a comma-separated list, the other as period-separated short fragments. In each experiment the prediction and reference differ by a single term only (atelectasis vs. pneumonia).
The BLEU score from evaluate stays high in both cases, while the pycocoevalcap score drops noticeably in the comma-separated case.
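As far as I can tell, the two libraries tokenize differently: pycocoevalcap's BLEU splits the raw string on whitespace, so tokens like "effusion," keep the comma attached, while evaluate's default tokenizer (13a-style) separates punctuation into its own tokens. With the comma formatting, the whitespace split yields far fewer tokens, so the single differing word wipes out a much larger fraction of the n-grams. A rough illustration of the two splits (the regex below is only my approximation of punctuation splitting, not the exact tokenizer either library uses):
import re

pred = "opacity, consolidation, pleural effusion, and atelectasis are present."

# whitespace split (what pycocoevalcap's BLEU does internally): punctuation stays attached
print(pred.split())
# ['opacity,', 'consolidation,', 'pleural', 'effusion,', 'and', 'atelectasis', 'are', 'present.']

# rough stand-in for a 13a-style split: punctuation becomes separate tokens
print(re.findall(r"\w+|[^\w\s]", pred))
# ['opacity', ',', 'consolidation', ',', 'pleural', 'effusion', ',', 'and', 'atelectasis', 'are', 'present', '.']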
Here is the test case:
import evaluate
from pycocoevalcap.bleu.bleu import Bleu
# Load metrics
bleu_eval = evaluate.load("bleu")
bleu_pycoco = Bleu()
# ---------------------------
# Experiment 1: Comma-based
# ---------------------------
preds = ["opacity, consolidation, pleural effusion, and atelectasis are present."]
refs = ["opacity, consolidation, pleural effusion, and pneumonia are present."]
print("evaluate BLEU-4 (comma):", bleu_eval.compute(predictions=preds, references=refs, max_order=4)["bleu"])
gt_dict = {'1': [refs[0]]}
hypo_dict = {'1': [preds[0]]}
bleu_scores_1 = bleu_pycoco.compute_score(gt_dict, hypo_dict)
print("pycocoevalcap BLEU-4 (comma):", bleu_scores_1[0][3])
# ---------------------------
# Experiment 2: Period-based
# ---------------------------
preds = ["opacity . consolidation . pleural effusion . atelectasis are present ."]
refs = ["opacity . consolidation . pleural effusion . pneumonia are present ."]
print("evaluate BLEU-4 (period):", bleu_eval.compute(predictions=preds, references=refs, max_order=4)["bleu"])
gt_dict = {'1': [refs[0]]}
hypo_dict = {'1': [preds[0]]}
bleu_scores_2 = bleu_pycoco.compute_score(gt_dict, hypo_dict)
print("pycocoevalcap BLEU-4 (period):", bleu_scores_2[0][3])
The output is:
pycocoevalcap BLEU-4 (comma): 0.5946035573327129
evaluate BLEU-4 (period): 0.7016879391277372
pycocoevalcap BLEU-4 (period): 0.7016879389890388
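If tokenization is indeed the cause, forcing evaluate to split on whitespace should bring its comma-case score down to the pycocoevalcap value. A minimal check of that guess (the bleu metric card documents a tokenizer argument that accepts any callable returning a list of tokens; I haven't verified this exact snippet's output):
import evaluate

bleu_eval = evaluate.load("bleu")

# Force plain whitespace tokenization so evaluate splits exactly like pycocoevalcap.
whitespace_bleu = bleu_eval.compute(
    predictions=["opacity, consolidation, pleural effusion, and atelectasis are present."],
    references=[["opacity, consolidation, pleural effusion, and pneumonia are present."]],
    tokenizer=str.split,
    max_order=4,
)
print("evaluate BLEU-4 (comma, whitespace split):", whitespace_bleu["bleu"])
Is this difference in default tokenization the expected behaviour, or is something else going on here?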