automatic_scientific_quality_metrics

This repository contains the code for the paper "Automatic Evaluation Metrics for Artificially Generated Scientific Research" (https://arxiv.org/abs/2503.05712).

[Figure: Reviewer performance]

Dependencies

Requirements for running the experiments:

  • anaconda3/miniconda3

First clone the repo and then create a new conda environment with the necessary requirements:

git clone https://github.com/NikeHop/automatic_scientific_quality_metrics.git
cd automatic_scientific_quality_metrics
conda create --name scientific_qm python=3.11
conda activate scientific_qm
pip install -e .
python -c "import nltk; nltk.download('punkt_tab')"

All of the following commands must be run inside the scientific_qm environment. Results are logged to Weights & Biases (wandb).
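If you have not authenticated with wandb on your machine yet, log in first (this assumes the wandb CLI was installed together with the project dependencies):

wandb login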

Datasets

The datasets used in this project can be found on Hugging Face.
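As a minimal sketch, a dataset can be loaded with the Hugging Face datasets library; the identifier below is a placeholder, not the actual dataset name, so substitute the one from the Hugging Face page:

python -c "from datasets import load_dataset; print(load_dataset('<huggingface-dataset-id>'))"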

This section describes how to obtain the missing parsed PDFs of the submissions for the following datasets:

  • openreview-iclr
  • openreview-neurips
  • openreview-full

You will need tmux. On Ubuntu/Debian, install it via:

sudo apt update
sudo apt install tmux

All commands should be run from the ./automatic_scientific_qm/data_processing directory. First, set up GROBID by running:

bash ./scripts/setup_grobid.sh

Test whether GROBID runs:

bash ./scripts/run_grobid.sh
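If the script starts the service successfully, you can sanity-check it from another terminal; this assumes GROBID is listening on its default port 8070:

curl http://localhost:8070/api/isalive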

If you run into trouble setting up GROBID, have a look at the issues in the GROBID GitHub repository. Once GROBID works, run the script:

bash ./scripts/complete_openreview_dataset.sh

Train Section Classifier

To train the section classifier on sections from the ACL-OCL dataset, run the following from the ./automatic_scientific_qm/section_classification directory:

python train.py --config ./configs/train_acl_ocl.yaml

Running the code for the first time will embed the dataset using SPECTER2, which takes roughly 2 hours.

Train Score Predictors

To train the score prediction models, run the following command from the ./automatic_scientific_qm/score_prediction directory:

python train.py --config ./configs/name_of_config.yaml

Citation Count prediction ACL-OCL

For the citation count prediction models on the ACL-OCL dataset use/modify the config acl_ocl_citation_prediction.yaml:

Parameter                  | Values
data/paper_representation  | title_abstract, intro, conclusion, background, method, result_experiment, hypothesis
data/context_type          | no_context, references, full_paper
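For example, after setting data/paper_representation and data/context_type in the YAML file, training is started from the ./automatic_scientific_qm/score_prediction directory with:

python train.py --config ./configs/acl_ocl_citation_prediction.yaml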

Score prediction OpenReview

For the score prediction models on the OpenReview dataset use/modify the openreview_score_prediction_*.yaml:

Parameter                  | Values
data/dataset               | openreview-full, openreview-iclr, openreview-neurips
data/score_type            | avg_citations_per_month, mean_score, mean_impact
data/paper_representation  | title_abstract, hypothesis
data/context_type          | no_context, references, full_paper
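For example, assuming one of the config files is named openreview_score_prediction_iclr.yaml (the exact filenames behind the * pattern may differ; check the configs directory):

python train.py --config ./configs/openreview_score_prediction_iclr.yaml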

The first time the code is run for each dataset, the dataset text is embedded using SPECTER2, which takes roughly 3 hours.

Run LLM Reviewers

All commands should be run from the ./automatic_scientific_qm/llm_reviewing directory.

Note 1:

This section requires that the missing PDF submissions for openreview-iclr and openreview-neurips have been parsed (see the Datasets section above).

Note 2:

This section requires an API key for either OpenAI or Anthropic. If you want to use the Anthropic API, store the key in the environment variable ANTHROPIC_API_KEY; if you want to use the OpenAI API, store it in OPENAI_API_KEY.
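For example, in the shell session from which you run the scripts (replace the placeholders with your own keys):

export OPENAI_API_KEY="<your-openai-api-key>"
export ANTHROPIC_API_KEY="<your-anthropic-api-key>"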

Note 3:

Download the necessary data by running:

bash ./scripts/download.sh

Running Review Models

We run the following two LLM reviewers, as well as the trained score prediction models, on a subset of ICLR-2024 and NeurIPS-2024 submissions:

(1) Sakana's LLM reviewer

The first argument specifies the LLM provider; currently only openai and anthropic are supported:

bash ./scripts/run_sakana_reviewer.sh openai
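To use Anthropic instead, pass anthropic as the provider:

bash ./scripts/run_sakana_reviewer.sh anthropic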

(2) Pairwise Comparison Reviewer

bash ./scripts/run_llm_pairwise_reviewer.sh openai

(3) Run score prediction models on subsets

bash ./scripts/run_rsp_reviewer.sh

Acknowledgements

The code makes use of several external repositories.

Citation

If you make use of this codebase, please cite:

@article{hopner2025automatic,
  title={Automatic Evaluation Metrics for Artificially Generated Scientific Research},
  author={H{\"o}pner, Niklas and Eshuijs, Leon and Alivanistos, Dimitrios and Zamprogno, Giacomo and Tiddi, Ilaria},
  journal={arXiv preprint arXiv:2503.05712},
  year={2025}
}
