This repo contains the code for the paper: "Automatic Evaluation Metrics for Artificially Generated Scientific Research".
- anaconda3/miniconda3
First clone the repo and then create a new conda environment with the necessary requirements:
git clone https://github.com/NikeHop/automatic_scientific_quality_metrics.git
cd automatic_scientific_quality_metrics
conda create --name scientific_qm python=3.11
conda activate scientific_qm
pip install -e .
python -c "import nltk; nltk.download('punkt_tab')"
All of the following commands must be run in the scientific_qm environment. Results are logged to wandb.
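Because every later command assumes the scientific_qm environment, a quick check before starting a long job can save time. The sketch below is not part of the repository. It assumes the `CONDA_DEFAULT_ENV` variable that conda sets on activation, and the helper name `environment_problems` is hypothetical.

```python
import os
import sys

EXPECTED_ENV = "scientific_qm"  # environment name from the install steps
EXPECTED_PYTHON = (3, 11)       # version passed to `conda create`


def environment_problems(environ=None, version=None):
    """Return a list of problems; an empty list means the setup looks right."""
    environ = os.environ if environ is None else environ
    version = sys.version_info[:2] if version is None else tuple(version)[:2]
    problems = []
    active = environ.get("CONDA_DEFAULT_ENV")
    if active != EXPECTED_ENV:
        problems.append(f"active conda env is {active!r}, expected {EXPECTED_ENV!r}")
    if version != EXPECTED_PYTHON:
        problems.append(f"Python is {version[0]}.{version[1]}, expected 3.11")
    return problems


if __name__ == "__main__":
    for problem in environment_problems():
        print("warning:", problem)
```

The function takes the environment and version as optional arguments so it can be exercised without actually switching environments.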
The datasets used in this project can be found on Hugging Face:
This section describes how to obtain the missing parsed PDFs of the submissions for the following datasets:
- openreview-iclr
- openreview-neurips
- openreview-full
You will need tmux. For Ubuntu/Debian install it via:
sudo apt update
sudo apt install tmux
All commands should be run from the ./automatic_scientific_qm/data_processing directory. First get GROBID by running:
bash ./scripts/setup_grobid.sh
Test whether GROBID runs:
bash ./scripts/run_grobid.sh
If you run into trouble setting up GROBID, have a look at the git issues here. If GROBID works, run the script:
bash ./scripts/complete_openreview_dataset.sh
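Before starting the steps above, it may help to confirm that tmux is installed and that the commands are being run from the right directory. The following is a minimal sketch rather than repository code: the function name `preflight` is hypothetical, and it only checks for the tool and the three scripts named in this section.

```python
import shutil
from pathlib import Path

# Scripts named in this section, in the order they are run.
SCRIPTS = [
    "scripts/setup_grobid.sh",
    "scripts/run_grobid.sh",
    "scripts/complete_openreview_dataset.sh",
]


def preflight(workdir=".", tools=("tmux",)):
    """Return human-readable problems found before starting the pipeline."""
    problems = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    for script in SCRIPTS:
        if not (Path(workdir) / script).is_file():
            problems.append(f"{script} not found under {workdir}")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("warning:", problem)
```

Run from ./automatic_scientific_qm/data_processing, an empty result suggests the prerequisites listed here are in place. It does not test whether GROBID itself starts.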
To train the section classifier on sections from the ACL-OCL dataset, run the following from the ./automatic_scientific_qm/section_classification directory:
python train.py --config ./configs/train_acl_ocl.yaml
The first run embeds the dataset using SPECTER2, which takes about 2 hours.
To train the score prediction models, run the following command from the ./automatic_scientific_qm/score_prediction directory:
python train.py --config ./configs/name_of_config.yaml
For the citation count prediction models on the ACL-OCL dataset use/modify the config acl_ocl_citation_prediction.yaml:
| Parameter | Values | 
|---|---|
| data/paper_representation | title_abstract, intro, conclusion, background, method, result_experiment, hypothesis | 
| data/context_type | no_context, references, full_paper | 
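The table above spans a small grid of settings. As a rough sketch, the combinations can be enumerated as below. It assumes that a parameter written as `data/paper_representation` corresponds to a `paper_representation` key nested under `data` in the YAML config, which the README does not state explicitly. The function name is hypothetical, and whether every combination is meaningful to train is not confirmed here.

```python
from itertools import product

# Values copied from the table above.
PAPER_REPRESENTATIONS = [
    "title_abstract", "intro", "conclusion", "background",
    "method", "result_experiment", "hypothesis",
]
CONTEXT_TYPES = ["no_context", "references", "full_paper"]


def citation_prediction_grid():
    """Yield one override dict per (paper_representation, context_type) pair."""
    for representation, context in product(PAPER_REPRESENTATIONS, CONTEXT_TYPES):
        yield {
            "data": {
                "paper_representation": representation,
                "context_type": context,
            }
        }


if __name__ == "__main__":
    grid = list(citation_prediction_grid())
    print(len(grid))  # 7 representations x 3 context types = 21
```

Each yielded dict could be merged into a copy of acl_ocl_citation_prediction.yaml before launching a run.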
For the score prediction models on the OpenReview dataset, use or modify the openreview_score_prediction_*.yaml configs:
| Parameter | Values | 
|---|---|
| data/dataset | openreview-full, openreview-iclr, openreview-neurips | 
| data/score_type | avg_citations_per_month, mean_score, mean_impact | 
| data/paper_representation | title_abstract, hypothesis | 
| data/context_type | no_context, references, full_paper | 
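Since the OpenReview configs accept fewer paper representations than the ACL-OCL one, a typo or a value carried over from the other config is easy to miss. The sketch below checks a loaded config against the table above. It is illustrative only: it assumes the config is already a nested dict with a `data` section, and `invalid_settings` is a hypothetical helper, not a function of this codebase.

```python
# Allowed values copied from the table above, keyed by "section/parameter".
ALLOWED = {
    "data/dataset": {"openreview-full", "openreview-iclr", "openreview-neurips"},
    "data/score_type": {"avg_citations_per_month", "mean_score", "mean_impact"},
    "data/paper_representation": {"title_abstract", "hypothesis"},
    "data/context_type": {"no_context", "references", "full_paper"},
}


def invalid_settings(config):
    """Return {key: value} for every listed parameter with an unlisted value."""
    bad = {}
    for key, allowed in ALLOWED.items():
        section, name = key.split("/")
        value = config.get(section, {}).get(name)
        if value not in allowed:
            bad[key] = value
    return bad


if __name__ == "__main__":
    config = {
        "data": {
            "dataset": "openreview-iclr",
            "score_type": "mean_score",
            "paper_representation": "intro",  # listed for ACL-OCL only
            "context_type": "references",
        }
    }
    print(invalid_settings(config))
```

In this example only `data/paper_representation` is reported, because `intro` appears in the ACL-OCL table but not in the OpenReview one.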
The first time the code is run for a given dataset, the text of that dataset is embedded using SPECTER2, which takes about 3 hours.
All commands should be run from the ./automatic_scientific_qm/llm_reviewing directory.
Note 1:
This section requires that the missing PDF submissions for openreview-iclr and openreview-neurips have been parsed (see here).
Note 2:
This section requires an API key for either OpenAI or Anthropic. To use the Anthropic API, store the API key in the environment variable ANTHROPIC_API_KEY. To use the OpenAI API, store the API key in the environment variable OPENAI_API_KEY.
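A small check along the following lines can confirm which provider is usable before a reviewer script is started. This is a sketch under the assumption that a non-empty variable means a usable key; the helper name `available_providers` is hypothetical and no request is sent to either API.

```python
import os

# Environment variable per provider, as named in Note 2.
KEY_VARIABLES = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}


def available_providers(environ=None):
    """Return the providers whose API key variable is set and non-empty."""
    environ = os.environ if environ is None else environ
    return [name for name, var in KEY_VARIABLES.items() if environ.get(var)]


if __name__ == "__main__":
    providers = available_providers()
    if providers:
        print("API key found for:", ", ".join(providers))
    else:
        print("Set OPENAI_API_KEY or ANTHROPIC_API_KEY before continuing.")
```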
Note 3:
Download the necessary data by running
bash ./scripts/download.sh
We run the following two LLM-reviewers on a subset of ICLR-2024 and NeurIPS-2024 submissions:
(1) Sakana Reviewer. The first argument specifies the llm_provider; currently only openai and anthropic are supported.
bash ./scripts/run_sakana_reviewer.sh openai
(2) Pairwise Comparison Reviewer.
bash ./scripts/run_llm_pairwise_reviewer.sh openai
(3) Run score prediction models on subsets
bash ./scripts/run_rsp_reviewer.sh
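When launching these runs from Python, for example in a loop over providers, the commands above could be assembled as in the sketch below. The short labels `sakana`, `pairwise` and `rsp` and the function `reviewer_command` are hypothetical names introduced for illustration. The script paths and the two supported providers are taken from this section, and the function only builds the command without executing it.

```python
import shlex

SUPPORTED_PROVIDERS = ("openai", "anthropic")
LLM_REVIEWER_SCRIPTS = {
    "sakana": "./scripts/run_sakana_reviewer.sh",
    "pairwise": "./scripts/run_llm_pairwise_reviewer.sh",
}
RSP_SCRIPT = "./scripts/run_rsp_reviewer.sh"  # shown above without a provider argument


def reviewer_command(reviewer, provider=None):
    """Build the argv list for one reviewer run without executing it."""
    if reviewer == "rsp":
        return ["bash", RSP_SCRIPT]
    if reviewer not in LLM_REVIEWER_SCRIPTS:
        raise ValueError(f"unknown reviewer: {reviewer!r}")
    if provider not in SUPPORTED_PROVIDERS:
        raise ValueError(f"unsupported llm_provider: {provider!r}")
    return ["bash", LLM_REVIEWER_SCRIPTS[reviewer], provider]


if __name__ == "__main__":
    print(shlex.join(reviewer_command("sakana", "openai")))
```

The resulting list can be passed to `subprocess.run` from the ./automatic_scientific_qm/llm_reviewing directory. Rejecting an unsupported provider up front avoids starting a run that the scripts cannot complete.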
The code makes use of the following repos:
- https://github.com/kermitt2/grobid (Apache-2.0 license)
- https://github.com/allenai/s2orc-doc2json (Apache-2.0 license)
- https://github.com/SakanaAI/AI-Scientist (Apache-2.0 license)
- https://github.com/NoviScl/AI-Researcher (MIT license)
If you make use of this codebase, please cite:
@article{hopner2025automatic,
  title={Automatic Evaluation Metrics for Artificially Generated Scientific Research},
  author={H{\"o}pner, Niklas and Eshuijs, Leon and Alivanistos, Dimitrios and Zamprogno, Giacomo and Tiddi, Ilaria},
  journal={arXiv preprint arXiv:2503.05712},
  year={2025}
}
