REANIMATOR is a versatile framework for enhancing retrieval test collections, starting from a given set of DOIs or PDFs. It leverages a state-of-the-art PDF parsing pipeline (docling), langchain for preprocessing, and generates Umbrela-like synthetic relevance judgments. This allows for the creation of rich, new resources from existing test collections, enabling research on retrieval-augmented generation (RAG) and the impact of different data modalities like tables. For a detailed overview, please refer to our paper: REANIMATOR: Reanimate Retrieval Test Collections with Extracted and Synthetic Resources.
Key features include:
- Automated data extraction from PDFs (full-text and tables) from a given list of DOIs or local files.
- Synthetic relevance labeling using LLMs.
- Optional human-in-the-loop labeling.
- Parallelized processing for efficiency.
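To make the input handling and parallelization above concrete, here is a minimal sketch that is not part of the REANIMATOR API. The names `classify_source`, `process_item`, and `process_all` are hypothetical, the DOI pattern is a common heuristic rather than a full validator, and `process_item` only stands in for real work such as downloading and parsing.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Heuristic DOI pattern (assumption): "10.", a registrant code, "/", a suffix.
DOI_PATTERN = re.compile(r"^10\.\d{4,9}/\S+$")


def classify_source(item: str) -> str:
    """Label an input string as 'doi', 'pdf', or 'unknown' (hypothetical helper)."""
    item = item.strip()
    if DOI_PATTERN.match(item):
        return "doi"
    if item.lower().endswith(".pdf"):
        return "pdf"
    return "unknown"


def process_item(item: str) -> dict:
    """Placeholder for per-document work (download, parse, extract)."""
    return {"source": item, "kind": classify_source(item)}


def process_all(items, max_workers: int = 4) -> list:
    """Process all inputs in a thread pool; map() keeps the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_item, items))


if __name__ == "__main__":
    for result in process_all(["10.1145/3726302.3730342", "papers/local_copy.pdf"]):
        print(result)
```

Threads suit this kind of I/O-bound work; CPU-heavy parsing would more likely call for processes, which this sketch does not cover.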
.
├── Docker_ARM/ # Docker configuration for ARM-based systems
├── Docker_CUDA/ # Docker configuration for NVIDIA GPUs
├── Docker_NO_GPU/ # Docker configuration for non-GPU environments
├── extensions/ # Setup for external services (e.g., labeling)
├── notebooks/ # Jupyter notebooks for examples and analysis
├── pyproject.toml # Project configuration and dependencies
├── README.md # This README file
└── src/
    └── reanimator/ # Source code for the reanimator package
        ├── core.py # Main pipeline orchestration
        ├── downloaders.py # PDF downloading logic
        ├── extractors.py # Content extraction from documents
        ├── labelers.py # Synthetic query and label generation
        ├── models.py # Data models (Document, Query, etc.)
        ├── retrieval.py # Retrieval and ranking pipelines
        └── sources.py # Data source wrappers
It is recommended to install the package in a virtual environment.
- Create a virtual environment:
  python -m venv venv
- Activate the virtual environment:
  - On Windows:
    .\venv\Scripts\activate
  - On macOS and Linux:
    source venv/bin/activate
- Install the package: For development, install in editable mode, which allows you to modify the source code and have the changes immediately reflected:
  pip install -e .
  For a standard installation, use pip to install directly from the GitHub repository:
  pip install git+https://github.com/irgroup/Reanimator.git
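After installing, a quick check from Python can confirm that the package is visible in the active environment. This is a small sketch using only the standard library; the importable name `reanimator` is an assumption based on the `src/reanimator/` layout shown above, and `is_installed` is a hypothetical helper.

```python
import importlib.util


def is_installed(package: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(package) is not None


if __name__ == "__main__":
    # "reanimator" is assumed to be the import name (from src/reanimator/).
    if is_installed("reanimator"):
        print("reanimator is importable")
    else:
        print("reanimator not found; is the virtual environment activated?")
```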
This project also supports Docker for containerized environments. We provide configurations for different hardware setups:
- Docker_CUDA/: For systems with NVIDIA GPUs.
- Docker_ARM/: For ARM-based systems like Apple Silicon.
- Docker_NO_GPU/: For environments without a dedicated GPU.
Each directory contains a docker-compose.yml and Dockerfile for that specific setup. To build and run the Docker container, navigate to the directory corresponding to your hardware and use the following commands:
# Build the Docker image
docker build -t reanimator .
# Start the services
docker-compose up
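If you are unsure which directory matches your machine, the choice can be sketched roughly as below. This is an illustrative heuristic, not part of the project: `pick_docker_dir` is a hypothetical helper, the presence of `nvidia-smi` on the PATH is only a proxy for a usable NVIDIA GPU, and checking for NVIDIA before ARM is an assumption.

```python
import platform
import shutil


def pick_docker_dir(machine: str, has_nvidia_gpu: bool) -> str:
    """Map hardware hints to one of the three Docker directories."""
    if has_nvidia_gpu:
        return "Docker_CUDA"
    if machine.lower() in {"arm64", "aarch64"}:
        return "Docker_ARM"
    return "Docker_NO_GPU"


if __name__ == "__main__":
    # Heuristic (assumption): nvidia-smi on the PATH suggests an NVIDIA GPU.
    has_gpu = shutil.which("nvidia-smi") is not None
    print(pick_docker_dir(platform.machine(), has_gpu))
```

Run it from the repository root, then change into the printed directory before building.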
For a practical guide on how to use the REANIMATOR framework, please see the example notebook: notebooks/example_reanimation.ipynb
This notebook provides a step-by-step walkthrough of the data processing and reanimation pipeline.
The original data resources for this project are available on Google Drive. This data was generated with an older version of the REANIMATOR framework.
If you use REANIMATOR in your research, please cite our paper, to be published at The 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25), July 13-17, 2025 in Padua, Italy:
@inproceedings{engelmann2025reanimator,
author = {Björn Engelmann and Fabian Haak and Philipp Schaer and Mani Erfanian Abdoust and Linus Netze and Meik Bittkowski},
title = {{REANIMATOR:} Reanimate Retrieval Test Collections with Extracted and Synthetic Resources},
booktitle = {Proceedings of the 48th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '25)},
year = {2025},
month = {July},
address = {Padua, Italy},
doi = {10.1145/3726302.3730342}
}
This project is licensed under the MIT License. See LICENSE for details.