Note: The source code of the generation pipeline will be released shortly.
🆕 June 2025: Reward Model & Evaluation Pipeline Released!
Systematically evaluate model-generated rules via symbolic execution; evaluation is fully automatic and verifiable. Supports both evaluation and RLVR (reinforcement learning with verifiable rewards). 👉 Eval & Reward Demo
SLR (Scalable Logical Reasoning) is an end-to-end framework for the systematic evaluation and training of Large Language Models (LLMs) on scalable logical reasoning tasks. Given a user’s task specification, SLR enables scalable, automated synthesis of inductive reasoning tasks with precisely controlled difficulty. For each task, SLR generates (i) a latent ground-truth rule, (ii) an executable validation program for deterministic, symbolic evaluation, and (iii) an instruction prompt for the reasoning task. With SLR, we introduce SLR-Bench—a benchmark of over 19,000 prompts across 20 curriculum levels, progressively increasing in relational, arithmetic, and recursive complexity. Large-scale evaluation shows that while modern LLMs can produce syntactically valid rules, they often struggle with correct logical inference. Recent reasoning-focused LLMs perform better but require much more compute, sometimes exceeding 15,000 completion tokens. Logic-tuning with SLR doubles Llama-3-8B’s accuracy on SLR-Bench, matching Gemini-Flash-Thinking at a fraction of the computational cost. SLR is fully automated, requires no human annotation, ensures dataset novelty, and provides a scalable environment for advancing LLM reasoning capabilities.
from datasets import load_dataset
# Load SLR-Bench test split
ds = load_dataset("AIML-TUDA/SLR-Bench", "v1-All", split="test")
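For a quick sanity check, you can inspect a single task instance. The two column names used below are the same ones referenced by the evaluation snippet further down; the remaining columns (e.g., the instruction prompt) are listed on the dataset card.
sample = ds[0]
print(sample["ground-truth rule"])   # latent rule the model is expected to recover
print(sample["validation program"])  # Prolog program used for symbolic evaluation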
Requires the `evaluate` library and a Prolog interpreter installed on your system (e.g., SWI-Prolog). Install the required dependencies via:
pip install evaluate
sudo apt-get install swi-prolog
from evaluate import load
symbolic_judge = load("AIML-TUDA/VerifiableRewardsForScalableLogicalReasoning")
rules = ds["ground-truth rule"] # For demo only—use model predictions in practice
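# Each reference pairs a task's validation program with the predicates the judge should check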
references = [
{
"validation_program": p,
"evaluation_config": {
"positive_predicate": "eastbound",
"negative_predicate": "westbound"
}
} for p in ds["validation program"]
]
results = symbolic_judge.compute(predictions=rules, references=references)
print(results)
Note: For real evaluation, replace `rules` with your model's predicted rules; the ground-truth rules above are used for demonstration only.
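To make that substitution concrete, the snippet below scores a single hand-written rule against the first task's validation program. The rule text is purely hypothetical (written for the eastbound/westbound train domain configured above), not a benchmark rule or an actual model output.
# Hypothetical predicted rule, written by hand for illustration only
predicted_rule = "eastbound(T) :- has_car(T, C), short(C), closed(C)."

single_reference = [{
    "validation_program": ds[0]["validation program"],
    "evaluation_config": {
        "positive_predicate": "eastbound",
        "negative_predicate": "westbound",
    },
}]

# The symbolic judge executes the rule against the validation program and checks
# whether it separates the positive and negative examples
print(symbolic_judge.compute(predictions=[predicted_rule], references=single_reference))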
- 🔨 Automatic Task Generation: Synthesize new inductive reasoning tasks with controllable complexity, novel logic rules, and natural language prompts—no need for human annotation.
- 🧩 Programmable & Scalable: Specify your own logic vocabulary, grammar, rule distributions, and task parameters; supports curriculum-style scaling and out-of-distribution task creation (a sketch of such a specification follows this list).
- 🧠 Symbolic, Automated Evaluation: Deterministically verify LLM outputs via the validation program, not MCQA, LLM judge, or exact matching.
- 📈 Curriculum Learning: Use SLR-Bench, a structured 20-level benchmark, to evaluate and train models across a wide range of logical challenges.
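The generation pipeline itself is not yet released (see the note at the top), so its exact interface may differ. The dictionary below is only a hypothetical sketch of the kind of information a task specification has to cover, mirroring the parameters listed above; every key name is illustrative rather than part of a published API.
# Hypothetical task specification (all key names are illustrative assumptions)
task_spec = {
    "vocabulary": {  # predicates and labels the generated rules may use
        "predicates": ["has_car/2", "short/1", "closed/1"],
        "labels": {"positive": "eastbound", "negative": "westbound"},
    },
    "grammar": "definite clauses",  # rule language the generator samples from
    "rule_distribution": {"max_body_literals": 3, "allow_recursion": False},
    "task_parameters": {"num_examples": 10, "curriculum_level": 5},
}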
SLR: End-to-end pipeline for logic task generation, deterministic symbolic evaluation, and logic-based LLM training.
We instantiate SLR as SLR-Bench, an automatically generated reasoning benchmark for LLMs. SLR-Bench comprises 19,000+ prompts spanning 20 curriculum levels, from simple attribute lookups to advanced relational, arithmetic, and recursive tasks.
SLR-Bench: As logic complexity rises, model accuracy drops—exposing limits of contemporary LLMs and creating new challenges for future models.
Below is the leaderboard for various LLMs on SLR-Bench (V0.1):
Reasoning LLMs outperform base LLMs on logical accuracy but incur much higher compute requirements. Logic-tuning with SLR dramatically boosts performance for base models at low cost.
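To experiment with logic-tuning yourself, the training prompts can be loaded in the same way as the test split above; the split name below is an assumption, so check the dataset card for the exact configurations and splits.
from datasets import load_dataset

# Assumption: a "train" split is available under the same config as the test split used above
train_ds = load_dataset("AIML-TUDA/SLR-Bench", "v1-All", split="train")
print(train_ds)  # inspect size and columns before fine-tuning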
@misc{helff2025slrautomatedsynthesisframework,
title={SLR: An Automated Synthesis Framework for Scalable Logical Reasoning},
author={Lukas Helff and Ahmad Omar and Felix Friedrich and Wolfgang Stammer and Antonia Wüst and Tim Woydt and Rupert Mitchell and Patrick Schramowski and Kristian Kersting},
year={2025},
eprint={2506.15787},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.15787},
}