Note: The source code of the generation pipeline will be released shortly.
🆕 June 2025: Reward Model & Evaluation Pipeline Released!
Systematically evaluate model-generated rules via symbolic execution; evaluation is fully automatic and verifiable. Supports both evaluation and RLVR (reinforcement learning with verifiable rewards). 👉 Eval & Reward Demo
SLR (Scalable Logical Reasoning) is an end-to-end framework for the systematic evaluation and training of Large Language Models (LLMs) on scalable logical reasoning tasks. Given a user’s task specification, SLR enables scalable, automated synthesis of inductive reasoning tasks with precisely controlled difficulty. For each task, SLR generates (i) a latent ground-truth rule, (ii) an executable validation program for deterministic, symbolic evaluation, and (iii) an instruction prompt for the reasoning task. With SLR, we introduce SLR-Bench—a benchmark of over 19,000 prompts across 20 curriculum levels, progressively increasing in relational, arithmetic, and recursive complexity. Large-scale evaluation shows that while modern LLMs can produce syntactically valid rules, they often struggle with correct logical inference. Recent reasoning-focused LLMs perform better but require much more compute, sometimes exceeding 15,000 completion tokens. Logic-tuning with SLR doubles Llama-3-8B’s accuracy on SLR-Bench, matching Gemini-Flash-Thinking at a fraction of the computational cost. SLR is fully automated, requires no human annotation, ensures dataset novelty, and provides a scalable environment for advancing LLM reasoning capabilities.
from datasets import load_dataset
# Load SLR-Bench test split
ds = load_dataset("AIML-TUDA/SLR-Bench", "v1-All", split="test")
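For a quick sanity check, you can inspect a single task instance. The two column names used below are the same ones referenced by the evaluation snippet further down; the remaining columns (e.g., the instruction prompt) are listed on the dataset card.
sample = ds[0]
print(sample["ground-truth rule"])   # latent rule the model is expected to recover
print(sample["validation program"])  # Prolog program used for symbolic evaluation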
Requires the `evaluate` library and a Prolog interpreter installed on your system (e.g., SWI-Prolog). Install the required dependencies via:
pip install evaluate
sudo apt-get install swi-prolog
from evaluate import load
symbolic_judge = load("AIML-TUDA/VerifiableRewardsForScalableLogicalReasoning")
rules = ds["ground-truth rule"] # For demo only—use model predictions in practice
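# Each reference pairs a task's validation program with the predicates the judge should check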
references = [
{
"validation_program": p,
"evaluation_config": {
"positive_predicate": "eastbound",
"negative_predicate": "westbound"
}
} for p in ds["validation program"]
]
results = symbolic_judge.compute(predictions=rules, references=references)
print(results)
Note: For real evaluation, replace `rules` with your model's predicted rules; the ground-truth rules above are used for demonstration only.
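To make that substitution concrete, the snippet below scores a single hand-written rule against the first task's validation program. The rule text is purely hypothetical (written for the eastbound/westbound train domain configured above), not a benchmark rule or an actual model output.
# Hypothetical predicted rule, written by hand for illustration only
predicted_rule = "eastbound(T) :- has_car(T, C), short(C), closed(C)."

single_reference = [{
    "validation_program": ds[0]["validation program"],
    "evaluation_config": {
        "positive_predicate": "eastbound",
        "negative_predicate": "westbound",
    },
}]

# The symbolic judge executes the rule against the validation program and checks
# whether it separates the positive and negative examples
print(symbolic_judge.compute(predictions=[predicted_rule], references=single_reference))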
- 🔨 Automatic Task Generation: Synthesize new inductive reasoning tasks with controllable complexity, novel logic rules, and natural language prompts—no need for human annotation.
- 🧩 Programmable & Scalable: Specify your own logic vocabulary, grammar, rule distributions, and task parameters; supports curriculum-style scaling and out-of-distribution task creation (a sketch of such a specification follows this list).
- 🧠 Symbolic, Automated Evaluation: Deterministically verify LLM outputs via the validation program, not MCQA, LLM judge, or exact matching.
- 📈 Curriculum Learning: Use SLR-Bench, a structured 20-level benchmark, to evaluate and train models across a wide range of logical challenges.
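The generation pipeline itself is not yet released (see the note at the top), so its exact interface may differ. The dictionary below is only a hypothetical sketch of the kind of information a task specification has to cover, mirroring the parameters listed above; every key name is illustrative rather than part of a published API.
# Hypothetical task specification (all key names are illustrative assumptions)
task_spec = {
    "vocabulary": {  # predicates and labels the generated rules may use
        "predicates": ["has_car/2", "short/1", "closed/1"],
        "labels": {"positive": "eastbound", "negative": "westbound"},
    },
    "grammar": "definite clauses",  # rule language the generator samples from
    "rule_distribution": {"max_body_literals": 3, "allow_recursion": False},
    "task_parameters": {"num_examples": 10, "curriculum_level": 5},
}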
SLR: End-to-end pipeline for logic task generation, deterministic symbolic evaluation, and logic-based LLM training.
We instantiate SLR as SLR-Bench, an automatically generated reasoning benchmark for LLMs. SLR-Bench comprises 19,000+ prompts spanning 20 curriculum levels, from simple attribute lookups to advanced relational, arithmetic, and recursive tasks.
SLR-Bench: As logic complexity rises, model accuracy drops—exposing limits of contemporary LLMs and creating new challenges for future models.
Below is the leaderboard for various LLMs on SLR-Bench (V0.1):
Reasoning LLMs outperform base LLMs on logical accuracy but incur much higher compute requirements. Logic-tuning with SLR dramatically boosts performance for base models at low cost.
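To experiment with logic-tuning yourself, the training prompts can be loaded in the same way as the test split above; the split name below is an assumption, so check the dataset card for the exact configurations and splits.
from datasets import load_dataset

# Assumption: a "train" split is available under the same config as the test split used above
train_ds = load_dataset("AIML-TUDA/SLR-Bench", "v1-All", split="train")
print(train_ds)  # inspect size and columns before fine-tuning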
@misc{helff2025slrautomatedsynthesisframework,
title={SLR: An Automated Synthesis Framework for Scalable Logical Reasoning},
author={Lukas Helff and Ahmad Omar and Felix Friedrich and Wolfgang Stammer and Antonia Wüst and Tim Woydt and Rupert Mitchell and Patrick Schramowski and Kristian Kersting},
year={2025},
eprint={2506.15787},
archivePrefix={arXiv},
primaryClass={cs.AI},
url={https://arxiv.org/abs/2506.15787},
}