A research framework for evaluating how large language models incorporate different styles of feedback across multiple reasoning domains.
This project implements a unified framework for testing the ability of large language models (LLMs) to use different types of feedback:
- Binary feedback: Simple correct/incorrect signals
- Self-generated feedback: Model-generated reflective feedback
- Strong-model feedback: External model-generated feedback
The framework runs multiple iterations of generation and refinement across various datasets (MMLU, MMLU-Pro, GPQA, MATH-500, AIME 2024, PopQA, TriviaQA, arithmetic, and hexadecimal multiplication) to measure iterative self-improvement capabilities.
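At a high level, each experiment is a generate-check-refine loop: the model answers, the answer is checked against the gold label, and (depending on the flags below) some form of feedback is appended to the context before the next attempt. The sketch below is a simplified illustration of that loop, not the actual logic in `openai_async_process.py`; the `generate`, `give_feedback`, and `check` callables stand in for model and grading calls.

```python
from typing import Callable, List, Tuple

def refine_loop(
    question: str,
    gold: str,
    generate: Callable[[List[str]], str],            # agent model call on the running context
    give_feedback: Callable[[List[str], str], str],  # binary / self-generated / strong-model feedback
    check: Callable[[str, str], bool],               # answer extraction + comparison to gold
    iterations: int = 10,
) -> Tuple[int, bool, str]:
    """Generic generate -> check -> refine loop (illustrative only)."""
    history = [question]
    answer = ""
    for it in range(iterations):
        answer = generate(history)
        if check(answer, gold):
            return it, True, answer               # solved at iteration `it`
        # Otherwise append the answer and the feedback, then try again.
        history += [answer, give_feedback(history, answer)]
    return iterations - 1, False, answer          # never solved within the budget
```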
The logs generated by Llama-4-Maverick and the feedback model can be found here. Additional model logs will be added shortly.
- Python 3.9+
- vLLM 0.8.3+ (for model serving)
- OpenAI API key (optional, for strong-model feedback)
git clone https://github.com/JHU-CLSP/Feedback-Friction.git
cd Feedback-Friction
pip install vllm==0.8.3 datasets
pip install -r requirements.txt
Set your OpenAI API key if using strong-model feedback:
export OPENAI_API_KEY="your-api-key-here"
All experiments are driven by `openai_async_process.py`. The basic command structure is:
python openai_async_process.py \
--dataset DATASET \
--agent_model MODEL_NAME \
--base_url BASE_URL \
--ports PORT_LIST \
--write_file OUTPUT_FILE \
--iterations NUM_ITERATIONS \
[FEEDBACK_OPTIONS]
Basic usage (binary feedback only):
python openai_async_process.py \
--dataset gpqa \
--agent_model meta-llama/Llama-3.3-70B-Instruct \
--base_url http://c007 \
--ports 1233 \
--write_file gpqa_log.jsonl \
--iterations 10
Self-generated feedback:
python openai_async_process.py \
--dataset gpqa \
--agent_model meta-llama/Llama-3.3-70B-Instruct \
--base_url http://c007 \
--ports 1233 \
--write_file gpqa_log.jsonl \
--iterations 10 \
--use_feedback
Process-level feedback:
python openai_async_process.py \
--dataset gpqa \
--agent_model meta-llama/Llama-3.3-70B-Instruct \
--base_url http://c007 \
--ports 1233 \
--write_file gpqa_log.jsonl \
--iterations 10 \
--use_feedback \
--use_process_feedback
Strong-model feedback (requires OpenAI API key):
python openai_async_process.py \
--dataset gpqa \
--agent_model meta-llama/Llama-3.3-70B-Instruct \
--base_url http://c007 \
--ports 1233 \
--write_file gpqa_log.jsonl \
--iterations 10 \
--use_feedback \
--use_process_feedback \
--use_openai
| Option | Type | Default | Description |
|---|---|---|---|
| `--dataset` | str | "math" | Dataset to evaluate (see supported datasets below) |
| `--agent_model` | str | "meta-llama/Meta-Llama-3-8B-Instruct" | Model name for vLLM server |
| `--write_file` | str | "output_arc.jsonl" | Output file path |
| `--base_url` | str | "http://c004" | vLLM server base URL |
| `--ports` | str | "1233_1234_1235_1236" | Underscore-separated server ports |
| `--temperature` | float | 0.0 | Sampling temperature |
| `--iterations` | int | 10 | Number of feedback iterations |
| `--proportion` | float | 1.0 | Fraction of dataset to use (0-1) |
| `--use_feedback` | flag | False | Enable self-generated feedback |
| `--use_process_feedback` | flag | False | Enable process-level feedback |
| `--use_openai` | flag | False | Use OpenAI for feedback generation |
| `--shuffle` | flag | False | Shuffle MCQ answer choices between iterations |
| `--binary_hint` | flag | False | Provide hints about previous incorrect choices |
| `--in_temp` | flag | False | Increase temperature each iteration |
| `--best_of_n` | flag | False | Enable best-of-n sampling per round |
| `--logprobs` | int | None | Number of log probabilities to return |
- MMLU: Massive Multitask Language Understanding
- MMLU-Pro: Enhanced version of MMLU
- GPQA: Graduate-level Google-Proof Q&A
- MATH: MATH-500 mathematical reasoning
- AIME 2024: American Invitational Mathematics Examination
- TriviaQA: Trivia question answering
- PopQA: Popular question answering
- Custom Simple: 5-digit decimal multiplication
- Hex: 5-digit hexadecimal multiplication
Deprecated: GSM8K, GSM8K-Symbolic (no longer supported)
- Binary feedback: Provides only correct/incorrect signals after each attempt.
- Self-generated feedback: The model generates its own reflective feedback about errors.
- Process-level feedback: Includes detailed reasoning process in feedback generation.
- Strong-model feedback: Uses OpenAI's models to generate high-quality feedback.
Results are saved as JSONL files with the following fields:
- question: Complete interaction history with original question
- normalized_answer: Ground truth answer
- normalized_prediction: Extracted model prediction
- full_response: Raw model response for current iteration
- feedback: Generated feedback (if feedback is enabled)
- response_probs: Average log probability per token
- is_correct: Whether current iteration is correct
- iteration: Current iteration number (starting from 0)
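For a quick summary, a results file can be tallied per iteration with a few lines of Python. This is only a sketch: it assumes each iteration of each question is written as its own JSON line with the `iteration` and `is_correct` fields above, and `gpqa_log.jsonl` is just an example path.

```python
import json
from collections import defaultdict

# Tally accuracy per feedback iteration from a results file
# (assumes one JSON object per line with "iteration" and "is_correct" fields).
correct = defaultdict(int)
total = defaultdict(int)
with open("gpqa_log.jsonl") as f:
    for line in f:
        record = json.loads(line)
        it = record["iteration"]
        total[it] += 1
        correct[it] += int(record["is_correct"])

for it in sorted(total):
    print(f"iteration {it}: {correct[it] / total[it]:.3f} accuracy ({total[it]} examples)")
```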
- `openai_async_process.py`: Main experiment runner with multiple sampling and best-of-n logic
- `utils.py`: Core utilities and dataset handling
- `error_analysis.py`: Feedback-based iterative improvement system (requires OpenAI API)
- `oracle_beam_search.py`: Oracle upper bound evaluation via large beam search sampling
- `digit_multiplication/`: Specialized digit multiplication modules
  - `decimal.py`: 5-6 digit decimal multiplication with step-by-step distributive property hints
  - `hexadecimal.py`: 5-6 digit hexadecimal multiplication with base-16 step-by-step explanations
- `start_server.sh`: Unified vLLM server startup script for both 70B and 405B models
The `oracle_beam_search.py` script provides an upper bound performance estimate by generating many responses per question and checking if any are correct. This helps evaluate the theoretical maximum accuracy achievable with larger beam sizes.
Usage:
python oracle_beam_search.py \
--dataset math \
--agent_model meta-llama/Llama-3.1-70B-Instruct \
--attempts 10 \
--gens 10 \
--write_file oracle_results.jsonl
This generates attempts × gens total responses per question (100 in the example above) and reports the percentage of questions with at least one correct answer.
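The oracle metric itself is straightforward: a question counts as solved if at least one of its sampled responses is correct. A minimal, self-contained illustration of that computation (independent of the actual output schema of `oracle_beam_search.py`):

```python
from typing import List

def oracle_accuracy(flags_per_question: List[List[bool]]) -> float:
    """Fraction of questions with at least one correct sample among attempts x gens responses."""
    solved = sum(any(flags) for flags in flags_per_question)
    return solved / len(flags_per_question)

# Example: 3 questions with 4 samples each; 2 of 3 have at least one correct sample.
print(oracle_accuracy([
    [False, True, False, False],
    [False, False, False, False],
    [True, True, False, True],
]))  # 0.666...
```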
The framework includes specialized datasets for testing arithmetic reasoning:
- Purpose: Test systematic arithmetic reasoning with 5-6 digit numbers
- Method: Uses distributive property breakdown (e.g., 12345 = 12000 + 345)
- Hints: Step-by-step partial product computation and summation
- Purpose: Test base-16 arithmetic reasoning with numbers that look decimal but are interpreted in hex
- Method: Digit-by-digit multiplication in base 16 with proper carry handling
- Verification: Automatically validates against built-in hex arithmetic
Both datasets help evaluate whether models can follow systematic computational procedures rather than relying on memorized arithmetic facts.
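For intuition, the procedures these datasets target can be reproduced in a few lines of Python. The breakdown below mirrors the distributive-property hint style described above, and the hex check mirrors validation via built-in base-16 arithmetic; this is an illustration, not the actual code in `digit_multiplication/`.

```python
# Distributive-property breakdown for decimal multiplication,
# e.g. 12345 * 6789 = (12000 + 345) * 6789 = 12000*6789 + 345*6789.
a, b = 12345, 6789
a_high, a_low = (a // 1000) * 1000, a % 1000        # 12000 and 345
partial_products = [a_high * b, a_low * b]
assert sum(partial_products) == a * b
print(partial_products, sum(partial_products))       # [81468000, 2342205] 83810205

# Hexadecimal multiplication: digit strings that "look decimal" but are read in base 16.
x_hex, y_hex = "12345", "67890"
product = int(x_hex, 16) * int(y_hex, 16)
print(hex(product))                                  # ground truth for the hex dataset
```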
If you use this repo, please cite the original paper:
@article{feedback_friction_2025,
title={FEEDBACK FRICTION: LLMs Struggle to Fully Incorporate External Feedback},
author={Dongwei Jiang and Alvin Zhang and Andrew Wang and Nicholas Andrews and Daniel Khashabi},
journal={arXiv preprint arXiv:2506.11930},
year={2025}
}