A research framework for evaluating how large language models incorporate different styles of feedback across multiple reasoning domains.
This project implements a unified framework for testing the ability of large language models (LLMs) to use different types of feedback:
- Binary feedback: Simple correct/incorrect signals
- Self-generated feedback: Model-generated reflective feedback
- Strong-model feedback: External model-generated feedback
The framework runs multiple iterations of generation and refinement across various datasets (MMLU, MMLU-Pro, GPQA, MATH-500, AIME 2024, PopQA, TriviaQA, arithmetic, and hexadecimal multiplication) to measure iterative self-improvement capabilities.
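At a high level, each experiment is a generate-check-refine loop: the model answers, the answer is checked against the gold label, and (depending on the flags below) some form of feedback is appended to the context before the next attempt. The sketch below is a simplified illustration of that loop, not the actual logic in `openai_async_process.py`; the `generate`, `give_feedback`, and `check` callables stand in for model and grading calls.

```python
from typing import Callable, List, Tuple

def refine_loop(
    question: str,
    gold: str,
    generate: Callable[[List[str]], str],            # agent model call on the running context
    give_feedback: Callable[[List[str], str], str],  # binary / self-generated / strong-model feedback
    check: Callable[[str, str], bool],               # answer extraction + comparison to gold
    iterations: int = 10,
) -> Tuple[int, bool, str]:
    """Generic generate -> check -> refine loop (illustrative only)."""
    history = [question]
    answer = ""
    for it in range(iterations):
        answer = generate(history)
        if check(answer, gold):
            return it, True, answer               # solved at iteration `it`
        # Otherwise append the answer and the feedback, then try again.
        history += [answer, give_feedback(history, answer)]
    return iterations - 1, False, answer          # never solved within the budget
```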
The logs generated by Llama-4-Maverick and the feedback model can be found here. Additional model logs will be added shortly.
- Python 3.9+
- vLLM 0.8.3+ (for model serving)
- OpenAI API key (optional, for strong-model feedback)
git clone https://github.com/JHU-CLSP/Feedback-Friction.git
cd Feedback-Friction
pip install vllm==0.8.3 datasets
pip install -r requirements.txt
Set your OpenAI API key if using strong-model feedback:
export OPENAI_API_KEY="your-api-key-here"
All experiments are driven by `openai_async_process.py`. The basic command structure is:
python openai_async_process.py \
--dataset DATASET \
--agent_model MODEL_NAME \
--base_url BASE_URL \
--ports PORT_LIST \
--write_file OUTPUT_FILE \
--iterations NUM_ITERATIONS \
[FEEDBACK_OPTIONS]
Basic usage (binary feedback only):
python openai_async_process.py \
--dataset gpqa \
--agent_model meta-llama/Llama-3.3-70B-Instruct \
--base_url http://c007 \
--ports 1233 \
--write_file gpqa_log.jsonl \
--iterations 10
Self-generated feedback:
python openai_async_process.py \
--dataset gpqa \
--agent_model meta-llama/Llama-3.3-70B-Instruct \
--base_url http://c007 \
--ports 1233 \
--write_file gpqa_log.jsonl \
--iterations 10 \
--use_feedback
Process-level feedback:
python openai_async_process.py \
--dataset gpqa \
--agent_model meta-llama/Llama-3.3-70B-Instruct \
--base_url http://c007 \
--ports 1233 \
--write_file gpqa_log.jsonl \
--iterations 10 \
--use_feedback \
--use_process_feedback
Strong-model feedback (requires OpenAI API key):
python openai_async_process.py \
--dataset gpqa \
--agent_model meta-llama/Llama-3.3-70B-Instruct \
--base_url http://c007 \
--ports 1233 \
--write_file gpqa_log.jsonl \
--iterations 10 \
--use_feedback \
--use_process_feedback \
--use_openai
| Option | Type | Default | Description |
|---|---|---|---|
| `--dataset` | str | "math" | Dataset to evaluate (see supported datasets below) |
| `--agent_model` | str | "meta-llama/Meta-Llama-3-8B-Instruct" | Model name for vLLM server |
| `--write_file` | str | "output_arc.jsonl" | Output file path |
| `--base_url` | str | "http://c004" | vLLM server base URL |
| `--ports` | str | "1233_1234_1235_1236" | Underscore-separated server ports |
| `--temperature` | float | 0.0 | Sampling temperature |
| `--iterations` | int | 10 | Number of feedback iterations |
| `--proportion` | float | 1.0 | Fraction of dataset to use (0-1) |
| `--use_feedback` | flag | False | Enable self-generated feedback |
| `--use_process_feedback` | flag | False | Enable process-level feedback |
| `--use_openai` | flag | False | Use OpenAI for feedback generation |
| `--shuffle` | flag | False | Shuffle MCQ answer choices between iterations |
| `--binary_hint` | flag | False | Provide hints about previous incorrect choices |
| `--in_temp` | flag | False | Increase temperature each iteration |
| `--best_of_n` | flag | False | Enable best-of-n sampling per round |
| `--logprobs` | int | None | Number of log probabilities to return |
- MMLU: Massive Multitask Language Understanding
- MMLU-Pro: Enhanced version of MMLU
- GPQA: Graduate-level Google-Proof Q&A
- MATH: MATH-500 mathematical reasoning
- AIME 2024: American Invitational Mathematics Examination
- TriviaQA: Trivia question answering
- PopQA: Popular question answering
- Custom Simple: 5-digit decimal multiplication
- Hex: 5-digit hexadecimal multiplication
Deprecated: GSM8K, GSM8K-Symbolic (no longer supported)
- Binary feedback: Provides only correct/incorrect signals after each attempt.
- Self-generated feedback: The model generates its own reflective feedback about errors.
- Process-level feedback: Includes detailed reasoning process in feedback generation.
- Strong-model feedback: Uses OpenAI's models to generate high-quality feedback.
Results are saved as JSONL files with the following fields:
- question: Complete interaction history with original question
- normalized_answer: Ground truth answer
- normalized_prediction: Extracted model prediction
- full_response: Raw model response for current iteration
- feedback: Generated feedback (if feedback is enabled)
- response_probs: Average log probability per token
- is_correct: Whether current iteration is correct
- iteration: Current iteration number (starting from 0)
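For a quick summary, a results file can be tallied per iteration with a few lines of Python. This is only a sketch: it assumes each iteration of each question is written as its own JSON line with the `iteration` and `is_correct` fields above, and `gpqa_log.jsonl` is just an example path.

```python
import json
from collections import defaultdict

# Tally accuracy per feedback iteration from a results file
# (assumes one JSON object per line with "iteration" and "is_correct" fields).
correct = defaultdict(int)
total = defaultdict(int)
with open("gpqa_log.jsonl") as f:
    for line in f:
        record = json.loads(line)
        it = record["iteration"]
        total[it] += 1
        correct[it] += int(record["is_correct"])

for it in sorted(total):
    print(f"iteration {it}: {correct[it] / total[it]:.3f} accuracy ({total[it]} examples)")
```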
- `openai_async_process.py`: Main experiment runner with multiple sampling and best-of-n logic
- `utils.py`: Core utilities and dataset handling
- `error_analysis.py`: Feedback-based iterative improvement system (requires OpenAI API)
- `oracle_beam_search.py`: Oracle upper bound evaluation via large beam search sampling
- `digit_multiplication/`: Specialized digit multiplication modules
  - `decimal.py`: 5-6 digit decimal multiplication with step-by-step distributive property hints
  - `hexadecimal.py`: 5-6 digit hexadecimal multiplication with base-16 step-by-step explanations
- `start_server.sh`: Unified vLLM server startup script for both 70B and 405B models
The `oracle_beam_search.py` script provides an upper bound performance estimate by generating many responses per question and checking if any are correct. This helps evaluate the theoretical maximum accuracy achievable with larger beam sizes.
Usage:
python oracle_beam_search.py \
--dataset math \
--agent_model meta-llama/Llama-3.1-70B-Instruct \
--attempts 10 \
--gens 10 \
--write_file oracle_results.jsonl
This generates attempts × gens total responses per question (100 in the example above) and reports the percentage of questions with at least one correct answer.
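The oracle metric itself is straightforward: a question counts as solved if at least one of its sampled responses is correct. A minimal, self-contained illustration of that computation (independent of the actual output schema of `oracle_beam_search.py`):

```python
from typing import List

def oracle_accuracy(flags_per_question: List[List[bool]]) -> float:
    """Fraction of questions with at least one correct sample among attempts x gens responses."""
    solved = sum(any(flags) for flags in flags_per_question)
    return solved / len(flags_per_question)

# Example: 3 questions with 4 samples each; 2 of 3 have at least one correct sample.
print(oracle_accuracy([
    [False, True, False, False],
    [False, False, False, False],
    [True, True, False, True],
]))  # 0.666...
```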
The framework includes specialized datasets for testing arithmetic reasoning:
- Purpose: Test systematic arithmetic reasoning with 5-6 digit numbers
- Method: Uses distributive property breakdown (e.g., 12345 = 12000 + 345)
- Hints: Step-by-step partial product computation and summation
- Purpose: Test base-16 arithmetic reasoning with numbers that look decimal but are interpreted in hex
- Method: Digit-by-digit multiplication in base 16 with proper carry handling
- Verification: Automatically validates against built-in hex arithmetic
Both datasets help evaluate whether models can follow systematic computational procedures rather than relying on memorized arithmetic facts.
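For intuition, the procedures these datasets target can be reproduced in a few lines of Python. The breakdown below mirrors the distributive-property hint style described above, and the hex check mirrors validation via built-in base-16 arithmetic; this is an illustration, not the actual code in `digit_multiplication/`.

```python
# Distributive-property breakdown for decimal multiplication,
# e.g. 12345 * 6789 = (12000 + 345) * 6789 = 12000*6789 + 345*6789.
a, b = 12345, 6789
a_high, a_low = (a // 1000) * 1000, a % 1000        # 12000 and 345
partial_products = [a_high * b, a_low * b]
assert sum(partial_products) == a * b
print(partial_products, sum(partial_products))       # [81468000, 2342205] 83810205

# Hexadecimal multiplication: digit strings that "look decimal" but are read in base 16.
x_hex, y_hex = "12345", "67890"
product = int(x_hex, 16) * int(y_hex, 16)
print(hex(product))                                  # ground truth for the hex dataset
```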
If you use this repo, please cite the original paper:
@article{feedback_friction_2025,
title={FEEDBACK FRICTION: LLMs Struggle to Fully Incorporate External Feedback},
author={Dongwei Jiang and Alvin Zhang and Andrew Wang and Nicholas Andrews and Daniel Khashabi},
journal={arXiv preprint arXiv:2506.11930},
year={2025}
}