
FEEDBACK FRICTION: LLMs Struggle to Fully Incorporate External Feedback

A research framework for evaluating how large language models incorporate different styles of feedback across multiple reasoning domains.

Overview

This project implements a unified framework to test large language models' (LLMs) ability to use different types of feedback:

  • Binary feedback: Simple correct/incorrect signals
  • Self-generated feedback: Model-generated reflective feedback
  • Strong-model feedback: External model-generated feedback

The framework runs multiple iterations of generation and refinement across various datasets (MMLU, MMLU-Pro, GPQA, MATH-500, AIME 2024, PopQA, TriviaQA, arithmetic, and hexadecimal multiplication) to measure iterative self-improvement capabilities.
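At a high level, each experiment follows a generate-check-refine loop. The sketch below is a conceptual illustration only, not the repository's implementation; generate, check_answer, and make_feedback are hypothetical placeholders for the model call, answer grading, and feedback generation.

# Conceptual sketch of the iterative refinement loop (not the actual implementation).
def generate(prompt: str) -> str:
    return "42"  # placeholder for a call to the vLLM-served model

def check_answer(prediction: str, gold: str) -> bool:
    return prediction.strip() == gold.strip()

def make_feedback(question: str, prediction: str) -> str:
    return f"The answer {prediction} is incorrect; re-examine the reasoning."

def refine(question: str, gold: str, iterations: int = 10) -> list[dict]:
    history, prompt = [], question
    for it in range(iterations):
        prediction = generate(prompt)
        correct = check_answer(prediction, gold)
        history.append({"iteration": it, "prediction": prediction, "is_correct": correct})
        if correct:
            break
        # Binary feedback by default; self-generated or strong-model feedback when enabled.
        prompt += f"\n\nPrevious answer: {prediction}\n{make_feedback(question, prediction)}"
    return history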

The logs generated by Llama-4-Maverick and the feedback model can be found here. Additional model logs will be added shortly.

Installation

Prerequisites

  • Python 3.9+
  • vLLM 0.8.3+ (for model serving)
  • OpenAI API key (optional, for strong-model feedback)

Setup

git clone https://github.com/JHU-CLSP/Feedback-Friction.git
cd Feedback-Friction
pip install vllm==0.8.3 datasets
pip install -r requirements.txt

Environment Configuration

Set your OpenAI API key if using strong-model feedback:

export OPENAI_API_KEY="your-api-key-here"

Usage

All experiments are driven by openai_async_process.py. The basic command structure is:

python openai_async_process.py \
    --dataset DATASET \
    --agent_model MODEL_NAME \
    --base_url BASE_URL \
    --ports PORT_LIST \
    --write_file OUTPUT_FILE \
    --iterations NUM_ITERATIONS \
    [FEEDBACK_OPTIONS]

Example Commands

Basic usage (binary feedback only):

python openai_async_process.py \
    --dataset gpqa \
    --agent_model meta-llama/Llama-3.3-70B-Instruct \
    --base_url http://c007 \
    --ports 1233 \
    --write_file gpqa_log.jsonl \
    --iterations 10

Self-generated feedback:

python openai_async_process.py \
    --dataset gpqa \
    --agent_model meta-llama/Llama-3.3-70B-Instruct \
    --base_url http://c007 \
    --ports 1233 \
    --write_file gpqa_log.jsonl \
    --iterations 10 \
    --use_feedback

Process-level feedback:

python openai_async_process.py \
    --dataset gpqa \
    --agent_model meta-llama/Llama-3.3-70B-Instruct \
    --base_url http://c007 \
    --ports 1233 \
    --write_file gpqa_log.jsonl \
    --iterations 10 \
    --use_feedback \
    --use_process_feedback

Strong-model feedback (requires OpenAI API key):

python openai_async_process.py \
    --dataset gpqa \
    --agent_model meta-llama/Llama-3.3-70B-Instruct \
    --base_url http://c007 \
    --ports 1233 \
    --write_file gpqa_log.jsonl \
    --iterations 10 \
    --use_feedback \
    --use_process_feedback \
    --use_openai

Configuration Options

| Option | Type | Default | Description |
| --- | --- | --- | --- |
| --dataset | str | "math" | Dataset to evaluate (see supported datasets below) |
| --agent_model | str | "meta-llama/Meta-Llama-3-8B-Instruct" | Model name for vLLM server |
| --write_file | str | "output_arc.jsonl" | Output file path |
| --base_url | str | "http://c004" | vLLM server base URL |
| --ports | str | "1233_1234_1235_1236" | Underscore-separated server ports |
| --temperature | float | 0.0 | Sampling temperature |
| --iterations | int | 10 | Number of feedback iterations |
| --proportion | float | 1.0 | Fraction of dataset to use (0-1) |
| --use_feedback | flag | False | Enable self-generated feedback |
| --use_process_feedback | flag | False | Enable process-level feedback |
| --use_openai | flag | False | Use OpenAI for feedback generation |
| --shuffle | flag | False | Shuffle MCQ answer choices between iterations |
| --binary_hint | flag | False | Provide hints about previous incorrect choices |
| --in_temp | flag | False | Increase temperature each iteration |
| --best_of_n | flag | False | Enable best-of-n sampling per round |
| --logprobs | int | None | Number of log probabilities to return |
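These flags compose freely with the feedback options. For example, the command below runs on half of GPQA with a higher temperature and shuffled answer choices; the dataset, model, server URL, and port follow the earlier GPQA examples and should be adjusted to your setup:

python openai_async_process.py \
    --dataset gpqa \
    --agent_model meta-llama/Llama-3.3-70B-Instruct \
    --base_url http://c007 \
    --ports 1233 \
    --write_file gpqa_subset_log.jsonl \
    --iterations 10 \
    --proportion 0.5 \
    --temperature 0.7 \
    --shuffle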

Supported Datasets

  • MMLU: Massive Multitask Language Understanding
  • MMLU-Pro: Enhanced version of MMLU
  • GPQA: Graduate-level Google-Proof Q&A
  • MATH: MATH-500 mathematical reasoning
  • AIME 2024: American Invitational Mathematics Examination
  • TriviaQA: Trivia question answering
  • PopQA: Popular question answering
  • Custom Simple: 5-digit decimal multiplication
  • Hex: 5-digit hexadecimal multiplication

Deprecated: GSM8K, GSM8K-Symbolic (no longer supported)

Feedback Modes

1. Binary Feedback (Default)

Provides only correct/incorrect signals after each attempt.

2. Self-Generated Feedback (--use_feedback)

The model generates its own reflective feedback about errors.

3. Process-Level Feedback (--use_feedback --use_process_feedback)

Includes detailed reasoning process in feedback generation.

4. Strong-Model Feedback (--use_feedback --use_process_feedback --use_openai)

Uses OpenAI's models to generate high-quality feedback.

Output Format

Results are saved as JSONL files with the following fields (a small analysis sketch follows the list):

  • question: Complete interaction history with original question
  • normalized_answer: Ground truth answer
  • normalized_prediction: Extracted model prediction
  • full_response: Raw model response for current iteration
  • feedback: Generated feedback (if feedback is enabled)
  • response_probs: Average log probability per token
  • is_correct: Whether current iteration is correct
  • iteration: Current iteration number (starting from 0)
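As a quick sanity check, per-iteration accuracy can be computed from these logs. A minimal sketch, assuming one JSON object per line with the is_correct and iteration fields documented above (the file name matches the earlier GPQA examples):

import json
from collections import defaultdict

correct, total = defaultdict(int), defaultdict(int)
with open("gpqa_log.jsonl") as f:
    for line in f:
        record = json.loads(line)
        total[record["iteration"]] += 1
        correct[record["iteration"]] += int(record["is_correct"])

# Accuracy among the records logged at each iteration.
for it in sorted(total):
    print(f"iteration {it}: accuracy {correct[it] / total[it]:.3f} over {total[it]} examples")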

File Structure

  • openai_async_process.py: Main experiment runner with multiple sampling and best-of-n logic
  • utils.py: Core utilities and dataset handling
  • error_analysis.py: Feedback-based iterative improvement system (requires OpenAI API)
  • oracle_beam_search.py: Oracle upper bound evaluation via large beam search sampling
  • digit_multiplication/: Specialized digit multiplication modules
    • decimal.py: 5-6 digit decimal multiplication with step-by-step distributive property hints
    • hexadecimal.py: 5-6 digit hexadecimal multiplication with base-16 step-by-step explanations
  • start_server.sh: Unified vLLM server startup script for both 70B and 405B models

Oracle Beam Search Evaluation

The oracle_beam_search.py script provides an upper bound performance estimate by generating many responses per question and checking if any are correct. This helps evaluate the theoretical maximum accuracy achievable with larger beam sizes.

Usage:

python oracle_beam_search.py \
    --dataset math \
    --agent_model meta-llama/Llama-3.1-70B-Instruct \
    --attempts 10 \
    --gens 10 \
    --write_file oracle_results.jsonl

This generates attempts × gens total responses per question (10 × 10 = 100 in the example above) and reports the percentage of questions for which at least one response is correct.
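The oracle metric itself is an "any-correct" check per question. A schematic sketch of the computation (the per_question_flags structure is hypothetical, not the script's actual output format):

# Hypothetical per-question correctness flags, one inner list per question
# (attempts x gens entries each in a real run).
per_question_flags = [
    [False, True, False],   # at least one correct -> counts as solved
    [False, False, False],  # never correct
]
oracle_accuracy = sum(any(flags) for flags in per_question_flags) / len(per_question_flags)
print(f"oracle accuracy: {oracle_accuracy:.2%}")  # 50.00%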

Digit Multiplication Datasets

The framework includes specialized datasets for testing arithmetic reasoning:

Decimal Multiplication (custom_simple)

  • Purpose: Test systematic arithmetic reasoning with 5-6 digit numbers
  • Method: Uses distributive property breakdown (e.g., 12345 = 12000 + 345); see the sketch after this list
  • Hints: Step-by-step partial product computation and summation
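For illustration, a minimal sketch of the distributive-property breakdown used in the hints (the split into thousands follows the example above; the helper name is hypothetical):

def distributive_hint(a: int, b: int) -> int:
    # Split a as in the example: 12345 = 12000 + 345.
    a_high = (a // 1000) * 1000
    a_low = a - a_high
    partial_high, partial_low = a_high * b, a_low * b
    print(f"{a} x {b} = ({a_high} + {a_low}) x {b}")
    print(f"        = {partial_high} + {partial_low}")
    print(f"        = {partial_high + partial_low}")
    return partial_high + partial_low

assert distributive_hint(12345, 67890) == 12345 * 67890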

Hexadecimal Multiplication (hex)

  • Purpose: Test base-16 arithmetic reasoning with numbers that look decimal but are interpreted in hex
  • Method: Digit-by-digit multiplication in base 16 with proper carry handling (see the sketch after this list)
  • Verification: Automatically validates against built-in hex arithmetic
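A minimal sketch of this procedure: digit-by-digit base-16 multiplication with explicit carries, checked against Python's built-in arithmetic (the function name is hypothetical, not the dataset module's API):

def hex_multiply(x_hex: str, y_hex: str) -> str:
    # Schoolbook multiplication in base 16 with explicit carry handling.
    xs = [int(d, 16) for d in reversed(x_hex)]
    ys = [int(d, 16) for d in reversed(y_hex)]
    out = [0] * (len(xs) + len(ys))
    for i, xd in enumerate(xs):
        carry = 0
        for j, yd in enumerate(ys):
            total = out[i + j] + xd * yd + carry
            out[i + j] = total % 16
            carry = total // 16
        out[i + len(ys)] += carry
    return "".join(f"{d:X}" for d in reversed(out)).lstrip("0") or "0"

# "12345" and "67890" look decimal but are interpreted as hex, as in the dataset.
assert hex_multiply("12345", "67890") == f"{0x12345 * 0x67890:X}"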

Both datasets help evaluate whether models can follow systematic computational procedures rather than relying on memorized arithmetic facts.

Citation

If you use this repo, please cite the original paper:

@article{feedback_friction_2025,
  title={FEEDBACK FRICTION: LLMs Struggle to Fully Incorporate External Feedback},
  author={Dongwei Jiang and Alvin Zhang and Andrew Wang and Nicholas Andrews and Daniel Khashabi},
  journal={arXiv preprint arXiv:2506.11930},
  year={2025}
}

About

Code for the paper FEEDBACK FRICTION: LLMs Struggle to Fully Incorporate External Feedback (https://arxiv.org/pdf/2506.11930).
