This repository contains the code for the prefix experiments conducted as the first part of my master's dissertation, comparing training data memorization between deep learning architectures, specifically SSMs (Structured State Space Models) and transformers. Mamba and Pythia models were chosen as representatives of the two architectures because they are comparable in size and training setup.
The experiments build on the work of Carlini et al., extending their analysis by examining how model size, input length, and data type affect memorization trends across the two architectures.
- Comparison of Model Sizes: Different model sizes were used to explore the relationship between model capacity and memorization.
- Input Length Variations: We tested different input lengths to understand the effect of context length on memorization behavior.
- Data Type: The experiments used subsets of The Pile dataset to compare how different data types affect memorization rates (a sketch of the resulting factor grid follows this list).
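To make the factor space concrete, the snippet below sketches one way the experiment grid could be enumerated in Python. The checkpoint names, prefix lengths, and Pile subsets shown are illustrative placeholders, not the exact values used in the experiments; the actual configurations live in `scripts/`.

```python
# Hypothetical factor grid; the concrete values used in the experiments
# are defined in scripts/, these are placeholders for illustration only.
from itertools import product

MODEL_PAIRS = {  # roughly size-matched Mamba / Pythia checkpoints
    "small": ("state-spaces/mamba-130m", "EleutherAI/pythia-160m"),
    "large": ("state-spaces/mamba-1.4b", "EleutherAI/pythia-1.4b"),
}
PREFIX_LENGTHS = [50, 100, 200]  # tokens of context shown to the model
PILE_SUBSETS = ["Github", "Wikipedia (en)", "PubMed Abstracts"]

CONFIGS = [
    {"size": size, "prefix_len": plen, "subset": subset}
    for size, plen, subset in product(MODEL_PAIRS, PREFIX_LENGTHS, PILE_SUBSETS)
]
```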
The code in this repository is designed to gather outputs from models and store them in CSV format for subsequent analysis. The collected data was further processed to evaluate and compare the memorization characteristics of each model.
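As a rough sketch of what one measurement step looks like, assuming Hugging Face `transformers` checkpoints: the function name and CSV columns below are illustrative, not the repository's actual API. A sample counts as memorized when greedy decoding from the prefix reproduces the true continuation exactly, following Carlini et al.'s extractability criterion.

```python
# Minimal sketch of one measurement step; function and column names are
# illustrative assumptions, not the repository's actual API.
import csv

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def measure_sample(model, tokenizer, prefix_ids, true_suffix_ids):
    """Greedy-decode a continuation and test it against the true suffix."""
    with torch.no_grad():
        output = model.generate(
            prefix_ids.unsqueeze(0),               # add batch dimension
            max_new_tokens=true_suffix_ids.shape[0],
            do_sample=False,                        # greedy decoding
        )
    generated = output[0, prefix_ids.shape[0]:]    # strip the prefix
    return {
        "prefix": tokenizer.decode(prefix_ids),
        "true_suffix": tokenizer.decode(true_suffix_ids),
        "generated": tokenizer.decode(generated),
        # Extractability criterion: exact reproduction of the continuation
        "memorized": bool(torch.equal(generated.cpu(), true_suffix_ids.cpu())),
    }

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/pythia-160m")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-160m").eval()

rows = []  # one dict per evaluated (prefix, suffix) pair
# ... fill `rows` by calling measure_sample over the sampled data ...
with open("pythia-160m_results.csv", "w", newline="") as f:
    writer = csv.DictWriter(
        f, fieldnames=["prefix", "true_suffix", "generated", "memorized"]
    )
    writer.writeheader()
    writer.writerows(rows)
```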
- `scripts/`: Scripts used to train models and collect output data.
- `setup.md`: Detailed setup instructions, including the techniques used to configure and run the scripts.
- `model_utils.py`: Streams and parses data from Hugging Face to be used as samples, calculates perplexity, and prints samples from evaluation.
- `main.py`: Loads the models, performs the prefix attack, and stores the response and output per model in a corresponding CSV.
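The description of `model_utils.py` implies streaming Pile samples from the Hugging Face hub and scoring perplexity; a minimal sketch under those assumptions follows. The dataset identifier and field names here are assumptions (a public Pile mirror), not necessarily what the repository uses; see `setup.md` for the actual data sources.

```python
# Sketch of the data/perplexity utilities implied by model_utils.py,
# assuming the public "monology/pile-uncopyrighted" mirror of The Pile;
# the actual dataset source is configured per setup.md.
import torch
from datasets import load_dataset

def stream_pile_samples(subset_name, n_samples=100):
    """Yield up to n_samples raw texts from one Pile subset, streamed."""
    # streaming=True avoids downloading the full multi-hundred-GB dataset
    ds = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
    taken = 0
    for example in ds:
        if example["meta"]["pile_set_name"] == subset_name:
            yield example["text"]
            taken += 1
            if taken >= n_samples:
                return

def perplexity(model, tokenizer, text):
    """Perplexity = exp(mean token cross-entropy) over the sample."""
    ids = tokenizer(text, return_tensors="pt", truncation=True).input_ids
    with torch.no_grad():
        # passing labels makes the model return the shifted LM loss
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()
```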
To set up the environment and replicate the experiments, please refer to `setup.md`. It provides step-by-step instructions and describes the techniques used to configure and execute the scripts.
- Python 3.9+
- Required libraries: details in `setup.md`
- Models/Dataset: download links or paths as listed in `setup.md`
- Setup Environment: Follow the instructions in `setup.md` to set up the required environment.
- Run Experiments: Use the scripts in `scripts/` to run the model comparisons.
This work builds upon the foundational research of Carlini et al., whose contributions to understanding memorization in LLMs served as a basis for these experiments.