Add dataset mixer #3791

lewtun · 2025-07-28T17:33:41Z

What does this PR do?

This PR adds support for mixing datasets, configs, and splits for more effective training recipes. The usage is best demonstrated with a YAML config like the following:

dataset_mixture:
    datasets:
    - id: dataset_id_1
      config: config_name
      columns:
      - col1
      - col2
      weight: 0.5
    - id: dataset_id_2
      config: config_name
      columns:
      - col1
      - col2
      weight: 0.5
    seed: 42
    test_split_size: 0.1

The PR also introduces a get_dataset() helper that returns the dataset from either the mixture or the single-dataset setup we had previously.
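To make the dispatch idea concrete, here is a minimal sketch of how such a helper could choose between the two setups. It is not the PR's implementation: the `DataArgs` class, the `get_dataset_sketch` name, the injected loader callables, and the choice to let the mixture take precedence are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class DataArgs:
    """Hypothetical argument container; the field names are assumptions."""

    dataset_name: Optional[str] = None
    dataset_mixture: Optional[dict] = None


def get_dataset_sketch(
    args: DataArgs,
    load_single: Callable[[str], Any],
    load_mixture: Callable[[dict], Any],
) -> Any:
    """Return a dataset from the mixture if one is configured, else from the single dataset.

    Letting the mixture take precedence is an assumption of this sketch.
    """
    if args.dataset_mixture is not None:
        return load_mixture(args.dataset_mixture)
    if args.dataset_name is not None:
        return load_single(args.dataset_name)
    raise ValueError("Either dataset_mixture or dataset_name must be provided.")


# Stub loaders stand in for the real loading logic so the sketch runs offline.
single = get_dataset_sketch(
    DataArgs(dataset_name="dataset_id_1"),
    load_single=lambda name: f"single:{name}",
    load_mixture=lambda mixture: f"mixture:{len(mixture['datasets'])}",
)
mixed = get_dataset_sketch(
    DataArgs(dataset_mixture={"datasets": [{"id": "dataset_id_1"}, {"id": "dataset_id_2"}]}),
    load_single=lambda name: f"single:{name}",
    load_mixture=lambda mixture: f"mixture:{len(mixture['datasets'])}",
)
```

Passing the loaders in as callables keeps the sketch independent of any particular loading library; a training script would call only the helper and never branch on the config shape itself.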

A few questions:

I am not sure if weight is the right terminology, since it's not relative to other configs. An alternative would be frac or pct.
Where should this be documented?
Should I add the get_dataset() utility function to our training scripts, or is this feature best left for power users?
One limitation of the current implementation is that you cannot independently specify which subsets should be mixed as a train or test split (we create a separate test split if requested by the user). Do we need to support this?
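The weight semantics and the test-split limitation raised above can be illustrated with a small pure-Python sketch. It assumes that weight is the fraction of each dataset to keep, independent of the other entries, and that the test split is carved out of the mixed result. The `mix_datasets` function and the list-of-dicts data are hypothetical stand-ins, not the code in this PR.

```python
import random


def mix_datasets(datasets, weights, seed=None, test_split_size=None):
    """Keep a fraction of each dataset, concatenate, shuffle, and optionally split.

    `weights[i]` is the fraction of `datasets[i]` to keep (assumed semantics);
    the weights are not normalized against each other.
    """
    rng = random.Random(seed)
    mixed = []
    for rows, weight in zip(datasets, weights):
        if not 0.0 < weight <= 1.0:
            raise ValueError(f"weight must be in (0, 1], got {weight}")
        shuffled = rows[:]
        rng.shuffle(shuffled)
        mixed.extend(shuffled[: int(len(shuffled) * weight)])
    rng.shuffle(mixed)
    if test_split_size is None:
        return {"train": mixed}
    # The test split is taken from the already-mixed rows, so there is no way
    # to route a specific source dataset to train or test only.
    n_test = int(len(mixed) * test_split_size)
    return {"train": mixed[n_test:], "test": mixed[:n_test]}


ds1 = [{"col1": i, "source": "dataset_id_1"} for i in range(100)]
ds2 = [{"col1": i, "source": "dataset_id_2"} for i in range(40)]
splits = mix_datasets([ds1, ds2], weights=[0.5, 0.5], seed=42, test_split_size=0.1)
```

With these inputs, two weights of 0.5 keep 50 rows of the first dataset and 20 rows of the second, so the mixture is not a 50/50 blend. That is the reason a name like frac may describe the field better than weight.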

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline, Pull Request section?
Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes?
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

HuggingFaceDocBuilderDev · 2025-07-28T17:38:33Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2025-07-28T17:41:49Z

Nice! Can you replace load_dataset by get_dataset in the scripts (trl/scripts)?

lewtun added 3 commits July 28, 2025 17:08

Add dataset mixer

6165d2e

Refactor tests

1a33a44

Fix

9af3ee7

lewtun requested review from edbeeching and qgallouedec July 28, 2025 17:39
