Combine GPTHuggingfaceDatasetConfig input sources into `source_schema` #255

nitsanluke · 2025-05-07T19:13:52Z

✨ Description

This PR creates a common interface for all GPTHuggingfaceDatasetConfig input columns via the new `source_schema` variable. Beyond the existing `field` variable, we need additional keys to preprocess and tokenize different types of datasets (e.g., SFT, combined columns).
`source_schema` accommodates the preprocessing and tokenization specific to each kind of data source. The current variables `field` and `loss_masking_spans` are moved into `TextColumnConfig`, which is one type of input/data source.
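
As a rough illustration of the layout described above, the sketch below models `source_schema` as a small typed config object selected by a `type` key. Only `TextColumnConfig`, `source_schema`, and the `text_column` / `input_column` keys come from this PR; the base class `SourceSchemaConfig`, the field name `loss_masking_spans_column`, the default column name, the `SCHEMA_TYPES` registry, and the `build_source_schema` helper are assumptions made for illustration and do not reproduce Fast-LLM's actual config machinery.

```python
from dataclasses import dataclass
from typing import Any, ClassVar, Optional


@dataclass
class SourceSchemaConfig:
    """Hypothetical base class for every kind of input/data source."""

    type_name: ClassVar[str] = "base"


@dataclass
class TextColumnConfig(SourceSchemaConfig):
    """A single text column, optionally paired with a loss-masking spans column."""

    type_name: ClassVar[str] = "text_column"
    # Assumed default; replaces the old top-level `field` key.
    input_column: str = "text"
    # Assumed name; replaces the old top-level `loss_masking_spans` key.
    loss_masking_spans_column: Optional[str] = None


# Hypothetical registry mapping the `type` key to a schema class.
SCHEMA_TYPES: dict[str, type[SourceSchemaConfig]] = {
    TextColumnConfig.type_name: TextColumnConfig,
}


def build_source_schema(raw: dict[str, Any]) -> SourceSchemaConfig:
    """Instantiate the schema class selected by the `type` key."""
    options = dict(raw)  # copy so the caller's dict is not mutated
    type_name = options.pop("type", TextColumnConfig.type_name)
    if type_name not in SCHEMA_TYPES:
        raise ValueError(f"Unknown source_schema type '{type_name}'.")
    return SCHEMA_TYPES[type_name](**options)


schema = build_source_schema({"type": "text_column", "input_column": "question"})
```

With this shape, a new kind of source (for example, one that combines several columns) would only need a new subclass and a registry entry, leaving the top-level dataset config unchanged.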

Merge after #245

🔍 Type of change

Select all that apply:

🐛 Bug fix (non-breaking change that addresses a specific issue)
🚀 New feature (non-breaking change that adds functionality)
⚠️ Breaking change (a change that could affect existing functionality)
📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
📝 Documentation change (updates documentation, including new content or typo fixes)
🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

List the key changes introduced in this PR:

✅ Checklist

Make sure the following tasks are completed before submitting the PR:

General

📜 I have read and followed the contributing guidelines.
🏷️ I am using a clear and descriptive PR title that summarizes the key change or feature introduced.
🎉 The functionality is complete, and I have tested the changes.
📝 I have updated the documentation if needed.
⚠️ The change does not introduce any new issues (e.g., runtime warnings, type checker errors, linting problems, unhandled edge cases).
🧩 I have commented my code, especially in hard-to-understand areas.

Dependencies and Configuration

🐋 I have updated the Docker configuration or dependencies, if applicable.
🔄 I have ensured compatibility with the existing setup after dependency changes.

Testing

🧪 I have added or updated tests to cover my changes.
✔️ New and existing tests pass locally with my changes.
🚦 I have tested these changes on GPUs and verified training stability.
🏋️ I have tested the changes on realistic training workloads, if applicable.

tscholak

one minor suggestion, otherwise LGTM!

nitsanluke · 2025-06-03T20:34:42Z

fast_llm/data/preparator/gpt_memmap/prepare.py

+
+        if self._loss_masking_spans_column is not None:
+            if self._loss_masking_spans_column not in dataset.column_names:
+                raise ValueError(f"Dataset does not have spans field '{self._loss_masking_spans_column}'.")
            tokenize_fn = self._tokenize_batch_with_spans
        elif self._config.dataset.chosen_text is not None and self._config.dataset.rejected_text is not None:


Hi @tobyzl2, can you please make sure the DPO conditions are properly met?

Hi @nitsanluke, sharing a few lines here for checking the DPO conditions. Essentially, we want to ensure that these three conditions are met:

1. If loss masking spans (SFT) are already enabled, preference spans (chosen/rejected) should not also be enabled.
2. Chosen and rejected spans should either both be specified or both be omitted.
3. If both chosen and rejected are specified, both must be present in the dataset columns.

Fast-LLM/fast_llm/data/preparator/gpt_memmap/prepare.py
Lines 293 to 298 in b602030

    if self._config.dataset.loss_masking_spans is not None and (
        self._config.dataset.chosen_text is not None or self._config.dataset.rejected_text is not None
    ):
        raise ValueError(f"Can not enable both loss masking spans and chosen/rejected loss masking spans.")
    if (self._config.dataset.chosen_text is None) != (self._config.dataset.rejected_text is None):
        raise ValueError(f"Both chosen and rejected loss masking spans must be specified if one is specified.")

Fast-LLM/fast_llm/data/preparator/gpt_memmap/prepare.py
Lines 305 to 309 in b602030

    elif self._config.dataset.chosen_text is not None and self._config.dataset.rejected_text is not None:
        if self._config.dataset.chosen_text not in dataset.column_names:
            raise ValueError(f"Dataset does not have chosen spans field '{self._config.dataset.chosen_text}'.")
        if self._config.dataset.rejected_text not in dataset.column_names:
            raise ValueError(f"Dataset does not have rejected spans field '{self._config.dataset.rejected_text}'.")

Thanks @tobyzl2! I'm adding the checks back. There is some redundancy in the `self._config.dataset.loss_masking_spans is not None` check, but I will leave it as is.
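
For reference, the three conditions discussed in this thread can be collected into one standalone function. This is a minimal sketch, not the code in `prepare.py`: the function name `validate_span_columns`, its plain-argument signature, and the returned mode strings are hypothetical, chosen so that the checks can be run without a Fast-LLM config object. The error messages are taken from the snippets quoted above.

```python
from typing import Optional, Sequence


def validate_span_columns(
    column_names: Sequence[str],
    loss_masking_spans: Optional[str] = None,
    chosen_text: Optional[str] = None,
    rejected_text: Optional[str] = None,
) -> str:
    """Check the span settings and return a (hypothetical) tokenization mode.

    Returns "sft_spans", "preference", or "plain".
    """
    # 1. SFT loss-masking spans and preference spans are mutually exclusive.
    if loss_masking_spans is not None and (chosen_text is not None or rejected_text is not None):
        raise ValueError("Can not enable both loss masking spans and chosen/rejected loss masking spans.")
    # 2. Chosen and rejected must be given together or not at all.
    if (chosen_text is None) != (rejected_text is None):
        raise ValueError("Both chosen and rejected loss masking spans must be specified if one is specified.")
    # 3. Every configured column must exist in the dataset.
    if loss_masking_spans is not None:
        if loss_masking_spans not in column_names:
            raise ValueError(f"Dataset does not have spans field '{loss_masking_spans}'.")
        return "sft_spans"
    if chosen_text is not None:  # rejected_text is also set, guaranteed by check 2
        if chosen_text not in column_names:
            raise ValueError(f"Dataset does not have chosen spans field '{chosen_text}'.")
        if rejected_text not in column_names:
            raise ValueError(f"Dataset does not have rejected spans field '{rejected_text}'.")
        return "preference"
    return "plain"


mode = validate_span_columns(["prompt", "chosen", "rejected"], chosen_text="chosen", rejected_text="rejected")
```

Running the exclusivity and pairing checks before any column lookup means a misconfigured run fails on the configuration error first, rather than on a missing-column message that hides the real problem.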

nitsanluke · 2025-06-03T20:38:04Z

Sample config for the default text column tokenizing

loading_workers: 1
tokenize_workers: 1
saving_workers: 1
output_path: ./test_output/cpt_test_gsm8k
dataset:
  path: openai/gsm8k
  config_name: main
  split: train
  trust_remote_code: true
  source_schema:
    type: text_column
    input_column: question


tokenizer:
  path: /mnt/checkpoints/upstream/Mistral-Nemo-Base-2407/
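
For configs written before this change, the mapping onto the new layout can be sketched as follows. This is a hypothetical helper, not part of Fast-LLM: it assumes the old top-level keys were `field` and `loss_masking_spans` (as named in the PR description), that the new text column key is `input_column` (as in the sample above), and that the spans column key is `loss_masking_spans_column` (an assumed name).

```python
from typing import Any

# Old top-level keys that the PR description says moved into the text column schema.
_MOVED_KEYS = ("field", "loss_masking_spans")


def migrate_dataset_config(old: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of `old` with the moved keys nested under `source_schema`."""
    if "source_schema" in old:
        # Already in the new layout; return a shallow copy unchanged.
        return dict(old)
    new = {key: value for key, value in old.items() if key not in _MOVED_KEYS}
    schema: dict[str, Any] = {"type": "text_column"}
    if "field" in old:
        schema["input_column"] = old["field"]
    if old.get("loss_masking_spans") is not None:
        # Assumed key name for the spans column in the new layout.
        schema["loss_masking_spans_column"] = old["loss_masking_spans"]
    new["source_schema"] = schema
    return new


old_config = {"path": "openai/gsm8k", "config_name": "main", "split": "train", "field": "question"}
new_config = migrate_dataset_config(old_config)
```

Applied to the `dataset` section of an old-style config, this yields the same `source_schema` block as the sample above.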

jlamypoirier · 2025-06-12T16:03:29Z

Is this ready to merge?

jlamypoirier and others added 10 commits April 30, 2025 13:05

Generalize config classes

4b606b0

cli

4a67660

Merge branch 'main' into generalize_dynamic_classes

531f67d

misc

1823407

stuff

fe7acd9

combine data source inputs to data_source

94e56e1

Merge remote-tracking branch 'origin/main' into generalize_dynamic_classes

bee7a4b

stuff

d41be60

Merge branch 'generalize_dynamic_classes' into restructure_dataset_config_for_multi_source

6a30d76

fixes

ec35a50

tscholak approved these changes May 8, 2025

fast_llm/data/preparator/gpt_memmap/config.py (outdated, resolved)

tscholak reviewed May 8, 2025

fast_llm/data/preparator/gpt_memmap/config.py (resolved)

nitsanluke and others added 9 commits May 8, 2025 15:27

Update fast_llm/data/preparator/gpt_memmap/config.py

1dab7de

Co-authored-by: Torsten Scholak <[email protected]>

Update fast_llm/data/preparator/gpt_memmap/config.py

36b42b9

Co-authored-by: Torsten Scholak <[email protected]>

merge

c6876ac

Merge branch 'restructure_dataset_config_for_multi_source' of github.com:ServiceNow/Fast-LLM into restructure_dataset_config_for_multi_source

eadd49a

Update fast_llm/data/preparator/gpt_memmap/config.py

a5b06d8

Co-authored-by: Torsten Scholak <[email protected]>

Merge branch 'restructure_dataset_config_for_multi_source' of github.com:ServiceNow/Fast-LLM into restructure_dataset_config_for_multi_source

272c63f

remove duplicate

cbcde98

name change

694181f

adding ClassVar type

fdf44d3

nitsanluke mentioned this pull request May 8, 2025

Concat prompt and completion cols for tokenizing #257

Merged

14 tasks

sohamparikh reviewed May 8, 2025

fast_llm/data/preparator/gpt_memmap/prepare.py (outdated, resolved)

nitsanluke changed the title ~~Combine GPTHuggingfaceDatasetConfig input sources into data_source~~ Combine GPTHuggingfaceDatasetConfig input sources into source_schema May 9, 2025

jlamypoirier reviewed May 9, 2025

fast_llm/data/preparator/gpt_memmap/config.py (outdated, resolved)

tobyzl2 mentioned this pull request May 12, 2025

DPO #223

Merged

25 tasks

sohamparikh mentioned this pull request May 12, 2025

Support chat template in prepare #262

Open

4 tasks

nitsanluke added 3 commits May 14, 2025 15:00

rename to _text_column

0909768

remove default_factory for source_schema

1a6b78b

minor comment

662f318

Merge branch 'main' into restructure_dataset_config_for_multi_source

8457540

nitsanluke changed the base branch from generalize_dynamic_classes to main June 3, 2025 18:32

nitsanluke added 2 commits June 3, 2025 18:36

reset to main

0ce7571

Megatorn-LM reset to main

bc09402

nitsanluke requested a review from jlamypoirier June 3, 2025 20:30

nitsanluke marked this pull request as ready for review June 3, 2025 20:30

remvoe comment

62bdeee

nitsanluke commented Jun 3, 2025

nitsanluke requested a review from sohamparikh June 3, 2025 20:35

update error msg

28f48e1

jlamypoirier approved these changes Jun 4, 2025

nitsanluke added 3 commits June 16, 2025 14:03

Merge branch 'main' into restructure_dataset_config_for_multi_source

7d5bb2f

include checks for error msgs

51f88be

Merge branch 'main' into restructure_dataset_config_for_multi_source

48884e6

nitsanluke merged commit d9bb084 into main Jun 16, 2025
4 checks passed

nitsanluke deleted the restructure_dataset_config_for_multi_source branch June 16, 2025 14:47
