-
Notifications
You must be signed in to change notification settings - Fork 35
Combine GPTHuggingfaceDatasetConfig input sources into source_schema
#255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Combine GPTHuggingfaceDatasetConfig input sources into source_schema
#255
Conversation
…nfig_for_multi_source
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one minor suggestion, otherwise LGTM!
Co-authored-by: Torsten Scholak <[email protected]>
Co-authored-by: Torsten Scholak <[email protected]>
…com:ServiceNow/Fast-LLM into restructure_dataset_config_for_multi_source
Co-authored-by: Torsten Scholak <[email protected]>
…com:ServiceNow/Fast-LLM into restructure_dataset_config_for_multi_source
data_source
source_schema
|
||
if self._loss_masking_spans_column is not None: | ||
if self._loss_masking_spans_column not in dataset.column_names: | ||
raise ValueError(f"Dataset does not have spans field '{self._loss_masking_spans_column}'.") | ||
tokenize_fn = self._tokenize_batch_with_spans | ||
elif self._config.dataset.chosen_text is not None and self._config.dataset.rejected_text is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @tobyzl2 can pls make sure the DPO conditions are properly met.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @nitsanluke, sharing a few lines here for checking DPO conditions. Essentially we want to ensure that these 3 are met
- If loss masking spans (SFT) are already enabled, preference spans (chosen/rejected) should not also be enabled.
- Chosen and rejected spans should either be both be specified or neither should be specified
- If both chosen and rejected are specified, make sure that they are within the dataset columns
Fast-LLM/fast_llm/data/preparator/gpt_memmap/prepare.py
Lines 293 to 298 in b602030
if self._config.dataset.loss_masking_spans is not None and ( | |
self._config.dataset.chosen_text is not None or self._config.dataset.rejected_text is not None | |
): | |
raise ValueError(f"Can not enable both loss masking spans and chosen/rejected loss masking spans.") | |
if (self._config.dataset.chosen_text is None) != (self._config.dataset.rejected_text is None): | |
raise ValueError(f"Both chosen and rejected loss masking spans must be specified if one is specified.") |
Fast-LLM/fast_llm/data/preparator/gpt_memmap/prepare.py
Lines 305 to 309 in b602030
elif self._config.dataset.chosen_text is not None and self._config.dataset.rejected_text is not None: | |
if self._config.dataset.chosen_text not in dataset.column_names: | |
raise ValueError(f"Dataset does not have chosen spans field '{self._config.dataset.chosen_text}'.") | |
if self._config.dataset.rejected_text not in dataset.column_names: | |
raise ValueError(f"Dataset does not have rejected spans field '{self._config.dataset.rejected_text}'.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tobyzl2! I'm adding the checks back. There is some reduandancy on self._config.dataset.loss_masking_spans is not None
but will leave it as is.
Sample config for the default text column tokenizing
|
Is this ready to merge? |
✨ Description
This PR creates a common interface for all
GPTHuggingfaceDatasetConfig
input columns via the newsource_schema
variable. Beyond the variablefiled
we require additional keys to preprocess and tokenize different types of datasets. (eg: SFT, combine cols, etc).Therefore we have created a new variable
source_schema
which can accommodate these different data sources specific preprocessing and tokenization. Current variablesfield
andloss_masking_spans
are moved intoTextColumnConfig
as a type of input/data source.Merge after #245
🔍 Type of change
Select all that apply:
📝 Changes
List the key changes introduced in this PR:
✅ Checklist
Make sure the following tasks are completed before submitting the PR:
General
Dependencies and Configuration
Testing