fix: raise error in FolderBasedBuilder when data_dir and data_files are missing #7623

Merged: 3 commits, Jun 18, 2025

Conversation

ArjunJagdale
Contributor

Related Issues/PRs

Fixes #6152


What changes are proposed in this pull request?

This PR adds a dedicated validation check in the _info() method of the FolderBasedBuilder class to ensure that users provide either data_dir or data_files when loading folder-based datasets (such as audiofolder, imagefolder, etc.).
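For readers who want to see the shape of such a check, here is a minimal, self-contained sketch. It is not the code merged in this PR: `SketchConfig` and `validate_folder_config` are hypothetical stand-ins for the builder config and for the guard placed in `_info()`, and the error message wording is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a folder-based builder config."""

    data_dir: Optional[str] = None
    data_files: Optional[Sequence[str]] = None


def validate_folder_config(config: SketchConfig) -> None:
    """Raise early if neither data_dir nor data_files was provided.

    Hypothetical helper illustrating the guard described above; the
    message text is illustrative, not the library's actual wording.
    """
    if not config.data_dir and not config.data_files:
        raise ValueError(
            "Folder-based datasets require either `data_dir` or "
            "`data_files` to be specified."
        )
```

With this sketch, `validate_folder_config(SketchConfig())` raises a `ValueError`, while a config carrying either `data_dir` or `data_files` passes through unchanged.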


Why this change?

Previously, when calling:

load_dataset("audiofolder")

without specifying data_dir or data_files, the loader would silently fall back to the current working directory, leading to:

  • Long loading times
  • Unexpected behavior (e.g., scanning unrelated files)

This behavior was discussed in issue #6152. As suggested by maintainers, the fix has now been implemented directly inside the FolderBasedBuilder._info() method — keeping the logic localized to the specific builder instead of a generic loader function.
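To make the cost of the old fallback concrete, the sketch below shows how an implicit default to the current working directory can pull in unrelated files. It is an illustration only, not the library's resolution logic: `resolve_scan_root` and `count_files` are hypothetical helpers.

```python
import os
from pathlib import Path
from typing import Optional


def resolve_scan_root(data_dir: Optional[str] = None) -> Path:
    """Hypothetical helper mimicking an implicit fallback to the CWD."""
    # With no data_dir, the scan silently starts wherever the process runs.
    return Path(data_dir).resolve() if data_dir else Path.cwd()


def count_files(root: Path) -> int:
    """Count every file under root, recursively."""
    return sum(len(files) for _, _, files in os.walk(root))
```

Calling `count_files(resolve_scan_root())` from a home directory or a large repository would walk everything beneath it, which is the kind of long, surprising scan the explicit error is meant to prevent.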


How is this PR tested?

  • ✅ Manually tested by calling load_dataset("audiofolder") with no data_dir or data_files → a ValueError is now raised early.
  • ✅ Existing functionality (with valid input) remains unaffected.

Does this PR require documentation update?

  • No

Release Notes

Is this a user-facing change?

  • Yes

Folder-based datasets now raise an explicit error if neither data_dir nor data_files is specified, preventing an unintended fallback to the current working directory.


What component(s) does this PR affect?

  • area/datasets
  • area/load

How should the PR be classified?

  • rn/bug-fix - A user-facing bug fix

Should this be included in the next patch release?

  • Yes

@ArjunJagdale
Contributor Author

ArjunJagdale commented Jun 17, 2025

@lhoestq Moved the logic to FolderBasedBuilder._info() as discussed in previous PR (#7618). Let me know if anything else is needed — happy to update!

Member

@lhoestq lhoestq left a comment

lgtm :)

@lhoestq lhoestq merged commit b7819cd into huggingface:main Jun 18, 2025
10 of 14 checks passed
Successfully merging this pull request may close these issues.

FolderBase Dataset automatically resolves under current directory when data_dir is not specified
3 participants