Skip to content

Add EMNIST Dataset for Image Classification (issue #129) #152

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 34 commits into from
Jun 15, 2025

Conversation

DerrickUnleashed
Copy link
Contributor

This PR adds support for the EMNIST dataset’s Balanced split, extending the existing MNIST datasets.

  • Uses the existing download and caching infrastructure for consistency and efficiency.
  • Supports familiar parameters aligned with MNIST datasets.
  • Includes comprehensive tests modeled after MNIST test suites.
  • Updates documentation with examples and references.

@DerrickUnleashed DerrickUnleashed changed the title Add EMNIST Dataset (Balanced Split) for Image Classification Add EMNIST Dataset (Balanced Split) for Image Classification (issue #129) Jun 3, 2025
@cregouby
Copy link
Collaborator

cregouby commented Jun 4, 2025

praise : Nice addition, thanks a lot @DerrickUnleashed , and congratulation for this first contribution !
question blocking : can we switch the parameter train=TRUE into split = c("byclass", "bymerge", "balanced", "letters", "digits", "mnist") ?
todo missing : additions to NAMESPACE and the man/emnist_dataset.Rd are missing from the P.R. Both should have been produce by running devtools::document(). Please run that function to have those two files in.
todo missing : Please add an entry in the NEWS.md file for this dataset
todo : Please add your name to the Authors@R list with role = c("ctb") in the DESCRIPTION file. We are so proud to have you as contributor !

@DerrickUnleashed
Copy link
Contributor Author

Thanks a lot for the warm welcome!
I've made all the requested changes:

  • Updated the parameter to use split = c(...) instead of train = TRUE.
  • Ran devtools::document() to update the NAMESPACE and man files.
  • Added an entry in NEWS.md for the EMNIST dataset.
  • Added myself to the Authors@R with role = c("ctb").

Let me know if anything else needs tweaking!

@cregouby
Copy link
Collaborator

cregouby commented Jun 4, 2025

Your P.R currently gives me

── Failure (test-dataset-mnist.R:74:5): tests for the emnist dataset ───────────
`ds <- emnist_dataset(dir, split = split)` did not throw an error.

── Failure (test-dataset-mnist.R:74:5): tests for the emnist dataset ───────────
`ds <- emnist_dataset(dir, split = split)` did not throw an error.

── Failure (test-dataset-mnist.R:74:5): tests for the emnist dataset ───────────
`ds <- emnist_dataset(dir, split = split)` did not throw an error.

── Failure (test-dataset-mnist.R:74:5): tests for the emnist dataset ───────────
`ds <- emnist_dataset(dir, split = split)` did not throw an error.

── Failure (test-dataset-mnist.R:74:5): tests for the emnist dataset ───────────
`ds <- emnist_dataset(dir, split = split)` did not throw an error.
[ FAIL 5 | WARN 0 | SKIP 0 | PASS 99 ]

Test complete

could you fix that ?

And also inline comments.

@DerrickUnleashed
Copy link
Contributor Author

Thanks for the clarification!

  • I've fixed the test so that emnist_dataset(dir, split = split) now throws as expected when the dataset hasn't been downloaded.

  • I've also removed the inline comments from the test file as requested.

Let me know if there's anything else you'd like me to update!

@cregouby
Copy link
Collaborator

cregouby commented Jun 5, 2025

Thanks !
Can you please resolve the merge conflicts with main that are mentioned in here ?

@DerrickUnleashed
Copy link
Contributor Author

Merge conflicts with main have been resolved successfully. Let me know if anything else needs attention.

@cregouby
Copy link
Collaborator

cregouby commented Jun 5, 2025

As you change Roxygen2 content, you have to re-document the project.

Plus actually, the example fails with

Error in emnist_dataset(split = "digits", download = TRUE) : 
  argument "root" is missing, with no default

Could you please fix it ?

As the download is very long, can you add an inform user message for the start of the Download ?

@DerrickUnleashed
Copy link
Contributor Author

I've made the necessary changes to address the issue with emnist_dataset(), including adding a default root and a user message to inform about the download.

However, while testing, I noticed similar issues in other dataset examples as well. Would you like me to go ahead and fix those too?

Error in mnist_dataset(download = TRUE) : 
  argument "root" is missing, with no default
Error in kmnist_dataset(download = TRUE) : 
  argument "root" is missing, with no default
Error in fashion_mnist_dataset(download = TRUE) : 
  argument "root" is missing, with no default

@DerrickUnleashed DerrickUnleashed changed the title Add EMNIST Dataset (Balanced Split) for Image Classification (issue #129) Add EMNIST Dataset for Image Classification (issue #129) Jun 5, 2025
@DerrickUnleashed
Copy link
Contributor Author

Just a quick FYI — when I run devtools::document(), it also generates changes in the following files:

  • transform_resize.Rd
  • transform_random_resized_crop.Rd
  • transform_random_perspective.Rd
  • transform_resized_crop.Rd
  • transform_perspective.Rd

I'm stashing these edits.

@DerrickUnleashed
Copy link
Contributor Author

DerrickUnleashed commented Jun 6, 2025

Suggestion : Should we consider adding .RData to the .gitignore file?

@DerrickUnleashed
Copy link
Contributor Author

I have also updated the default root as tempdir() to prevent the below errors

Error in mnist_dataset(download = TRUE) : 
  argument "root" is missing, with no default
Error in kmnist_dataset(download = TRUE) : 
  argument "root" is missing, with no default
Error in fashion_mnist_dataset(download = TRUE) : 
  argument "root" is missing, with no default

Copy link
Collaborator

@cregouby cregouby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

praise : Thanks for the addition. We are almost there, we just need to lower the test time and footprint.

@cregouby
Copy link
Collaborator

Sorry, but the skip_on_cran() is missing and your last commit despite the improvement removes the code coverage on 5 different values of the "split". We need those tests to be covered. Actually, whatever the doanwload= value, the archive is downloaded only once within the test making void any optimization (like to switch to download = FALSE). Could you please fix that ?

Copy link
Collaborator

@cregouby cregouby left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

todo Sorry, but the skip_on_cran() is missing
todo your last commit despite the improvement, removes the code coverage on 5 different values of the "split =". We need those tests to be covered. Actually, whatever the download= value, the archive is downloaded only once within the test, making void any optimization (like to switch to download = FALSE). Could you please fix that ?

@DerrickUnleashed
Copy link
Contributor Author

Thanks for pointing that out. I've just added back the skip_on_cran() (it was removed by mistake in my previous commit). I've also updated the tests to restore full coverage for the different split = values. The tests now ensure that the archive is only downloaded once. Let me know if anything else needs adjustment.

@DerrickUnleashed
Copy link
Contributor Author

I understood my error I have forgotten to use the skip() command after using skip_on_cran() now it has less test footprint. (has reduced from ~15 mins to ~9 mins)

@DerrickUnleashed
Copy link
Contributor Author

@cregouby I have made the requested changes...Is there anything else that needs to be updated ?

@cregouby cregouby merged commit 52919a1 into mlverse:main Jun 15, 2025
3 checks passed
@DerrickUnleashed DerrickUnleashed deleted the feat/emnistDataset branch June 17, 2025 12:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants