New Preprocessing Feature - Deduplication [Request] #4448

Open
@yuvalkirstain

Is your feature request related to a problem? Please describe.
Many large datasets contain a lot of duplicates, and it has been shown that deduplicating them can lead to better performance during training and more truthful evaluation at test time.

A feature that makes it easy to deduplicate a dataset would be cool!

Describe the solution you'd like
We could let the user define a function that maps each data point to a value, and then keep only the first (or last) data point that yields each distinct value of that function.
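
To make the idea concrete, here is a minimal sketch in plain Python. The name `deduplicate` and its `key` and `keep` parameters are hypothetical and only illustrate the request; they are not an existing API of any library, and the sketch assumes the key function returns hashable values.

```python
from typing import Callable, Hashable, Iterable, List, TypeVar

T = TypeVar("T")


def deduplicate(
    rows: Iterable[T],
    key: Callable[[T], Hashable],
    keep: str = "first",
) -> List[T]:
    """Keep one row per distinct key value (hypothetical helper)."""
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    chosen = {}  # key value -> (position, row)
    for position, row in enumerate(rows):
        k = key(row)
        # "first": only record a key the first time it is seen.
        # "last": always overwrite, so the final occurrence wins.
        if keep == "last" or k not in chosen:
            chosen[k] = (position, row)

    # Return the surviving rows in their original order.
    return [row for _, row in sorted(chosen.values(), key=lambda pair: pair[0])]


# Example: treat texts as duplicates if they match after normalization.
rows = [{"text": "Hello"}, {"text": "hello "}, {"text": "World"}]
normalize = lambda r: r["text"].strip().lower()

first_kept = deduplicate(rows, normalize)
last_kept = deduplicate(rows, normalize, keep="last")
```

Taking the key as an arbitrary function means exact-match deduplication (key is the raw text) and looser variants (key is a normalized text or a hash) go through the same code path.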

Describe alternatives you've considered
The obvious alternative is to rewrite the same boilerplate every time someone wants to deduplicate a dataset.

Labels: duplicate, enhancement
