Open
Description
Is your feature request related to a problem? Please describe.
Many large datasets are full of duplications and it has been shown that deduplicating datasets can lead to better performance while training, and more truthful evaluation at test-time.
A feature that allows one to easily deduplicate a dataset can be cool!
Describe the solution you'd like
We can define a function and keep only the first/last data-point that yields the value according to this function.
Describe alternatives you've considered
The clear alternative is to repeat a clear boilerplate every time someone want to deduplicate a dataset.