
Add embeddings to datasets? #85

Description

@cleong110

One thing I have done a number of times, manually:

  1. Download a video dataset such as ASL Citizen, usually directly from the source so I have the .mp4 files rather than going through this library.
  2. Run pose estimation on all of them: foo1.mp4, foo2.mp4, and so on.
  3. Put those through SignCLIP, saving off the embeddings as foo1-embedded-using-asl-citizen-model.npy, foo1-embedded-using-sem-lex-model.npy, etc. (see the sketch after this list).
  4. Back up those files somewhere.
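
For concreteness, a minimal sketch of steps 3 and 4 as I do them by hand today. embed_pose() is a hypothetical stand-in for the actual SignCLIP inference call, and the pose files are assumed to already exist from step 2:

from pathlib import Path
import numpy as np

def embed_pose(pose_path: Path, model_name: str) -> np.ndarray:
    # Hypothetical stand-in for running one pose file through a SignCLIP checkpoint;
    # the real entry point is whatever the SignCLIP codebase exposes.
    raise NotImplementedError

def save_embeddings(pose_dir: Path, out_dir: Path, model_name: str) -> None:
    # Step 3: embed every pose file. Step 4: write foo1-embedded-using-<model>.npy
    # files that can be backed up and re-attached to the dataset later.
    out_dir.mkdir(parents=True, exist_ok=True)
    for pose_path in sorted(pose_dir.glob("*.pose")):
        embedding = embed_pose(pose_path, model_name)
        np.save(out_dir / f"{pose_path.stem}-embedded-using-{model_name}.npy", embedding)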

It would be nice to have a consistent, documented way to bring all this into the sign-language-datasets ecosystem. Is there a standardized way to save the embeddings, load them back in, and so on?

Perhaps something like...

import tensorflow_datasets as tfds

ds = tfds.load("asl-citizen")

# if they're hosted somewhere and the dataloader knows it
ds_with_embeddings = tfds.load("asl-citizen", embeddings="signclip_asl_citizen") 

# if they're hosted locally
ds_with_embeddings = tfds.load("asl-citizen", embeddings="/path/to/folder/with/embeddings") 

See also: https://www.tensorflow.org/datasets/catalog/sift1m, a TFDS dataset that ships with pretrained embeddings.

See also: https://www.tensorflow.org/datasets/catalog/laion400m
