Sem-Lex has duplicate metadata: deduplicate or include additional column? #83

@cleong110

Use-Case: Recover original metadata from dataset

I wanted to use the dataloader to conveniently access the metadata and thus associate videos with signCLIP embeddings. However, as noted in the dataloader, Sem-Lex's metadata does not have unique video IDs. It turns out that much of the metadata is actually duplicated, and those duplicates are also included in the dataloader.

2974 duplicates in the metadata

There are 91149-88175 = 2974 repeated IDs.

# use cut to take only column 2 (video_id)
cut -d "," semlex_metadata.csv -f2|head -n 3
video_id
uhdBQ9cLSPTCAkvOj6ko
vw74HcbvAlKFkp8et5fH

# count values
cut -d "," semlex_metadata.csv -f2|wc -l
91149

# count unique values (sort, then find unique lines, then count lines)
cut -d "," semlex_metadata.csv -f2|sort|uniq|wc -l
88175
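The same count can be sketched in Python with the standard csv module, which, unlike cut, also copes with quoted fields that contain commas. This is a minimal sketch: the function name count_repeated_ids is hypothetical, and the sample rows are made-up stand-ins for semlex_metadata.csv, not real data.

```python
import csv
import io
from collections import Counter


def count_repeated_ids(csv_file, column="video_id"):
    """Return (total_rows, unique_ids, repeated_rows) for one CSV column."""
    reader = csv.DictReader(csv_file)  # the header row is consumed, not counted
    counts = Counter(row[column] for row in reader)
    total = sum(counts.values())
    unique = len(counts)
    return total, unique, total - unique


# Toy stand-in for semlex_metadata.csv (made-up rows, not real data).
sample = io.StringIO(
    ",video_id,label\n"
    "0,aaa,ran\n"
    "1,bbb,analyze\n"
    "2,aaa,ran\n"
)
print(count_repeated_ids(sample))  # (3, 2, 1)
```

On the real file, the function would be called with an open file handle, for example open("semlex_metadata.csv", newline=""). Because the header is not counted, the first two numbers would each be one lower than the wc -l figures above, while the difference stays the same.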

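Entire rows are repeated as well, not just IDs. Columns 2 onward yield only 89498 distinct lines (see below), so 91149-89498 = 1651 lines are exact copies of another row. The remaining 2974-1651 = 1323 repeats share a video_id but differ in at least one other field.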

# look at columns 2-50, the first 3 lines only
cut -d "," semlex_metadata.csv -f2-50|head -n 3
video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,
vw74HcbvAlKFkp8et5fH,62,962.0,train,asllex,analyze,v,im,Fully Open,1.0,1.0,0.0,Closed,0.0,Symmetrical Or Alternating,Curved,1.0,Neutral,Neutral,Neutral,1.0,v,0.0,812.0

# take columns 2 through 50, sort and take unique values and count
cut -d "," semlex_metadata.csv -f2-50|sort|uniq|wc -l
89498
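To tell exact copies apart from rows that share a video_id but disagree elsewhere, something like the following could be used. It is a sketch under assumptions: classify_repeats is a hypothetical helper, the unnamed first column is treated as an index and ignored, and the sample rows are invented.

```python
import csv
import io
from collections import defaultdict


def classify_repeats(csv_file, id_column="video_id", index_column=""):
    """Split repeated IDs into exact copies and IDs whose rows disagree."""
    distinct_rows = defaultdict(set)  # id -> set of distinct row contents
    counts = defaultdict(int)         # id -> number of rows with that id
    for row in csv.DictReader(csv_file):
        row.pop(index_column, None)   # ignore the unnamed index column
        distinct_rows[row[id_column]].add(tuple(sorted(row.items())))
        counts[row[id_column]] += 1
    exact = sorted(i for i, n in counts.items()
                   if n > 1 and len(distinct_rows[i]) == 1)
    conflicting = sorted(i for i, rows in distinct_rows.items() if len(rows) > 1)
    return exact, conflicting


# Made-up rows: "aaa" is copied verbatim, "ccc" appears with two labels.
SAMPLE = (
    ",video_id,label\n"
    "0,aaa,ran\n"
    "1,bbb,analyze\n"
    "2,aaa,ran\n"
    "3,ccc,walk\n"
    "4,ccc,run\n"
)
exact, conflicting = classify_repeats(io.StringIO(SAMPLE))
print(exact)        # ['aaa']
print(conflicting)  # ['ccc']
```

The conflicting IDs are the ones that matter for a deduplication decision, since dropping one of their rows would discard an annotation.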

2974 duplicates in the dataloader?

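Calling tfds.load with with_info=True gives the number of examples per split. The snippet below prints 91148, which matches the 91149 lines of the CSV minus the header line, so the duplicate rows are all present in the dataset.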

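import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the datasets with tfds
from sign_language_datasets.datasets.config import SignDatasetConfig

config = SignDatasetConfig(name="only_annotations", include_pose=None, include_video=False)
dataset, info = tfds.load("SemLex", builder_kwargs=dict(config=config), with_info=True)
total_count = sum([info.splits[split].num_examples for split in info.splits])
print(total_count)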

The metadata appears to be a Pandas dataframe, with non-unique IDs

We can see that the original has a column without any name, just numbers. That is characteristic of pandas' DataFrame.to_csv(), which writes the dataframe index as an unnamed first column by default.

head -n 2 semlex_metadata.csv 
,video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
85133,uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,

However, these index values are not unique either:

# the first few look like this
cut -d "," semlex_metadata.csv -f1|head -n 3

85133
52435

# sort and take unique values and count
cut -d "," semlex_metadata.csv -f1|sort|uniq|wc -l
78295

The combination of pandas id and video_id IS unique

# take only the pandas id and the video id, then count unique values
cut -d "," semlex_metadata.csv -f1,2|sort|uniq|wc -l
91149
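The three uniqueness checks above (index alone, video_id alone, both together) can also be expressed as one small Python helper. This is only a sketch: is_unique_key is a hypothetical name, the unnamed index column is addressed by the empty string as csv.DictReader reads it, and the rows are invented.

```python
import csv
import io


def is_unique_key(csv_file, columns):
    """Return True if the given combination of columns never repeats."""
    seen = set()
    for row in csv.DictReader(csv_file):
        key = tuple(row[c] for c in columns)
        if key in seen:
            return False
        seen.add(key)
    return True


# Made-up rows: index 0 and video_id "aaa" each occur twice, never together.
SAMPLE = (
    ",video_id,label\n"
    "0,aaa,ran\n"
    "1,bbb,analyze\n"
    "0,ccc,walk\n"
    "2,aaa,ran\n"
)
for cols in ([""], ["video_id"], ["", "video_id"]):
    print(cols, is_unique_key(io.StringIO(SAMPLE), cols))
# [''] False
# ['video_id'] False
# ['', 'video_id'] True
```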

Suggestions:

  • Combine the pandas ID and the video ID so that the exact items in the CSV can be recovered from the dataset
  • Deduplicate?
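Both suggestions could look roughly like the sketch below. Everything here is an assumption for illustration: the combined_id field name, the underscore separator, and the keep-the-first-row policy are hypothetical and are not what the dataloader currently does. Keeping only the first row would silently drop any differing annotations for the same video_id.

```python
import csv
import io


def add_combined_id(csv_file, index_column="", id_column="video_id"):
    """Read rows and attach '<pandas index>_<video_id>' as combined_id."""
    rows = []
    for row in csv.DictReader(csv_file):
        row["combined_id"] = f"{row[index_column]}_{row[id_column]}"
        rows.append(row)
    return rows


def deduplicate(rows, id_column="video_id"):
    """Keep only the first row seen for each ID (an arbitrary policy)."""
    first_seen = {}
    for row in rows:
        first_seen.setdefault(row[id_column], row)
    return list(first_seen.values())


# Made-up rows standing in for semlex_metadata.csv.
SAMPLE = (
    ",video_id,label\n"
    "0,aaa,ran\n"
    "1,bbb,analyze\n"
    "0,ccc,walk\n"
    "2,aaa,ran\n"
)
rows = add_combined_id(io.StringIO(SAMPLE))
print([r["combined_id"] for r in rows])
# ['0_aaa', '1_bbb', '0_ccc', '2_aaa']
print([r["combined_id"] for r in deduplicate(rows)])
# ['0_aaa', '1_bbb', '0_ccc']
```

The first option keeps every CSV row recoverable from the dataset; the second changes the example counts, so the two are not interchangeable.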

One thing to confirm is whether the .npy files themselves are redundant.
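One way to check this would be to hash the files and group identical digests, as in the sketch below. The directory layout and the helper name group_identical_files are assumptions, and the demo writes dummy bytes to a temporary directory rather than reading real Sem-Lex files. Note that byte-identical files are a stricter test than equal arrays: two .npy files could hold the same values and still differ in their headers.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path


def group_identical_files(directory, pattern="*.npy"):
    """Return lists of file names whose contents are byte-for-byte identical."""
    groups = defaultdict(list)
    for path in sorted(Path(directory).rglob(pattern)):
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        groups[digest].append(path.name)
    return [names for names in groups.values() if len(names) > 1]


# Demo on dummy files (not real pose data).
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "a.npy").write_bytes(b"same bytes")
    (Path(tmp) / "b.npy").write_bytes(b"same bytes")
    (Path(tmp) / "c.npy").write_bytes(b"other bytes")
    print(group_identical_files(tmp))  # [['a.npy', 'b.npy']]
```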
