Sem-Lex has duplicate metadata: deduplicate or include additional column?

## Use-Case: Recover original metadata from dataset

I wanted to use the dataloader to conveniently access the metadata and thus associate videos with signCLIP embeddings. However, as noted in the dataloader, Sem-Lex's metadata does not have unique video IDs. It turns out that a few thousand metadata rows are duplicated, and those duplicates are also included in the dataloader.

### 2974 duplicates in the metadata
There are 91149-88175 = 2974 repeated IDs.
```
# use cut to take only column 2 (video_id)
cut -d "," semlex_metadata.csv -f2|head -n 3
video_id
uhdBQ9cLSPTCAkvOj6ko
vw74HcbvAlKFkp8et5fH

# count values
cut -d "," semlex_metadata.csv -f2|wc -l
91149

# count unique values (sort, then find unique lines, then count lines)
cut -d "," semlex_metadata.csv -f2|sort|uniq|wc -l
88175
```

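In fact, entire rows are repeated, even when all fields other than the leading index column are included. The count below implies 91149-89498 = 1651 rows that are exact copies of another row.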

```
# look at columns 2-50, the first 3 lines only
cut -d "," semlex_metadata.csv -f2-50|head -n 3
video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,
vw74HcbvAlKFkp8et5fH,62,962.0,train,asllex,analyze,v,im,Fully Open,1.0,1.0,0.0,Closed,0.0,Symmetrical Or Alternating,Curved,1.0,Neutral,Neutral,Neutral,1.0,v,0.0,812.0

# take columns 2 through 50, sort and take unique values and count
cut -d "," semlex_metadata.csv -f2-50|sort|uniq|wc -l
89498
```
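
The same two counts can be reproduced with Python's `csv` module, which parses quoted fields properly, whereas `cut -d ","` splits on every comma. This is only a minimal sketch: the inline sample is made up, and `count_unique` is a hypothetical helper name, not something from the dataloader. Unlike `wc -l`, it does not count the header line.

```python
import csv
import io

# Made-up sample shaped like semlex_metadata.csv:
# an unnamed index column first, then video_id and other fields.
SAMPLE = """,video_id,signer_id,label
0,aaa,42,ran
1,bbb,62,analyze
0,aaa,42,ran
2,bbb,62,analyse
"""

def count_unique(rows, columns):
    """Return (number of rows, number of distinct value tuples) for the columns."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    return len(keys), len(set(keys))

reader = csv.DictReader(io.StringIO(SAMPLE))
rows = list(reader)
# The unnamed index column has the empty string as its header.
all_but_index = [c for c in reader.fieldnames if c != ""]

total, unique_ids = count_unique(rows, ["video_id"])
_, unique_rows = count_unique(rows, all_but_index)
print(total - unique_ids)   # rows whose video_id repeats an earlier one
print(total - unique_rows)  # rows identical to an earlier one, index aside
```

To run it on the real file, replace `io.StringIO(SAMPLE)` with `open("semlex_metadata.csv", newline="")`.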

### 2973 Duplicates in the dataloader?

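Calling `tfds.load` with `with_info=True` gives the number of examples per split. The snippet below prints 91148, which matches the 91149 lines of the CSV minus its header line, so the duplicates appear to be carried over into the dataloader.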

```
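import tensorflow_datasets as tfds
import sign_language_datasets.datasets  # registers the datasets with tfds
from sign_language_datasets.datasets.config import SignDatasetConfig

config = SignDatasetConfig(name="only_annotations", include_pose=None, include_video=False)
dataset, info = tfds.load("SemLex", builder_kwargs=dict(config=config), with_info=True)
total_count = sum(info.splits[split].num_examples for split in info.splits)
print(total_count)  # 91148 when this issue was filed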
```
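
To confirm that the loaded examples really contain repeated IDs, and not just the same total, something like the following could be used. It is a sketch under assumptions: `repeated_ids` is a hypothetical helper, the feature name `"id"` is a guess at how the dataloader exposes the video ID, and the examples here are made-up stand-ins, not real dataset rows.

```python
from collections import Counter

def repeated_ids(examples, key="id"):
    """Map each ID that occurs more than once to its occurrence count."""
    counts = Counter(example[key] for example in examples)
    return {value: n for value, n in counts.items() if n > 1}

# Stand-in for examples yielded by the dataloader (made-up values).
fake_examples = [{"id": "vidA"}, {"id": "vidB"}, {"id": "vidA"}]

repeats = repeated_ids(fake_examples)
print(repeats)                                # {'vidA': 2}
print(sum(n - 1 for n in repeats.values()))   # 1 extra copy
```

With the real dataset, the examples could come from iterating `tfds.as_numpy(dataset["train"])`, adjusting `key` to whatever the feature is actually called.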

### The metadata appears to be a Pandas dataframe, with ~~unique~~ nonunique IDs

We can see that the original CSV has an unnamed first column containing only numbers. That is characteristic of [pd.to_csv()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html), which writes the DataFrame index as the first column by default.
```
head -n 2 semlex_metadata.csv 
,video_id,signer_id,duration,split,label_type,label,Handshape,Selected Fingers,Flexion,Flexion Change,Spread,Spread Change,Thumb Position,Thumb Contact,Sign Type,Path Movement,Repeated Movement,Major Location,Minor Location,Second Minor Location,Contact,Nondominant Handshape,Wrist Twist,SignBank Reference ID
85133,uhdBQ9cLSPTCAkvOj6ko,42,1105.0,train,freetext,ran,,,,,,,,,,,,,,,,,,
```

However, these index values are not unique either:
```
# the first few look like this
cut -d "," semlex_metadata.csv -f1|head -n 3

85133
52435

# sort and take unique values and count
cut -d "," semlex_metadata.csv -f1|sort|uniq|wc -l
78295
```

### The combination of pandas id and video_id IS unique
```
# take only the pandas id and the video id, then count unique values
cut -d "," semlex_metadata.csv -f1,2|sort|uniq|wc -l
91149
```

## Suggestions:
* Combine the pandas ID and the video ID so that the exact items in the CSV can be recovered from the dataset
* Deduplicate? 

One thing to confirm is whether the .npy files themselves are redundant. 
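
Both suggestions can be sketched in a few lines of Python. Everything here is illustrative: the sample rows are made up, the `index_videoid` key format is a hypothetical choice, and `composite_key` and `deduplicate` are not existing dataloader functions.

```python
import csv
import io

# Made-up rows: both the index column and video_id repeat,
# but each (index, video_id) pair occurs only once.
SAMPLE = """,video_id,label
10,vidA,ran
11,vidB,analyze
10,vidB,analyze
12,vidA,ran
"""

def composite_key(row):
    """Join the index column and video_id into one identifier (hypothetical format)."""
    return f"{row['']}_{row['video_id']}"

def deduplicate(rows):
    """Keep the first row for each distinct content, ignoring the index column."""
    seen = set()
    kept = []
    for row in rows:
        content = tuple(v for k, v in row.items() if k != "")
        if content not in seen:
            seen.add(content)
            kept.append(row)
    return kept

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
keys = [composite_key(r) for r in rows]
unique_rows = deduplicate(rows)

print(len(keys) == len(set(keys)))  # True: the composite key is unique here
print(len(unique_rows))             # 2 rows remain after deduplication
```

The composite key keeps every CSV row recoverable from the dataset, while deduplication discards the index values of the dropped copies, so the two options are not equivalent.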
