Description
There is a dataset, https://www.kaggle.com/generall/oneshotwikilinks (~6 GB uncompressed), that has been transformed into Parquet format:

```python
import pandas as pd

df = pd.read_csv("shuffled_dedup_entities.tsv", sep='\t',
                 names=['concept', 'left', 'mention', 'right'])
df.to_parquet("/tmp/wikilinks.parquet")
```
The code shown here, https://gist.github.com/wsuchy/90fcb59d7a97e377634e096b030abfff, consumes more than 120 GB of memory (the server's maximum), even though its only task is to read the Parquet file, transform the records, and store them as TFRecords.
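For reference, the task amounts to a pipeline along these lines. This is a minimal sketch, not the gist's actual code; the output path and the `to_tf_example` encoding (all four columns stored as byte features) are my assumptions:

```python
import apache_beam as beam
import tensorflow as tf

def to_tf_example(row):
    # Hypothetical encoding: pack the four string columns into a tf.train.Example.
    def bytes_feature(value):
        return tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[value.encode('utf-8')]))
    features = {name: bytes_feature(row[name])
                for name in ('concept', 'left', 'mention', 'right')}
    example = tf.train.Example(features=tf.train.Features(feature=features))
    return example.SerializeToString()

with beam.Pipeline() as pipeline:
    (pipeline
     | 'ReadParquet' >> beam.io.ReadFromParquet('/tmp/wikilinks.parquet')
     | 'ToTFExample' >> beam.Map(to_tf_example)
     | 'WriteTFRecords' >> beam.io.WriteToTFRecord('/tmp/wikilinks.tfrecord'))
```

In principle, `ReadFromParquet` yields one record dict at a time, so nothing in a pipeline like this should need the whole dataset in memory at once, which makes the blow-up surprising.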
Is there a problem with the code, or rather with TFX / Apache Beam?