
Memory leak in TFX Transform when using Apache Beam runner  #143

Open
@wsuchy

Description


There is a dataset https://www.kaggle.com/generall/oneshotwikilinks (~6GB uncompressed) that has been converted to Parquet format:

import pandas as pd

df = pd.read_csv("shuffled_dedup_entities.tsv", sep='\t', names=['concept', 'left', 'mention', 'right'])
df.to_parquet("/tmp/wikilinks.parquet")

The code shown here: https://gist.github.com/wsuchy/90fcb59d7a97e377634e096b030abfff consumes more than 120GB of memory (the server's maximum), even though its only task is to read from Parquet, transform, and store the result as TFRecords.

Is this a problem with the code, or with TFX / Apache Beam?
