Description
There is a dataset, https://www.kaggle.com/generall/oneshotwikilinks (~6 GB uncompressed), that has been transformed into Parquet format:

```python
import pandas as pd

df = pd.read_csv("shuffled_dedup_entities.tsv", sep='\t',
                 names=['concept', 'left', 'mention', 'right'])
df.to_parquet("/tmp/wikilinks.parquet")
```
The code shown here, https://gist.github.com/wsuchy/90fcb59d7a97e377634e096b030abfff, consumes more than 120 GB of memory (the server's maximum), even though its only task is to read the Parquet file, transform the records, and store them as TFRecords.
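For reference, the task amounts to a pipeline along these lines. This is a minimal sketch, not the gist's actual code; the output path and the `to_tf_example` encoding (all four columns stored as byte features) are my assumptions:

```python
import apache_beam as beam
import tensorflow as tf

def to_tf_example(row):
    # Hypothetical encoding: pack the four string columns into a tf.train.Example.
    def bytes_feature(value):
        return tf.train.Feature(
            bytes_list=tf.train.BytesList(value=[value.encode('utf-8')]))
    features = {name: bytes_feature(row[name])
                for name in ('concept', 'left', 'mention', 'right')}
    example = tf.train.Example(features=tf.train.Features(feature=features))
    return example.SerializeToString()

with beam.Pipeline() as pipeline:
    (pipeline
     | 'ReadParquet' >> beam.io.ReadFromParquet('/tmp/wikilinks.parquet')
     | 'ToTFExample' >> beam.Map(to_tf_example)
     | 'WriteTFRecords' >> beam.io.WriteToTFRecord('/tmp/wikilinks.tfrecord'))
```

In principle, `ReadFromParquet` yields one record dict at a time, so nothing in a pipeline like this should need the whole dataset in memory at once, which makes the blow-up surprising.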
Is there a problem with the code, or rather with TFX / Apache Beam?