Skip to content

TransformDataset doesn't process the data in paralell (uses only single worker) #146

Open
@wsuchy

Description

@wsuchy

When using multiple input files and FnApiRunner / SUBPROCESS_SDK runner:

pipeline_options = PipelineOptions(['--direct_num_workers', str(workers)])
return beam.Pipeline(options=pipeline_options,
                         runner=fn_api_runner.FnApiRunner(
                             default_environment=beam_runner_api_pb2.Environment(
                                 urn=python_urns.SUBPROCESS_SDK,
                                 payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
                                         % sys.executable.encode('ascii'))))

the tft_beam.AnalyzeAndTransformDataset really uses all workers, generates multiple output files, which makes processing quite fast (gist: analyze_and_transform()) .

The tft_beam.TransformDataset however uses only one worker and produces only one output file (gist: transform_only()). This makes almost impossible to process test and validation dastasets within a reasonable amount of time.

Is there a problem with my code or is it a bug?

GIST: https://gist.github.com/wsuchy/0c89b27a72b457ae6c904d8786658d2e
Dataset comes from https://www.kaggle.com/generall/oneshotwikilinks and has been processed using prepare_dataset function

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions