TransformDataset doesn't process the data in parallel (uses only a single worker) #146

When using multiple input files and the `FnApiRunner` with the `SUBPROCESS_SDK` environment:
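```
import sys
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.portability import python_urns
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.runners.portability import fn_api_runner

pipeline_options = PipelineOptions(['--direct_num_workers', str(workers)])
return beam.Pipeline(options=pipeline_options,
                     runner=fn_api_runner.FnApiRunner(
                         default_environment=beam_runner_api_pb2.Environment(
                             urn=python_urns.SUBPROCESS_SDK,
                             payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
                                     % sys.executable.encode('ascii'))))
```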
`tft_beam.AnalyzeAndTransformDataset` does use all workers and generates multiple output files, which makes processing quite fast (gist: `analyze_and_transform()`).

`tft_beam.TransformDataset`, however, uses only one worker and produces only one output file (gist: `transform_only()`). This makes it almost impossible to process the test and validation datasets within a reasonable amount of time.
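To make the difference easy to check, here is a minimal sketch of how the number of output files can be counted for each run. It assumes the outputs use Beam's default shard naming (`<prefix>-SSSSS-of-NNNNN<suffix>`); `count_shards` and the prefixes in the note below are hypothetical names for illustration and are not taken from the gist.

```python
import glob
import re

# Beam file sinks name shards "<prefix>-SSSSS-of-NNNNN<suffix>" by default.
_SHARD_RE = re.compile(r"-\d{5}-of-\d{5}")


def count_shards(output_prefix):
    """Return how many shard files exist for the given output prefix."""
    # Escape the prefix so that special characters in the path are matched literally.
    candidates = glob.glob(glob.escape(output_prefix) + "-*-of-*")
    # Keep only paths whose remainder starts with a well-formed shard marker.
    return sum(
        1 for path in candidates
        if _SHARD_RE.match(path[len(output_prefix):])
    )
```

For example, `count_shards('/tmp/out/train_transformed')` versus `count_shards('/tmp/out/test_transformed')` (both paths are placeholders) should show several shards for the `analyze_and_transform()` output and a single shard for the `transform_only()` output if the behaviour described above holds.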

Is there a problem with my code or is it a bug?

Gist: https://gist.github.com/wsuchy/0c89b27a72b457ae6c904d8786658d2e

The dataset comes from https://www.kaggle.com/generall/oneshotwikilinks and has been preprocessed using the `prepare_dataset` function.
