Description
When using multiple input files and the FnApiRunner / SUBPROCESS_SDK runner:
import sys
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.portability import python_urns
from apache_beam.portability.api import beam_runner_api_pb2
from apache_beam.runners.portability import fn_api_runner

def make_pipeline(workers):  # wrapper name added here for readability; see the gist for the full helper
    pipeline_options = PipelineOptions(['--direct_num_workers', str(workers)])
    return beam.Pipeline(options=pipeline_options,
                         runner=fn_api_runner.FnApiRunner(
                             default_environment=beam_runner_api_pb2.Environment(
                                 urn=python_urns.SUBPROCESS_SDK,
                                 payload=b'%s -m apache_beam.runners.worker.sdk_worker_main'
                                 % sys.executable.encode('ascii'))))
tft_beam.AnalyzeAndTransformDataset really uses all the workers and generates multiple output files, which makes processing quite fast (gist: analyze_and_transform()).
tft_beam.TransformDataset, however, uses only one worker and produces only one output file (gist: transform_only()). This makes it almost impossible to process the test and validation datasets within a reasonable amount of time.
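For comparison, the transform-only step follows the usual TransformDataset pattern, roughly like the sketch below. It reuses the placeholder names from the sketch above and loads the saved transform_fn back with tft_beam.ReadTransformFn; again this is an illustration, not the actual transform_only() code from the gist:

import tempfile
import apache_beam as beam
import tensorflow_transform.beam as tft_beam

# RAW_DATA_METADATA and make_pipeline are the placeholders from the previous sketch.
with make_pipeline(workers=8) as pipeline:
    with tft_beam.Context(temp_dir=tempfile.mkdtemp()):
        # Load the transform_fn written by the analyze step above.
        transform_fn = pipeline | tft_beam.ReadTransformFn('/tmp/tft_transform_fn')
        raw_test_data = pipeline | beam.Create(
            [{'text': 'c', 'label': 3}, {'text': 'd', 'label': 4}])
        transformed_test_data, _ = (
            ((raw_test_data, RAW_DATA_METADATA), transform_fn)
            | tft_beam.TransformDataset())
        # This stage ends up on a single worker and writes a single output file.
        _ = (transformed_test_data
             | beam.Map(str)
             | beam.io.WriteToText('/tmp/tft_test/part'))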
Is there a problem with my code or is it a bug?
GIST: https://gist.github.com/wsuchy/0c89b27a72b457ae6c904d8786658d2e
The dataset comes from https://www.kaggle.com/generall/oneshotwikilinks and has been processed using the prepare_dataset function.