Description
Hello, for testing purposes we wanted to see whether `tft.experimental.approximate_vocabulary` is faster than `tft.vocabulary` when there are many features (we have 36 features to analyze). In the past we hit the "graph too large" error on Dataflow when using `tft.vocabulary` (worked around with the `upload_graph` experiment, but we still run into some limits).
However, in our comparison the approximate version is at least 4 times slower, and we don't understand why.
Here is our transform function; we reload the data with tfxio from a TFRecord dataset:
```python
for key in custom_config[VOCABULARY_KEYS]:  # 36 features for that dataset
    ragged = tf.RaggedTensor.from_sparse(inputs[key])
    weights = _compute_weights(ragged, inputs['nbr_target_event'])  # function written in TF
    # Build a vocabulary for this feature.
    _ = tft.vocabulary(inputs[key],
                       weights=weights,
                       vocab_filename=f'{key}{VOCAB}',
                       store_frequency=True,
                       top_k=200000)
```
For the comparison, we replaced `tft.vocabulary` with `tft.experimental.approximate_vocabulary`.
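For reference, the replacement call looks roughly like this (same loop and arguments as above; note that `top_k` is a required argument for the approximate analyzer):

```python
for key in custom_config[VOCABULARY_KEYS]:  # 36 features for that dataset
    ragged = tf.RaggedTensor.from_sparse(inputs[key])
    weights = _compute_weights(ragged, inputs['nbr_target_event'])
    # Same vocabulary computation, but with the approximate top-k analyzer.
    _ = tft.experimental.approximate_vocabulary(
        inputs[key],
        top_k=200000,
        weights=weights,
        vocab_filename=f'{key}{VOCAB}',
        store_frequency=True)
```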
Here are our stats for `vocabulary` vs. `approximate_vocabulary`:

| Metric | `vocabulary` | `approximate_vocabulary` |
| --- | --- | --- |
| Duration | 25 minutes (21.444 vCPU hr) | 2 hours (250 vCPU hr) |
| batch_size MEAN | 900 | 32 (seems very strange) |
| batch_size MAX | 1000 | 1000 |

The slowest operations in the approximate vocab run are `TFXIOReadAndDecode[AnalysisIndex3]/RawRecordToRecordBatch/RawRecordToRecordBatch/Decode` and `GroupByKey`.
CPU usage peaks very quickly at 100% with all 8 machines started, but the throughput is much lower than with `tft.vocabulary`.
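One thing we are considering in order to test the batch-size hypothesis: when driving TFT directly with `tft_beam` (we normally go through the TFX Transform component, so take this as an illustrative sketch with hypothetical `raw_dataset` / `tensor_adapter_config` inputs), the analysis batch size can be pinned via `desired_batch_size`:

```python
import apache_beam as beam
import tensorflow_transform.beam as tft_beam

with beam.Pipeline(argv=beam_pipeline_args) as pipeline:
    # Pin the batch size to rule out Beam's dynamic batching, which appears
    # to settle on a MEAN batch size of 32 in the approximate vocab run.
    with tft_beam.Context(temp_dir=_temp_location, desired_batch_size=1000):
        transformed_dataset, transform_fn = (
            (raw_dataset, tensor_adapter_config)  # hypothetical tfxio inputs
            | 'AnalyzeAndTransform'
            >> tft_beam.AnalyzeAndTransformDataset(preprocessing_fn))
```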
We use the following `beam_pipeline_args`:
```python
beam_pipeline_args = [
    '--runner=DataflowRunner',
    '--no_use_public_ips',
    '--disk_size_gb=200',
    '--num_workers=1',  # initial number of workers
    '--autoscaling_algorithm=THROUGHPUT_BASED',
    '--network=private-experiments-network',
    '--max_num_workers=8',
    '--experiments=use_runner_v2',
    '--experiments=use_network_tags=ssh',
    '--experiments=upload_graph',
    '--temp_location=' + _temp_location,
    '--project=' + GOOGLE_CLOUD_PROJECT,
    f'--worker_harness_container_image={docker_image_full_uri}',
    '--region=' + GCP_REGION,
    '--machine_type=e2-standard-16',
]
```
Do you see a reason why `approximate_vocabulary` would be that much slower than `vocabulary` here? We tried reducing `top_k`, but it did not change much.
Thanks,