Description
We recently encountered scalability issues when applying vocabularies to multiple (five, to be exact) categorical features. We saw multiple lines of the following warning message:
WARNING:tensorflow:Tables initialized inside a tf.function will be re-initialized on every invocation of the function. This re-initialization can have significant impact on performance. Consider lifting them out of the graph context using `tf.init_scope`.
When using tft.apply_vocabulary, the job would get stuck on the transformation step for hours, consuming thousands of CPU hours if we did not kill it early.
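For context, a preprocessing_fn along the following lines reproduces the pattern that triggers the warning. This is a minimal sketch, assuming five string features; the feature names are hypothetical.

import tensorflow_transform as tft

CATEGORICAL_FEATURES = ["f1", "f2", "f3", "f4", "f5"]  # hypothetical names

def preprocessing_fn(inputs):
    outputs = {}
    for key in CATEGORICAL_FEATURES:
        # Compute a vocabulary and map each string to its index. Each
        # tft.apply_vocabulary call builds a lookup table inside the
        # traced graph, which triggers the warning above.
        vocab_path = tft.vocabulary(inputs[key], vocab_filename=key)
        outputs[key + "_id"] = tft.apply_vocabulary(inputs[key], vocab_path)
    return outputs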
Creating a custom lookup-table initialization function like the following bypasses the problem; with it, 80M rows of data took only 35 minutes, consuming ~20 hours of CPU time.
import tensorflow as tf

def create_file_lookup(filename):
    # Lift the table out of the tf.function graph context so it is
    # initialized once rather than on every invocation.
    with tf.init_scope():
        initializer = tf.lookup.TextFileInitializer(
            filename,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
            value_index_offset=1,  # indices start from 1; 0 is the OOV default
        )
        table = tf.lookup.StaticHashTable(initializer, default_value=0)
    return table
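A minimal usage sketch, assuming the vocabulary has already been materialized as a text file with one token per line (the path and tokens are illustrative):

import tensorflow as tf

# Write a toy vocabulary file: one token per line.
with open("/tmp/vocab.txt", "w") as f:
    f.write("apple\nbanana\ncherry\n")

table = create_file_lookup("/tmp/vocab.txt")
# "banana" is on line 1, so with value_index_offset=1 it maps to 2;
# out-of-vocabulary tokens fall back to the default value 0.
print(table.lookup(tf.constant(["banana", "unknown"])).numpy())  # [2 0]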
Relevant code that needs to be addressed:
transform/tensorflow_transform/mappers.py, line 1114 in 520ebb4
This fix probably needs to be applied to TFT versions starting from 1.0.