apply_vocabulary lookup table initialization needs to be wrapped inside tf.init_scope #249

Open
@EdwardCuiPeacock

We recently encountered scalability issues when applying vocabularies to multiple (five, to be exact) categorical features. We saw multiple lines of the following warning message:

WARNING:tensorflow:Tables initialized inside a tf.function will be re-initialized on every invocation of the function. This re-initialization can have significant impact on performance. Consider lifting them out of the graph context using `tf.init_scope`.

When using tft.apply_vocabulary, the job would get stuck on the transformation step for hours, consuming thousands of CPU hours unless we killed it early.

Creating a custom lookup table initialization function like the following bypasses the problem: 80M rows of data took only 35 min, consuming ~20 hours of CPU time.

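import tensorflow as tf

def create_file_lookup(filename):
    # Lift table creation out of any enclosing tf.function so that the
    # table is initialized once instead of on every invocation.
    with tf.init_scope():
        initializer = tf.lookup.TextFileInitializer(
            filename,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
            value_index_offset=1,  # starting from 1
        )
        # Keys missing from the vocabulary file map to the default value 0.
        table = tf.lookup.StaticHashTable(initializer, 0)

    return table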
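For reference, here is a minimal sketch of how a table built this way might be used. The temporary vocabulary file, its three entries, and the `lookup_ids` helper are made up for illustration and are not part of TFT. The table is built once, outside the traced function, and then reused inside a `tf.function`:

```python
import os
import tempfile

import tensorflow as tf


def create_file_lookup(filename):
    # Same pattern as the workaround: build the table under tf.init_scope.
    with tf.init_scope():
        initializer = tf.lookup.TextFileInitializer(
            filename,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
            value_index_offset=1,  # first line maps to 1
        )
        table = tf.lookup.StaticHashTable(initializer, 0)
    return table


# Hypothetical vocabulary file with one token per line.
vocab_path = os.path.join(tempfile.mkdtemp(), "vocab.txt")
with open(vocab_path, "w") as f:
    f.write("apple\nbanana\ncherry\n")

# Build the table once, outside the traced function.
table = create_file_lookup(vocab_path)


@tf.function
def lookup_ids(tokens):
    # The already-initialized table is captured by the function
    # rather than being re-created on each call.
    return table.lookup(tokens)


ids = lookup_ids(tf.constant(["apple", "cherry", "durian"]))
print(ids.numpy().tolist())  # [1, 3, 0]; "durian" is out of vocabulary
```

With `value_index_offset=1`, in-vocabulary tokens get their 1-based line number, and anything not in the file falls back to the table's default value of 0.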

Relevant code that needs to be addressed:

initializer = tf.lookup.TextFileInitializer(

This probably needs to be applied to TFT versions starting from 1.0.
