Description
We recently encountered scalability issues when applying vocabularies to multiple (five, to be exact) categorical features. We saw multiple lines of the following warning message:
WARNING:tensorflow:Tables initialized inside a tf.function will be re-initialized on every invocation of the function. This re-initialization can have significant impact on performance. Consider lifting them out of the graph context using `tf.init_scope`.
When using tft.apply_vocabulary, the job would get stuck on the transformation step for hours, consuming thousands of CPU hours if we did not kill it early.
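For context, a preprocessing_fn along the following lines reproduces the pattern that triggers the warning. This is a minimal sketch, assuming five string features; the feature names are hypothetical.

import tensorflow_transform as tft

CATEGORICAL_FEATURES = ["f1", "f2", "f3", "f4", "f5"]  # hypothetical names

def preprocessing_fn(inputs):
    outputs = {}
    for key in CATEGORICAL_FEATURES:
        # Compute a vocabulary and map each string to its index. Each
        # tft.apply_vocabulary call builds a lookup table inside the
        # traced graph, which triggers the warning above.
        vocab_path = tft.vocabulary(inputs[key], vocab_filename=key)
        outputs[key + "_id"] = tft.apply_vocabulary(inputs[key], vocab_path)
    return outputs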
Creating a custom lookup-table initialization function like the following bypasses the problem; with it, 80M rows of data took only 35 minutes, consuming ~20 hours of CPU time.
import tensorflow as tf

def create_file_lookup(filename):
    # Lift the table out of the tf.function graph context so it is
    # initialized once rather than on every invocation.
    with tf.init_scope():
        initializer = tf.lookup.TextFileInitializer(
            filename,
            key_dtype=tf.string,
            key_index=tf.lookup.TextFileIndex.WHOLE_LINE,
            value_dtype=tf.int64,
            value_index=tf.lookup.TextFileIndex.LINE_NUMBER,
            value_index_offset=1,  # indices start from 1; 0 is the OOV default
        )
        table = tf.lookup.StaticHashTable(initializer, default_value=0)
    return table
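A minimal usage sketch, assuming the vocabulary has already been materialized as a text file with one token per line (the path and tokens are illustrative):

import tensorflow as tf

# Write a toy vocabulary file: one token per line.
with open("/tmp/vocab.txt", "w") as f:
    f.write("apple\nbanana\ncherry\n")

table = create_file_lookup("/tmp/vocab.txt")
# "banana" is on line 1, so with value_index_offset=1 it maps to 2;
# out-of-vocabulary tokens fall back to the default value 0.
print(table.lookup(tf.constant(["banana", "unknown"])).numpy())  # [2 0]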
Relevant code that needs to be addressed:
transform/tensorflow_transform/mappers.py, line 1114 in 520ebb4
This fix probably needs to be applied to TFT versions starting from 1.0.