
tft.compute_and_apply_vocabulary is not robust to RaggedTensor type #265

Open
@chrispankow

Description


TensorFlow version 2.8.0, TFT version 1.7.0.

I am currently working on a module that has some multivalent inputs as well as a multi-hot label endpoint. Both need similar transform and feature engineering: split a string into tokens, then map the tokens to a sequence of integers that is fed to an embedding table. However, the number of tokens in a given example string is not constant, and tft.compute_and_apply_vocabulary seems to be unable to handle the RaggedTensor output of tf.strings.split. In the context of the full model:

def _preprocess_multivalent_feature(feature, ncats):
    # Split the comma-delimited string into tokens; this yields a RaggedTensor.
    raw_values = tf.strings.split(feature, ',')
    # Map each token to an integer id via a vocabulary computed over the dataset.
    coded_values = tft.compute_and_apply_vocabulary(raw_values, num_oov_buckets=1)
    return coded_values

which lands me at the following (snipped for brevity):

TypeError                                 Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
    548     try:
--> 549       str_values = [compat.as_bytes(x) for x in proto_values]
    550     except TypeError:

/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_util.py in <listcomp>(.0)
    548     try:
--> 549       str_values = [compat.as_bytes(x) for x in proto_values]
    550     except TypeError:

/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/compat.py in as_bytes(bytes_or_text, encoding)
     86     raise TypeError('Expected binary or unicode string, got %r' %
---> 87                     (bytes_or_text,))
     88 

TypeError: Expected binary or unicode string, got tf.RaggedTensor(values=tf.RaggedTensor(values=Tensor("StringSplit/StringSplit/StringSplit/StringSplitV2:1", shape=(None,), dtype=string), row_splits=Tensor("StringSplit/StringSplit/StringSplit/RaggedFromValueRowIds/RowPartitionFromValueRowIds/concat:0", shape=(None,), dtype=int64)), row_splits=Tensor("StringSplit/RaggedFromTensor/RaggedFromUniformRowLength/RowPartitionFromUniformRowLength/mul:0", shape=(None,), dtype=int64))

During handling of the above exception, another exception occurred:

[snip]

/app/pipeline/components/transform.py in _preprocess_multivalent_feature(feature, ncats)
     69 def _preprocess_multivalent_feature(feature):
     70     raw_values = tf.strings.split(feature, ',')
---> 71     coded_values = tft.compute_and_apply_vocabulary(raw_values, num_oov_buckets=1)
     72     return coded_values

[snip]

/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
    551       raise TypeError("Failed to convert object of type %s to Tensor. "
    552                       "Contents: %s. Consider casting elements to a "
--> 553                       "supported type." % (type(values), values))
    554     tensor_proto.string_val.extend(str_values)
    555     return tensor_proto

TypeError: Failed to convert object of type <class 'tensorflow.python.ops.ragged.ragged_tensor.RaggedTensor'> to Tensor. Contents: tf.RaggedTensor(values=tf.RaggedTensor(values=Tensor("StringSplit/StringSplit/StringSplit/StringSplitV2:1", shape=(None,), dtype=string), row_splits=Tensor("StringSplit/StringSplit/StringSplit/RaggedFromValueRowIds/RowPartitionFromValueRowIds/concat:0", shape=(None,), dtype=int64)), row_splits=Tensor("StringSplit/RaggedFromTensor/RaggedFromUniformRowLength/RowPartitionFromUniformRowLength/mul:0", shape=(None,), dtype=int64)). Consider casting elements to a supported type.
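For reference, the nested RaggedTensor structure in the error above can be reproduced with a small standalone snippet, assuming the feature arrives batched with shape [batch, 1] (an assumption on my part; scalar string features typically do after tf.Example parsing):

import tensorflow as tf

# Hypothetical input: one comma-delimited string per example, shape [batch, 1].
feature = tf.constant([["a,b"], ["b,c,d"], [""]])

split = tf.strings.split(feature, ",")
print(split.shape)  # (3, 1, None): a RaggedTensor whose values are themselves ragged,
                    # matching the tf.RaggedTensor(values=tf.RaggedTensor(...)) in the error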

Commenting out the tf.strings.split (and thus leaving the inputs as whole strings) allows the pipeline execution to continue. Despite my efforts, I cannot reproduce this exactly with a smaller example (this is work to scale up the pipeline through TFX, and providing that is out of scope for this issue). I am able to produce a RaggedTensor from the output of a similar function in a working example with faked inputs. However, I have a hard time believing that the tensor produced by that example would be usable by the Embedding layer it is putatively going to be connected to:

Raw data:
[{'x': ['a,b', 'b,c,d', '']}, {'x': ['a,b,c', 'd', 'e,f,g,h,i']}]

Transformed data:
[{'_x$ragged_values': array([3, 0, 0, 2, 1, 9]),
  '_x$row_lengths_1': array([2, 3, 1])},
 {'_x$ragged_values': array([3, 0, 2, 1, 8, 7, 6, 5, 4]),
  '_x$row_lengths_1': array([3, 1, 5])}]
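For what it's worth, the ragged components in the transformed output can be reassembled (and padded, if necessary) downstream. A minimal sketch using the values from the first transformed record above, with the padding id chosen arbitrarily:

import numpy as np
import tensorflow as tf

# Ragged components copied from the first transformed record above.
values = np.array([3, 0, 0, 2, 1, 9], dtype=np.int64)
row_lengths = np.array([2, 3, 1], dtype=np.int64)

# Rebuild the ragged structure: [[3, 0], [0, 2, 1], [9]]
ragged = tf.RaggedTensor.from_row_lengths(values, row_lengths)

# Pad to a dense tensor in case the Embedding layer cannot take ragged input
# directly; -1 is a stand-in for whatever padding id the model reserves.
dense = ragged.to_tensor(default_value=-1)

(Recent versions of tf.keras.layers.Embedding can reportedly consume RaggedTensor inputs directly, which would make the padding step unnecessary, but I have not verified that here.)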

Generating a normal (dense) tensor from the ragged one would be undesirable, though perhaps acceptable. However, the obvious change,

return coded_values -> return coded_values.to_tensor()

hits the same problem, since the error is raised earlier, inside compute_and_apply_vocabulary itself.
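One workaround I have considered but not yet tried is to flatten the feature to rank 1 before splitting and hand compute_and_apply_vocabulary a SparseTensor instead of a RaggedTensor, since SparseTensor inputs have been supported for much longer. A sketch (untested against this pipeline; the reshape assumes the feature arrives with shape [batch, 1]):

def _preprocess_multivalent_feature(feature, ncats):
    # Drop the trailing dimension so the split yields a 2-D ragged tensor,
    # then convert it to a SparseTensor before the vocabulary op sees it.
    flat = tf.reshape(feature, [-1])
    raw_values = tf.strings.split(flat, ',')
    sparse_values = raw_values.to_sparse()
    coded_values = tft.compute_and_apply_vocabulary(sparse_values, num_oov_buckets=1)
    return coded_values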

Any advice is appreciated.
