Description
TensorFlow version 2.8.0, TFT version 1.7.0.
I am currently working on constructing a module which has some multivalent inputs, as well as a multi-hot label endpoint. Both of these need a similar transform and feature engineering step: convert a string into tokens, then map the tokens to a sequence of integers which are fed to an embedding table. However, the number of tokens in a given example string is not constant, and tft.compute_and_apply_vocabulary seems to be unable to parse the output of tf.strings.split. In the context of the full model:
def _preprocess_multivalent_feature(feature, ncats):
    raw_values = tf.strings.split(feature, ',')
    coded_values = tft.compute_and_apply_vocabulary(raw_values, num_oov_buckets=1)
    return coded_values
which lands me at (snipped for brevity)
TypeError Traceback (most recent call last)
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
548 try:
--> 549 str_values = [compat.as_bytes(x) for x in proto_values]
550 except TypeError:
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_util.py in <listcomp>(.0)
548 try:
--> 549 str_values = [compat.as_bytes(x) for x in proto_values]
550 except TypeError:
/opt/conda/lib/python3.7/site-packages/tensorflow/python/util/compat.py in as_bytes(bytes_or_text, encoding)
86 raise TypeError('Expected binary or unicode string, got %r' %
---> 87 (bytes_or_text,))
88
TypeError: Expected binary or unicode string, got tf.RaggedTensor(values=tf.RaggedTensor(values=Tensor("StringSplit/StringSplit/StringSplit/StringSplitV2:1", shape=(None,), dtype=string), row_splits=Tensor("StringSplit/StringSplit/StringSplit/RaggedFromValueRowIds/RowPartitionFromValueRowIds/concat:0", shape=(None,), dtype=int64)), row_splits=Tensor("StringSplit/RaggedFromTensor/RaggedFromUniformRowLength/RowPartitionFromUniformRowLength/mul:0", shape=(None,), dtype=int64))
During handling of the above exception, another exception occurred:
[snip]
/app/pipeline/components/transform.py in _preprocess_multivalent_feature(feature, ncats)
69 def _preprocess_multivalent_feature(feature):
70 raw_values = tf.strings.split(feature, ',')
---> 71 coded_values = tft.compute_and_apply_vocabulary(raw_values, num_oov_buckets=1)
72 return coded_values
[snip]
/opt/conda/lib/python3.7/site-packages/tensorflow/python/framework/tensor_util.py in make_tensor_proto(values, dtype, shape, verify_shape, allow_broadcast)
551 raise TypeError("Failed to convert object of type %s to Tensor. "
552 "Contents: %s. Consider casting elements to a "
--> 553 "supported type." % (type(values), values))
554 tensor_proto.string_val.extend(str_values)
555 return tensor_proto
TypeError: Failed to convert object of type <class 'tensorflow.python.ops.ragged.ragged_tensor.RaggedTensor'> to Tensor. Contents: tf.RaggedTensor(values=tf.RaggedTensor(values=Tensor("StringSplit/StringSplit/StringSplit/StringSplitV2:1", shape=(None,), dtype=string), row_splits=Tensor("StringSplit/StringSplit/StringSplit/RaggedFromValueRowIds/RowPartitionFromValueRowIds/concat:0", shape=(None,), dtype=int64)), row_splits=Tensor("StringSplit/RaggedFromTensor/RaggedFromUniformRowLength/RowPartitionFromUniformRowLength/mul:0", shape=(None,), dtype=int64)). Consider casting elements to a supported type.
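For context, here is a minimal standalone sketch (with faked data, outside the TFX pipeline) of what the split produces; the (batch, 1) input shape is an assumption on my part, but it matches the nested RaggedTensor type reported in the traceback above:

import tensorflow as tf

# Faked (batch, 1) string feature; this shape is an assumption, chosen to match
# the RaggedFromUniformRowLength partition visible in the traceback above.
feature = tf.constant([['a,b'], ['b,c,d'], ['']])
raw_values = tf.strings.split(feature, ',')
print(raw_values.ragged_rank)  # 2: a RaggedTensor whose values are a RaggedTensor
print(raw_values)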
Commenting out the tf.strings.split call (and thus leaving the features as whole strings) allows the pipeline execution to continue. Despite my efforts, I cannot reproduce this exactly with a smaller example (this is work to scale up the pipeline through TFX, and providing that is out of scope for this issue). I am able to produce a RaggedTensor with the output of a similar function in a working example with faked inputs. However, I have a hard time believing that the tensor produced by that example would be usable by the Embedding layer it is putatively going to be connected to:
Raw data:
[{'x': ['a,b', 'b,c,d', '']}, {'x': ['a,b,c', 'd', 'e,f,g,h,i']}]
Transformed data:
[{'_x$ragged_values': array([3, 0, 0, 2, 1, 9]),
  '_x$row_lengths_1': array([2, 3, 1])},
 {'_x$ragged_values': array([3, 0, 2, 1, 8, 7, 6, 5, 4]),
  '_x$row_lengths_1': array([3, 1, 5])}]
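For what it's worth, this is roughly how I would expect to rebuild the ragged structure from those transformed components and hand it to the Embedding layer. This is only a sketch: the feature names come from the transformed output above, and I am assuming the Embedding layer accepts RaggedTensor inputs directly in this TF version:

import numpy as np
import tensorflow as tf

# Components copied from the first transformed instance above.
ragged_values = np.array([3, 0, 0, 2, 1, 9])
row_lengths = np.array([2, 3, 1])

# Rebuild the ragged structure: [[3, 0], [0, 2, 1], [9]]
x = tf.RaggedTensor.from_row_lengths(ragged_values, row_lengths)

# Assumption: Embedding handles ragged ids directly, giving shape (3, None, 4).
embedded = tf.keras.layers.Embedding(input_dim=11, output_dim=4)(x)
print(embedded.shape)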
Generating a normal (dense) tensor from the ragged one would be highly undesirable, though perhaps acceptable. I tried the change:
return coded_values -> return coded_values.to_tensor()
However, that hits the same problem, since the exception is raised earlier, inside compute_and_apply_vocabulary.
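For completeness, the attempted change in context:

import tensorflow as tf
import tensorflow_transform as tft

def _preprocess_multivalent_feature(feature, ncats):
    raw_values = tf.strings.split(feature, ',')
    # The exception is raised here, inside compute_and_apply_vocabulary, so the
    # to_tensor() call below is never reached.
    coded_values = tft.compute_and_apply_vocabulary(raw_values, num_oov_buckets=1)
    return coded_values.to_tensor()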
Any advice is appreciated.