Add low level support for shredding and unshredding #7715

Open
@scovich

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

The variant shredding specification allows variant values to be "shredded", so that part of the overall variant is strongly typed and the rest remains normal binary variant. Working with shredded variant values requires writers to pull out the specific subsets of a variant object that match a target schema ("shredding"). It also requires readers to potentially "unshred" by injecting strongly typed values back into the binary variant they came from.

Partial shredding of object values increases the complexity significantly -- some fields of an object could be shredded out as a struct, while others are not. And so on, recursively.
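To make the recursion concrete, here is a minimal sketch of partial shredding and unshredding over a toy in-memory model. Everything in it (`Value`, `Schema`, `Typed`, `Shredded`, `shred`, `unshred`) is a hypothetical name invented for illustration; it does not use the real binary variant encoding or any existing parquet-variant API, and it only models integers, strings, and objects.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a variant value. NOT the real binary encoding.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Hypothetical shredding schema: which parts to pull out as typed data.
#[derive(Debug, Clone)]
enum Schema {
    Int,
    Object(BTreeMap<String, Schema>),
}

/// The strongly typed part of a shredded value.
#[derive(Debug, Clone, PartialEq)]
enum Typed {
    Int(i64),
    Object(BTreeMap<String, Shredded>),
}

/// A shredded value: a typed part, an untyped residual, or both.
#[derive(Debug, Clone, PartialEq)]
struct Shredded {
    typed: Option<Typed>,
    residual: Option<Value>,
}

/// Split `value` into the part matching `schema` and whatever is left over.
fn shred(value: Value, schema: &Schema) -> Shredded {
    match (value, schema) {
        (Value::Int(i), Schema::Int) => Shredded { typed: Some(Typed::Int(i)), residual: None },
        (Value::Object(mut fields), Schema::Object(wanted)) => {
            let mut typed = BTreeMap::new();
            for (name, field_schema) in wanted {
                // Fields named by the schema move to the typed side, recursively.
                if let Some(v) = fields.remove(name) {
                    typed.insert(name.clone(), shred(v, field_schema));
                }
            }
            // Fields the schema does not mention stay behind as the residual.
            let residual = if fields.is_empty() { None } else { Some(Value::Object(fields)) };
            Shredded { typed: Some(Typed::Object(typed)), residual }
        }
        // Type mismatch: leave the whole value unshredded.
        (other, _) => Shredded { typed: None, residual: Some(other) },
    }
}

/// Inverse of `shred`: inject the typed parts back into the residual.
fn unshred(s: Shredded) -> Option<Value> {
    match (s.typed, s.residual) {
        (None, residual) => residual,
        (Some(Typed::Int(i)), _) => Some(Value::Int(i)),
        (Some(Typed::Object(typed)), residual) => {
            let mut fields = match residual {
                Some(Value::Object(f)) => f,
                _ => BTreeMap::new(),
            };
            for (name, child) in typed {
                if let Some(v) = unshred(child) {
                    fields.insert(name, v);
                }
            }
            Some(Value::Object(fields))
        }
    }
}
```

In this model, `unshred(shred(v, &schema))` returns `Some(v)` for any schema. The real operations would need the same round-trip property while working on encoded bytes rather than on a decoded tree.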

NOTE: The specification mandates that the variant metadata dictionary must contain all path parts, regardless of whether a given path is shredded or not. So (un)shredding operations do not need to modify the variant metadata dictionary.

Describe the solution you'd like

Ultimately, shredding and unshredding will be a problem for arrow-array and/or arrow-compute to solve (see below). But those higher level operations will need low-level support from Variant and its decoders/builders in order to do their work.

We should start figuring out what that low-level support looks like. A likely starting point would be the ability to insert and remove specific variant values from an existing variant object. These should be cheap byte-shuffling operations that don't waste time introspecting unrelated parts of the variant value buffer. And it needs to be efficient even when doing recursive inserts and removes as part of a partial (un)shredding operation.
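As a rough illustration of what "cheap byte-shuffling" could mean, the sketch below removes and inserts a field in a deliberately simplified object layout. The layout, the one-byte widths, and the functions `split`, `remove_field`, and `insert_field` are all assumptions made up for this example; the real variant object encoding uses variable-width fields and differs in detail. The point is only that value bytes are copied as opaque ranges and never decoded.

```rust
/// Toy object layout (an assumption for illustration, simpler than the real encoding):
///   [n: u8] [field_id: u8; n] [offset: u8; n + 1] [value bytes...]
/// Offsets are relative to the start of the value bytes, field ids are sorted
/// ascending, and values are stored in field order.
fn split(obj: &[u8]) -> (&[u8], &[u8], &[u8]) {
    let n = obj[0] as usize;
    let ids = &obj[1..1 + n];
    let offsets = &obj[1 + n..2 + 2 * n];
    let values = &obj[2 + 2 * n..];
    (ids, offsets, values)
}

/// Remove one field. Returns the new object bytes and the removed value bytes,
/// or `None` if the field is not present. Other values are never inspected.
fn remove_field(obj: &[u8], field_id: u8) -> Option<(Vec<u8>, Vec<u8>)> {
    let (ids, offsets, values) = split(obj);
    let i = ids.iter().position(|&id| id == field_id)?;
    let (start, end) = (offsets[i] as usize, offsets[i + 1] as usize);
    let removed_len = (end - start) as u8;

    let mut out = Vec::with_capacity(obj.len());
    out.push((ids.len() - 1) as u8);
    out.extend_from_slice(&ids[..i]);
    out.extend_from_slice(&ids[i + 1..]);
    // Offsets up to the removed field are unchanged; later ones shift down.
    out.extend_from_slice(&offsets[..=i]);
    out.extend(offsets[i + 2..].iter().map(|&o| o - removed_len));
    out.extend_from_slice(&values[..start]);
    out.extend_from_slice(&values[end..]);
    Some((out, values[start..end].to_vec()))
}

/// Insert an already-encoded value under `field_id`, which must not be present.
fn insert_field(obj: &[u8], field_id: u8, value: &[u8]) -> Vec<u8> {
    let (ids, offsets, values) = split(obj);
    // Keep ids sorted: insert before the first id greater than `field_id`.
    let p = ids.partition_point(|&id| id < field_id);
    let at = offsets[p] as usize;
    let len = value.len() as u8;

    let mut out = Vec::with_capacity(obj.len() + value.len() + 2);
    out.push((ids.len() + 1) as u8);
    out.extend_from_slice(&ids[..p]);
    out.push(field_id);
    out.extend_from_slice(&ids[p..]);
    // The new field starts at the old offset p; everything after shifts up.
    out.extend_from_slice(&offsets[..=p]);
    out.extend(offsets[p..].iter().map(|&o| o + len));
    out.extend_from_slice(&values[..at]);
    out.extend_from_slice(value);
    out.extend_from_slice(&values[at..]);
    out
}
```

Both functions cost one pass over the object's own bytes, and removing a field and then re-inserting the same bytes reproduces the original object in this toy layout. Doing this recursively without re-copying each nesting level is the harder design question that this sketch does not address.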

At the higher level:

The parquet reader and writer will just use whatever shredding schema they receive from the parquet footer or user, respectively. No special low-level variant support is needed there. But a user wishing to write shredded parquet will need a way to convert an Array of binary variant values into an Array of shredded variant values, or a strongly typed Array (e.g. StructArray) into an Array of shredded variant values. And a user wishing to read shredded parquet will need a way to convert an Array of shredded variant values (with a specific shredding schema) to an Array of binary variant values, or an Array of shredded variant values having a different shredding schema, or a strongly typed Array (e.g. StructArray).

Describe alternatives you've considered

Just starting to think about this, and realizing we should probably start figuring out the low-level building blocks that arrow-array will eventually rely on. Now that we actually have variant builders and decoders, we can probably make progress here.

Additional context

https://github.com/apache/parquet-format/blob/master/VariantShredding.md