Add low level support for shredding and unshredding

**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

The variant shredding specification allows variant values to be "shredded", so that part of the overall variant is strongly typed and the rest remains ordinary binary variant. Working with shredded variant values requires writers to pull out the specific subsets of a variant object that match a target schema ("shredding"). It also requires readers to potentially "unshred" by injecting strongly typed values back into the binary variant they came from.

Partial shredding of object values increases the complexity significantly: some fields of an object could be shredded out as a struct while others are not, and the same split can recur at every nesting level.
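To make the recursive case concrete, here is a minimal sketch of partial object shredding over a toy in-memory model. Everything in it (`Value`, `Schema`, `Shredded`, `shred`, `unshred`) is hypothetical and only illustrates the shape of the problem; it is not the `Variant` API and it works on decoded values rather than on variant bytes.

```rust
use std::collections::BTreeMap;

/// Toy stand-in for a decoded variant value (hypothetical, not the real `Variant`).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
    Object(BTreeMap<String, Value>),
}

/// Hypothetical shredding schema: which fields to pull out, and as what type.
enum Schema {
    Int,
    Object(BTreeMap<String, Schema>),
}

/// Result of shredding one value against a schema.
#[derive(Debug, Clone, PartialEq)]
enum Shredded {
    /// The value did not match the schema and stays an ordinary variant.
    Residual(Value),
    /// The value matched a primitive schema and is now strongly typed.
    Int(i64),
    /// Partially shredded object: fields named by the schema are shredded
    /// recursively, all other fields stay behind in `residual`.
    Object {
        typed: BTreeMap<String, Shredded>,
        residual: BTreeMap<String, Value>,
    },
}

fn shred(value: Value, schema: &Schema) -> Shredded {
    match (value, schema) {
        (Value::Int(i), Schema::Int) => Shredded::Int(i),
        (Value::Object(mut fields), Schema::Object(wanted)) => {
            let mut typed = BTreeMap::new();
            for (name, field_schema) in wanted {
                // Pull the field out of the object if it is present.
                if let Some(v) = fields.remove(name) {
                    typed.insert(name.clone(), shred(v, field_schema));
                }
            }
            Shredded::Object { typed, residual: fields }
        }
        // Type mismatch: keep the whole value unshredded.
        (other, _) => Shredded::Residual(other),
    }
}

fn unshred(shredded: Shredded) -> Value {
    match shredded {
        Shredded::Residual(v) => v,
        Shredded::Int(i) => Value::Int(i),
        Shredded::Object { typed, residual: mut fields } => {
            // Inject the typed fields back into the object they came from.
            for (name, s) in typed {
                fields.insert(name, unshred(s));
            }
            Value::Object(fields)
        }
    }
}
```

In this model `unshred(shred(v, &schema))` returns the original value for any schema. The real operations would need the same round-trip property while working directly on the binary encoding.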

NOTE: The specification mandates that the variant metadata dictionary must contain all path parts, regardless of whether a given path is shredded or not. So (un)shredding operations do not need to modify the variant metadata dictionary.

**Describe the solution you'd like**

Ultimately, shredding and unshredding will be a problem for arrow-array and/or arrow-compute to solve (see below). But those higher level operations will need low-level support from `Variant` and its decoders/builders in order to do their work.

We should start figuring out what that low-level support looks like. A likely starting point would be the ability to insert and remove specific variant values from an existing variant object. These should be cheap byte-shuffling operations that don't waste time introspecting unrelated parts of the variant value buffer, and they need to stay efficient even when doing recursive inserts and removes as part of a partial (un)shredding operation.
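As a rough illustration of what "cheap byte-shuffling" could mean, the sketch below removes and inserts a field in a deliberately simplified object layout. The layout is an assumption made up for this example and is NOT the real Variant object encoding: one count byte, then sorted one-byte field ids, then `count + 1` one-byte offsets that increase monotonically, then the value bytes. The function names are hypothetical as well. The point is only that neither operation decodes any field value; each slices around the affected range and patches the offsets.

```rust
/// Split a toy object into (field ids, offsets, value bytes).
/// Assumed layout: [n][id; n][offset; n + 1][values...], all single bytes.
fn parts(obj: &[u8]) -> Option<(&[u8], &[u8], &[u8])> {
    let n = *obj.first()? as usize;
    Some((
        obj.get(1..1 + n)?,
        obj.get(1 + n..2 + 2 * n)?,
        obj.get(2 + 2 * n..)?,
    ))
}

/// Remove one field, returning (new object bytes, removed value bytes).
/// Assumes well-formed input with monotonically increasing offsets.
fn remove_field(obj: &[u8], field_id: u8) -> Option<(Vec<u8>, Vec<u8>)> {
    let (ids, offsets, values) = parts(obj)?;
    let i = ids.binary_search(&field_id).ok()?;
    let (start, end) = (offsets[i] as usize, offsets[i + 1] as usize);
    let removed = values.get(start..end)?.to_vec();
    let len = (end - start) as u8;

    let mut out = Vec::with_capacity(obj.len());
    out.push((ids.len() - 1) as u8);
    out.extend_from_slice(&ids[..i]);
    out.extend_from_slice(&ids[i + 1..]);
    // Drop the removed field's end offset and shift all later offsets down.
    out.extend_from_slice(&offsets[..=i]);
    out.extend(offsets[i + 2..].iter().map(|o| o - len));
    // Copy the bytes before and after the removed value without inspecting them.
    out.extend_from_slice(&values[..start]);
    out.extend_from_slice(&values[end..]);
    Some((out, removed))
}

/// Insert an already-encoded value under `field_id`, keeping ids sorted.
/// Returns None for duplicates or when the toy one-byte limits would overflow.
fn insert_field(obj: &[u8], field_id: u8, value: &[u8]) -> Option<Vec<u8>> {
    let (ids, offsets, values) = parts(obj)?;
    let i = match ids.binary_search(&field_id) {
        Ok(_) => return None, // field already present
        Err(i) => i,
    };
    let new_count = u8::try_from(ids.len() + 1).ok()?;
    let len = u8::try_from(value.len()).ok()?;
    if values.len() + value.len() > u8::MAX as usize {
        return None;
    }
    let start = offsets[i] as usize;
    if start > values.len() {
        return None;
    }

    let mut out = Vec::with_capacity(obj.len() + value.len() + 2);
    out.push(new_count);
    out.extend_from_slice(&ids[..i]);
    out.push(field_id);
    out.extend_from_slice(&ids[i..]);
    // Offsets up to the insertion point are unchanged; the rest shift up.
    out.extend_from_slice(&offsets[..=i]);
    out.extend(offsets[i..].iter().map(|o| o + len));
    out.extend_from_slice(&values[..start]);
    out.extend_from_slice(value);
    out.extend_from_slice(&values[start..]);
    Some(out)
}
```

Both functions here allocate a new buffer for simplicity. Whether the real building blocks should splice in place, write into a builder, or defer copying across nested inserts and removes is exactly the kind of design question this issue is meant to open.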

At the higher level: 

The parquet reader and writer will just use whatever shredding schema they receive from the parquet footer or user, respectively. No special low-level variant support is needed there. But a user wishing to write shredded parquet will need a way to convert an Array of binary variant values into an Array of shredded variant values, or a strongly typed Array (e.g. StructArray) into an Array of shredded variant values. And a user wishing to read shredded parquet will need a way to convert an Array of shredded variant values (with a specific shredding schema) to an Array of binary variant values, or an Array of shredded variant values having a different shredding schema, or a strongly typed Array (e.g. StructArray).
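For the simplest of these conversions, a column shredded as a single integer type, the round trip might look roughly like the sketch below. It uses plain `Vec`s instead of arrow arrays and a toy `Value` enum instead of binary variant, so every type and function name here is a hypothetical stand-in. The `typed_value` and `value` column names follow the shredding specification: a row that matches the shredded type populates `typed_value`, and any other row falls back to `value`.

```rust
/// Toy stand-in for one binary variant value (hypothetical).
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Int(i64),
    Str(String),
}

/// A column shredded as int64: per row, exactly one of the two slots is set.
#[derive(Debug, PartialEq)]
struct ShreddedColumn {
    typed_value: Vec<Option<i64>>,
    value: Vec<Option<Value>>,
}

/// "Shred": rows that match the target type go to `typed_value`,
/// everything else stays in the fallback `value` column.
fn shred_column(rows: Vec<Value>) -> ShreddedColumn {
    let mut typed_value = Vec::with_capacity(rows.len());
    let mut value = Vec::with_capacity(rows.len());
    for row in rows {
        match row {
            Value::Int(i) => {
                typed_value.push(Some(i));
                value.push(None);
            }
            other => {
                typed_value.push(None);
                value.push(Some(other));
            }
        }
    }
    ShreddedColumn { typed_value, value }
}

/// "Unshred": merge the two columns back into one column of variant values.
/// Returns None if a row has both slots set or neither, which this toy
/// model treats as invalid.
fn unshred_column(col: ShreddedColumn) -> Option<Vec<Value>> {
    col.typed_value
        .into_iter()
        .zip(col.value)
        .map(|pair| match pair {
            (Some(i), None) => Some(Value::Int(i)),
            (None, Some(v)) => Some(v),
            _ => None,
        })
        .collect()
}
```

Converting between two different shredding schemas, or to a strongly typed array, would compose the same two directions per field, which is where the low-level insert and remove operations come in.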

**Describe alternatives you've considered**

Just starting to think about this, and realizing we should probably start figuring out the low-level building blocks that arrow-array will eventually rely on. Now that we actually have variant builders and decoders, we can probably make progress here.

**Additional context**

https://github.com/apache/parquet-format/blob/master/VariantShredding.md

Add low level support for shredding and unshredding #7715

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add low level support for shredding and unshredding #7715

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions