Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
The `BatchCoalescer`'s `push_batch` API incrementally builds up an array and produces a final output.
- `GenericInProgressArray` is a generic implementation that works by buffering `ArrayRef`s and then calling `concat`
- There are specialized implementations such as `InProgressByteViewArray` that are more efficient for certain data types (implemented in Optimize coalesce kernel for StringView (10-50% faster) #7650)

The specialized implementations can go quite a bit faster (30-50%, depending on the case).
Describe the solution you'd like
Improved performance, as measured by benchmarks for the data type named above
cargo bench --bench coalesce_kernels
Describe alternatives you've considered
For `StringArray` and `BinaryArray`, the tricky part will be to avoid copying the string data as much as possible (for example, by pre-allocating buffer space and postponing the copies until the required space is known).
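To make the idea concrete, here is a minimal, std-only sketch of the "defer copies until the required space is known" approach. It does not use the arrow crate: `SimpleStringArray` and `InProgressStrings` are hypothetical stand-ins invented for illustration (an offsets buffer plus a values buffer, no null handling), not existing arrow-rs types, and a real implementation would likely differ.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a StringArray: one contiguous value buffer plus
/// offsets, where `offsets.len()` is the number of strings + 1. Nulls omitted.
#[derive(Debug, Clone, PartialEq)]
pub struct SimpleStringArray {
    offsets: Vec<i32>,
    values: Vec<u8>,
}

impl SimpleStringArray {
    pub fn from_strs(strs: &[&str]) -> Self {
        let mut offsets = Vec::with_capacity(strs.len() + 1);
        let mut values = Vec::new();
        offsets.push(0);
        for s in strs {
            values.extend_from_slice(s.as_bytes());
            offsets.push(values.len() as i32);
        }
        Self { offsets, values }
    }

    pub fn len(&self) -> usize {
        self.offsets.len() - 1
    }

    pub fn value(&self, i: usize) -> &str {
        let (start, end) = (self.offsets[i] as usize, self.offsets[i + 1] as usize);
        std::str::from_utf8(&self.values[start..end]).expect("valid UTF-8")
    }
}

/// Hypothetical in-progress array: `push` only stores a reference-counted
/// handle and updates size counters; no string bytes are copied yet.
#[derive(Default)]
pub struct InProgressStrings {
    pending: Vec<Arc<SimpleStringArray>>,
    rows: usize,
    bytes: usize,
}

impl InProgressStrings {
    pub fn push(&mut self, array: Arc<SimpleStringArray>) {
        self.rows += array.len();
        self.bytes += array.values.len();
        self.pending.push(array);
    }

    /// Allocates each output buffer exactly once, then copies every input once.
    pub fn finish(&mut self) -> SimpleStringArray {
        let mut offsets = Vec::with_capacity(self.rows + 1);
        let mut values = Vec::with_capacity(self.bytes);
        offsets.push(0i32);
        for array in self.pending.drain(..) {
            // Assumes each input's offsets start at 0 (no sliced inputs) and
            // that the total size fits in i32; real code must check both.
            let base = values.len() as i32;
            values.extend_from_slice(&array.values);
            offsets.extend(array.offsets[1..].iter().map(|o| base + o));
        }
        self.rows = 0;
        self.bytes = 0;
        SimpleStringArray { offsets, values }
    }
}
```

Because the total row and byte counts are known before `finish` runs, the output buffers never need to grow and be re-copied while the array is being built. The trade-off in this sketch is that the input arrays stay alive until `finish` is called.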
Additional context
- The use case is described in detail in Optimize take/filter/concat from multiple input arrays to a single large output array #6692