
GH-7686: [Parquet] Fix int96 min/max stats #7687


Draft: rahulketch wants to merge 17 commits into main from add-tests-for-int-96-stats

Conversation


@rahulketch commented Jun 17, 2025

Which issue does this PR close?

Closes #7686.

Rationale for this change

int96 min/max statistics emitted by arrow-rs are incorrect.

What changes are included in this PR?

  1. Fix the int96 stats
  2. Add round-trip test to verify the behavior

Not included in this PR:

  1. Read stats only from known good writers. This will be implemented after a new arrow-rs release.

Are there any user-facing changes?

The int96 min/max statistics will be different and correct.

@github-actions bot added the parquet (Changes to the parquet crate) label on Jun 17, 2025
@rahulketch changed the title from "GH-7686 [Parquet] Fix int96 min/max stats" to "GH-7686: [Parquet] Fix int96 min/max stats" on Jun 17, 2025
@@ -33,7 +33,7 @@ use crate::util::bit_util::FromBytes;

 /// Rust representation for logical type INT96, value is backed by an array of `u32`.
 /// The type only takes 12 bytes, without extra padding.
-#[derive(Clone, Copy, Debug, PartialOrd, Default, PartialEq, Eq)]
+#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
 pub struct Int96 {
     value: [u32; 3],
@alkis Jun 17, 2025

If we store days i32 and nanos i64 here, I feel the rest of the code will be a lot simpler.

Contributor

Doesn't this change the alignment of Int96 from 4 bytes to 8 bytes? What are the broader implications of that, particularly in an array/vector of Int96? Won't there be 4 bytes of padding?

Author

I did make the change in commit 6036398, but I am not sure if it's a good approach. @alkis: what do you think?


To preserve alignment to a 4-byte boundary we can use #[repr(packed)] and let the compiler do the bit fiddling.

Contributor

That will lead to misaligned i64 fields. From the mailing list it seems like this is a fix for legacy data while trying to convince the Spark crowd to stop producing INT96 timestamps. As such, I'd opt for the minimum set of changes needed and leave value as it was.

@rahulketch force-pushed the add-tests-for-int-96-stats branch 3 times, most recently from ede2b9a to 63a5fd5, on June 17, 2025 at 15:55
@rahulketch force-pushed the add-tests-for-int-96-stats branch from 63a5fd5 to 6036398 on June 17, 2025 at 16:07
@etseidl (Contributor) commented Jun 17, 2025

I tend to agree with @emkornfield (apache/parquet-java#3243 (comment)) that this is a bit of putting the cart before the horse. The sort order for INT96 is currently undefined so statistics should be ignored. I think we need changes to the Parquet spec before proceeding with this.

@etseidl added the enhancement (Any new improvement worthy of an entry in the changelog), api-change (Changes to the arrow API), and next-major-release (the PR has API changes and is waiting on the next major version) labels on Jun 17, 2025
@emkornfield (Contributor)

I think the plan is to also open a PR for the spec change, IIUC @alkis or @rahulketch will propose a PR there and start discussion thread (this second implementation is to fulfill the two implementation requirement).

@etseidl (Contributor) commented Jun 17, 2025

> I think the plan is to also open a PR for the spec change, IIUC @alkis or @rahulketch will propose a PR there and start discussion thread (this second implementation is to fulfill the two implementation requirement).

Any objection to marking this "draft" then?

@etseidl marked this pull request as draft on June 17, 2025 at 19:29
 }

-/// Returns underlying data as slice of [`u32`].
+/// Returns underlying data as slice of [`u32`] for compatibility with Parquet format
 #[inline]
 pub fn data(&self) -> &[u32] {
@alkis Jun 18, 2025

Can we change set_data to take a slice of u8? Then we can read the first 8 bytes as little endian nanos and the last 4 bytes as little endian julian date. We should remove data and make it do the inverse and output into a slice of u8.

What is done here today is wrong in terms of endianness.

// 1. The memory layout is compatible (12 bytes total)
// 2. The alignment requirements are met (u32 requires 4-byte alignment)
// 3. We maintain the invariant that the bytes are always valid u32s
unsafe { std::slice::from_raw_parts(self as *const Int96 as *const u32, 3) }

This is not safe because the ordering of Rust structs is not guaranteed.

@alamb (Contributor) commented Jun 19, 2025

Thank you @rahulketch

Here is a related issue/PR in Spark to stop writing INT96 timestamps.

I am kind of confused about the current status of Int96 -- the parquet spec says they are deprecated but spark keeps writing them and this PR (and others) seem to imply Spark / Databricks plans to keep writing INT96 timestamps indefinitely.

Here is a related mailing list discussion on this topic: https://lists.apache.org/thread/6fm50b3pmh6mz659jb5wx5vzmvwccz1n

As @emkornfield pointed out on that discussion, the spec explicitly says the sort order for INT96 types is undefined:

https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L1079

Perhaps we should also update the spec to reflect whatever is desired as part of changing the parquet writers?

@emkornfield (Contributor) left a comment

Sorry if I missed it, but does parquet-rs already have an allow/deny list to filter int96 stats when reading them? If not, it seems like that needs to be added to the PR (or we can wait to resolve the higher-level concerns on the mailing list).

@alamb (Contributor) commented Jun 20, 2025

> Sorry if I missed it, but does parquet-rs already have an allow/deny list to filter int96 stats when reading them? If not, it seems like that needs to be added to the PR (or we can wait to resolve the higher-level concerns on the mailing list).

I am not aware of any such feature, but I am not quite sure what an allow/deny list means

The low level API will return the values as written, see https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html

Specifically it will return Statistics::Int96(ValueStatistics<Int96>) I believe

@etseidl (Contributor) commented Jun 20, 2025

I think @emkornfield means: will parquet-rs ignore INT96 statistics on read, and not write them at all? The sort order is undefined, so I think any behavior, so long as it's consistent, is ok per the spec. But, as with many things, I think there's an inconsistency between the arrow and record APIs. It seems the arrow API will refuse to write INT96:

ColumnWriter::Int96ColumnWriter(ref mut _typed) => {
    unreachable!("Currently unreachable because data type not supported")
}

AFAICT the record API will write them, and the statistics written will be ordered as Vec<u32>, which is not what's desired here (see this test for instance).

On read I believe both will treat statistics "properly" (i.e. the Int96 type will be interpreted as little endian int96, with 4 byte days followed by 8 byte nanos), but the arrow API will promptly cast to some type of timestamp or error.

In the short term it might be best to have this crate mimic parquet-java and ignore INT96 statistics if present and refuse to write them at all. We can revisit this PR if the community comes to a consensus and un-deprecates the type, or at least standardizes it rather than relying on Spark's or Impala's implementation.

@emkornfield (Contributor)

> I am not aware of any such feature, but I am not quite sure what an allow/deny list means

@etseidl had a good summary (https://github.com/apache/parquet-java/pull/3243/files#r2155295422). Effectively we want to look at writer versions to determine whether statistics are valid. See the corresponding Java PR as an example.

Labels
api-change (Changes to the arrow API), enhancement (Any new improvement worthy of an entry in the changelog), next-major-release (the PR has API changes and is waiting on the next major version), parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues.

Parquet: Incorrect min/max stats for int96 columns
6 participants