
GH-7686: [Parquet] Fix int96 min/max stats #7687


Draft: rahulketch wants to merge 17 commits into main from add-tests-for-int-96-stats

Conversation


@rahulketch commented Jun 17, 2025

Which issue does this PR close?

Closes #7686.

Rationale for this change

int96 min/max statistics emitted by arrow-rs are incorrect.

What changes are included in this PR?

  1. Fix the int96 stats
  2. Add round-trip test to verify the behavior

Not included in this PR:

  1. Read stats only from known good writers. This will be implemented after a new arrow-rs release.

Are there any user-facing changes?

The int96 min/max statistics will be different and correct.

@github-actions bot added the parquet (Changes to the parquet crate) label on Jun 17, 2025
@rahulketch changed the title from "GH-7686 [Parquet] Fix int96 min/max stats" to "GH-7686: [Parquet] Fix int96 min/max stats" on Jun 17, 2025
@@ -33,7 +33,7 @@ use crate::util::bit_util::FromBytes;

 /// Rust representation for logical type INT96, value is backed by an array of `u32`.
 /// The type only takes 12 bytes, without extra padding.
-#[derive(Clone, Copy, Debug, PartialOrd, Default, PartialEq, Eq)]
+#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
 pub struct Int96 {
     value: [u32; 3],
@alkis Jun 17, 2025

If we store days i32 and nanos i64 here, I feel the rest of the code will be a lot simpler.

Contributor

Doesn't this change the alignment of Int96 from 4 bytes to 8 bytes? What are the broader implications of that, particularly in an array/vector of Int96? Won't there be 4 bytes of padding?

Author

I did make the change in commit 6036398, but I am not sure if it's a good approach. @alkis: what do you think?


To preserve alignment to a 4-byte boundary we can use #[repr(packed)] and let the compiler do the bit fiddling.

Contributor

That will lead to misaligned i64 fields. From the mailing list it seems like this is a fix for legacy data while trying to convince the Spark crowd to stop producing INT96 timestamps. As such, I'd opt for the minimum set of changes needed and leave value as it was.

@rahulketch force-pushed the add-tests-for-int-96-stats branch 3 times, most recently from ede2b9a to 63a5fd5, on June 17, 2025 at 15:55
@rahulketch force-pushed the add-tests-for-int-96-stats branch from 63a5fd5 to 6036398 on June 17, 2025 at 16:07
@etseidl (Contributor) commented Jun 17, 2025

I tend to agree with @emkornfield (apache/parquet-java#3243 (comment)) that this is a bit of putting the cart before the horse. The sort order for INT96 is currently undefined so statistics should be ignored. I think we need changes to the Parquet spec before proceeding with this.

@etseidl added the enhancement (Any new improvement worthy of an entry in the changelog), api-change (Changes to the arrow API), and next-major-release (the PR has API changes and is waiting on the next major version) labels on Jun 17, 2025
@emkornfield (Contributor)

I think the plan is to also open a PR for the spec change, IIUC @alkis or @rahulketch will propose a PR there and start discussion thread (this second implementation is to fulfill the two implementation requirement).

@etseidl (Contributor) commented Jun 17, 2025

> I think the plan is to also open a PR for the spec change, IIUC @alkis or @rahulketch will propose a PR there and start discussion thread (this second implementation is to fulfill the two implementation requirement).

Any objection to marking this "draft" then?

@etseidl marked this pull request as draft on June 17, 2025 at 19:29
 }

-/// Returns underlying data as slice of [`u32`].
+/// Returns underlying data as slice of [`u32`] for compatibility with Parquet format
 #[inline]
 pub fn data(&self) -> &[u32] {
@alkis Jun 18, 2025

Can we change set_data to take a slice of u8? Then we can read the first 8 bytes as little endian nanos and the last 4 bytes as little endian julian date. We should remove data and make it do the inverse and output into a slice of u8.

What is done here today is wrong in terms of endianness.

// 1. The memory layout is compatible (12 bytes total)
// 2. The alignment requirements are met (u32 requires 4-byte alignment)
// 3. We maintain the invariant that the bytes are always valid u32s
unsafe { std::slice::from_raw_parts(self as *const Int96 as *const u32, 3) }

This is not safe because the ordering of Rust structs is not guaranteed.

@alamb (Contributor) commented Jun 19, 2025

Thank you @rahulketch

Here is a related issue/PR in Spark to stop writing INT96 timestamps.

I am kind of confused about the current status of Int96 -- the parquet spec says they are deprecated but spark keeps writing them and this PR (and others) seem to imply Spark / Databricks plans to keep writing INT96 timestamps indefinitely.

Here is a related mailing list discussion on this topic: https://lists.apache.org/thread/6fm50b3pmh6mz659jb5wx5vzmvwccz1n

As @emkornfield pointed out on that discussion, the spec explicitly says the sort order for INT96 types is undefined:

https://github.com/apache/parquet-format/blob/87f2c8bf77eefb4c43d0ebaeea1778bd28ac3609/src/main/thrift/parquet.thrift#L1079

Perhaps we should also update the spec to reflect whatever is desired as part of changing the parquet writers?

@emkornfield (Contributor) left a comment

Sorry if I missed it, but does parquet-rs already have an allow/deny list to filter int96 stats when reading them? If not, it seems like that needs to be added to the PR (or we can wait to resolve the higher-level concerns on the mailing list).

@alamb (Contributor) commented Jun 20, 2025

> Sorry if I missed it, but does parquet-rs already have an allow/deny list to filter int96 stats when reading them? If not, it seems like that needs to be added to the PR (or we can wait to resolve the higher-level concerns on the mailing list).

I am not aware of any such feature, but I am not quite sure what an allow/deny list means

The low level API will return the values as written, see https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html

Specifically it will return Statistics::Int96(ValueStatistics<Int96>) I believe

@etseidl (Contributor) commented Jun 20, 2025

I think @emkornfield means: will parquet-rs ignore INT96 statistics on read, and not write them at all? The sort order is undefined, so I think any behavior, so long as it's consistent, is ok per the spec. But, as with many things, I think there's an inconsistency between the arrow and record APIs. It seems the arrow API will refuse to write INT96:

ColumnWriter::Int96ColumnWriter(ref mut _typed) => {
    unreachable!("Currently unreachable because data type not supported")
}

AFAICT the record API will write them, and the statistics written will be ordered as Vec<u32>, which is not what's desired here (see this test for instance).

On read I believe both will treat statistics "properly" (i.e. the Int96 type will be interpreted as little endian int96, with 4 byte days followed by 8 byte nanos), but the arrow API will promptly cast to some type of timestamp or error.

In the short term it might be best to have this crate mimic parquet-java and ignore INT96 statistics if present and refuse to write them at all. We can revisit this PR if the community comes to a consensus and un-deprecates the type, or at least standardizes it rather than relying on Spark's or Impala's implementation.

@emkornfield (Contributor)

> I am not aware of any such feature, but I am not quite sure what an allow/deny list means

@etseidl had a good summary (https://github.com/apache/parquet-java/pull/3243/files#r2155295422). Effectively we want to look at writer versions to determine whether statistics are valid. See the corresponding Java PR as an example.

Labels
api-change (Changes to the arrow API), enhancement (Any new improvement worthy of an entry in the changelog), next-major-release (the PR has API changes and is waiting on the next major version), parquet (Changes to the parquet crate)

Successfully merging this pull request may close these issues.

Parquet: Incorrect min/max stats for int96 columns
6 participants