GH-7686: [Parquet] Fix int96 min/max stats #7687
Conversation
parquet/src/data_type.rs
Outdated
@@ -33,7 +33,7 @@ use crate::util::bit_util::FromBytes;

 /// Rust representation for logical type INT96, value is backed by an array of `u32`.
 /// The type only takes 12 bytes, without extra padding.
-#[derive(Clone, Copy, Debug, PartialOrd, Default, PartialEq, Eq)]
+#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
 pub struct Int96 {
     value: [u32; 3],
If we store days i32 and nanos i64 here, I feel the rest of the code will be a lot simpler.
Doesn't this change the alignment of Int96 from 4 bytes to 8 bytes? What are the broader implications of that, particularly in an array/vector of Int96? Won't there be 4 bytes of padding?
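To make the padding question concrete, here is a minimal sketch that compares the two layouts with `size_of` and `align_of`. The type names `Int96Words` and `Int96Split` are hypothetical stand-ins and are not taken from the PR; the numbers in the comments assume a target where `i64` is 8-byte aligned (for example x86_64).

```rust
use std::mem::{align_of, size_of};

/// Stand-in for the existing representation: three 32-bit words.
#[allow(dead_code)]
struct Int96Words {
    value: [u32; 3],
}

/// Hypothetical alternative from the suggestion above: separate day and nanos fields.
#[allow(dead_code)]
struct Int96Split {
    days: i32,
    nanos: i64,
}

fn main() {
    // Three u32 words: 12 bytes, 4-byte alignment, no padding.
    println!(
        "words: size={} align={}",
        size_of::<Int96Words>(),
        align_of::<Int96Words>()
    );
    // With an 8-byte aligned i64 the struct is rounded up to 16 bytes.
    println!(
        "split: size={} align={}",
        size_of::<Int96Split>(),
        align_of::<Int96Split>()
    );
    // Array elements are laid out at multiples of the element size,
    // so any per-value padding is paid for every element.
    println!(
        "1000 split values occupy {} bytes",
        size_of::<[Int96Split; 1000]>()
    );
}
```

On such a target every element of a `Vec` of the split layout would carry 4 bytes of padding, which is the overhead the question is about.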
To preserve alignment to a 4-byte boundary we can use #[repr(packed)] and let the compiler do the bit fiddling.
That will lead to misaligned i64 fields. From the mailing list it seems like this is a fix for legacy data while trying to convince the Spark crowd to stop producing INT96 timestamps. As such, I'd opt for the minimum set of changes needed and leave value as it was.
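For reference, a minimal sketch of what the packed variant might look like and where the misalignment shows up. `PackedInt96` and `nanos_of` are hypothetical names, and the field order is an assumption; nothing here is taken from the PR.

```rust
use std::mem::{align_of, size_of};

/// Hypothetical packed layout: 12 bytes with 4-byte alignment.
#[repr(C, packed(4))]
#[derive(Clone, Copy)]
struct PackedInt96 {
    nanos: i64,
    days: i32,
}

/// Reads the nanos field by value.
fn nanos_of(v: &PackedInt96) -> i64 {
    // Copying the field out is allowed; the compiler emits an unaligned-safe load.
    // Taking a reference (`&v.nanos`) is rejected on targets where i64 needs
    // 8-byte alignment, because the field is only guaranteed 4-byte alignment.
    v.nanos
}

fn main() {
    let v = PackedInt96 { nanos: 42, days: 7 };
    let days = v.days; // copy out before use, for the same reason as above
    println!(
        "size={} align={} nanos={} days={}",
        size_of::<PackedInt96>(),
        align_of::<PackedInt96>(),
        nanos_of(&v),
        days
    );
}
```

The size and alignment match the three-word array, but every access to `nanos` has to go through a copy, which is part of why keeping `value` as it was is the smaller change.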
ede2b9a to 63a5fd5
63a5fd5 to 6036398
I tend to agree with @emkornfield (apache/parquet-java#3243 (comment)) that this is a bit of putting the cart before the horse. The sort order for
I think the plan is to also open a PR for the spec change. IIUC, @alkis or @rahulketch will propose a PR there and start a discussion thread (this second implementation is to fulfill the two-implementation requirement).
Any objection to marking this "draft" then?
     }

-    /// Returns underlying data as slice of [`u32`].
+    /// Returns underlying data as slice of [`u32`] for compatibility with Parquet format
     #[inline]
     pub fn data(&self) -> &[u32] {
Can we change set_data to take a slice of u8? Then we can read the first 8 bytes as little-endian nanos and the last 4 bytes as the little-endian Julian date. We should replace data with a method that does the inverse and writes into a slice of u8.
What is done here today is wrong in terms of endianness.
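A minimal sketch of the byte-oriented conversion described in this comment, assuming the layout stated there (8 bytes of little-endian nanoseconds followed by 4 bytes of little-endian Julian day). The function names are hypothetical and are not part of the parquet crate.

```rust
/// Hypothetical helper: splits 12 bytes into (nanos, julian_day).
fn int96_from_le_bytes(bytes: [u8; 12]) -> (i64, i32) {
    let mut nanos = [0u8; 8];
    let mut day = [0u8; 4];
    nanos.copy_from_slice(&bytes[..8]);
    day.copy_from_slice(&bytes[8..]);
    // from_le_bytes gives the same result on little- and big-endian hosts.
    (i64::from_le_bytes(nanos), i32::from_le_bytes(day))
}

/// Hypothetical inverse: writes (nanos, julian_day) back into 12 bytes.
fn int96_to_le_bytes(nanos: i64, julian_day: i32) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[..8].copy_from_slice(&nanos.to_le_bytes());
    out[8..].copy_from_slice(&julian_day.to_le_bytes());
    out
}

fn main() {
    let bytes = int96_to_le_bytes(1_000_000_000, 2_440_588);
    let (nanos, day) = int96_from_le_bytes(bytes);
    println!("nanos={nanos} julian_day={day}");
}
```

Going through explicit little-endian conversions keeps the result independent of the host byte order, which is the endianness concern raised above.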
        // 1. The memory layout is compatible (12 bytes total)
        // 2. The alignment requirements are met (u32 requires 4-byte alignment)
        // 3. We maintain the invariant that the bytes are always valid u32s
        unsafe { std::slice::from_raw_parts(self as *const Int96 as *const u32, 3) }
This is not safe because the field ordering of Rust structs is not guaranteed.
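One way to sidestep the layout question, sketched under the assumption that the three-word representation is kept: `data` can return a borrow of the array directly, with no `unsafe`. The `Int96` below is a stand-in that mirrors the diff, not the crate's actual definition, and the `set_data` signature is an assumption for illustration.

```rust
/// Stand-in type mirroring the struct in the diff above.
#[derive(Clone, Copy, Debug, Default, PartialEq, Eq)]
pub struct Int96 {
    value: [u32; 3],
}

impl Int96 {
    /// Stores the three 32-bit words (signature assumed for this sketch).
    pub fn set_data(&mut self, elem0: u32, elem1: u32, elem2: u32) {
        self.value = [elem0, elem1, elem2];
    }

    /// Returns the underlying words as a slice without any pointer casts.
    #[inline]
    pub fn data(&self) -> &[u32] {
        &self.value
    }
}

fn main() {
    let mut v = Int96::default();
    v.set_data(1, 2, 3);
    println!("{:?}", v.data());
}
```

If the fields were split into separate members instead, `#[repr(C)]` would be needed to pin their order before any cast like the one above could be justified.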
Thank you @rahulketch

Here is a related issue and PR in Spark to stop writing INT96 timestamps.

I am kind of confused about the current status of Int96: the Parquet spec says they are deprecated, but Spark keeps writing them, and this PR (and others) seem to imply Spark / Databricks plans to keep writing INT96 timestamps indefinitely.

Here is a related mailing list discussion on this topic: https://lists.apache.org/thread/6fm50b3pmh6mz659jb5wx5vzmvwccz1n

As @emkornfield pointed out in that discussion, the spec explicitly says the sort order for INT96 types is undefined.

Perhaps we should also update the spec to reflect whatever is desired as part of changing the Parquet writers?
Sorry if I missed it, but does the Rust implementation already have an allow/deny list filter for INT96 stats when reading them? If not, it seems like that needs to be added to the PR (or we can wait to resolve the higher-level concerns on the mailing list).
I am not aware of any such feature, but I am not quite sure what an
The low level API will return the values as written, see https://docs.rs/parquet/latest/parquet/file/statistics/enum.Statistics.html Specifically it will return
I think @emkornfield means will parquet-rs ignore arrow-rs/parquet/src/arrow/arrow_writer/mod.rs Lines 1141 to 1143 in 1bed04c
AFAICT the record API will, and the statistics written will be ordered as On read I believe both will treat statistics "properly" (i.e. the In the short term it might be best to have this crate mimic parquet-java and ignore
@etseidl had a good summary, https://github.com/apache/parquet-java/pull/3243/files#r2155295422 effectively we want to look at writer versions to determine if statistics are valid. See the corresponding java PR as an example.
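The idea of gating INT96 statistics on the writer version could be sketched roughly as follows. Everything here is hypothetical: the `TrustedWriter` type, the function names, and the placeholder allow-list entry are for illustration only and are not taken from parquet-java or arrow-rs. The sketch assumes the `created_by` string follows the usual `<application> version <version> (build <hash>)` shape.

```rust
/// Hypothetical allow-list entry: a writer name and the first trusted version.
struct TrustedWriter {
    name: &'static str,
    min_version: (u32, u32, u32),
}

/// Placeholder entry only; a real list would come out of the spec discussion.
const TRUSTED_WRITERS: &[TrustedWriter] = &[TrustedWriter {
    name: "example-writer",
    min_version: (2, 0, 0),
}];

/// Parses "<name> version <major>.<minor>.<patch> ..." into its parts.
fn parse_created_by(created_by: &str) -> Option<(&str, (u32, u32, u32))> {
    let (name, rest) = created_by.split_once(" version ")?;
    let version = rest.split_whitespace().next()?;
    let mut parts = version.split('.').map(|p| p.parse::<u32>().ok());
    Some((name, (parts.next()??, parts.next()??, parts.next()??)))
}

/// Returns true only when the writer is on the allow-list at a new enough version.
fn int96_stats_trusted(created_by: Option<&str>) -> bool {
    match created_by.and_then(|s| parse_created_by(s)) {
        Some((name, version)) => TRUSTED_WRITERS
            .iter()
            .any(|w| w.name == name && version >= w.min_version),
        // Missing or unparseable created_by: ignore the statistics.
        None => false,
    }
}

fn main() {
    let created_by = Some("example-writer version 2.1.0 (build abc123)");
    println!("trusted: {}", int96_stats_trusted(created_by));
}
```

Defaulting to "not trusted" when the string is missing or malformed keeps readers conservative, in the spirit of ignoring the statistics until the writer is known to be correct.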
Which issue does this PR close?
Rationale for this change
int96 min/max statistics emitted by arrow-rs are incorrect.
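A minimal sketch of why a derived ordering on the three words can produce wrong min/max values, assuming the layout discussed in the review (nanoseconds in the first two words, Julian day in the last). The `Int96` type and its accessors below are stand-ins written for this illustration, not the code in this PR.

```rust
use std::cmp::Ordering;

/// Stand-in type: value[0..2] hold nanos (low word first), value[2] the Julian day.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Int96 {
    value: [u32; 3],
}

impl Int96 {
    fn julian_day(&self) -> i32 {
        self.value[2] as i32
    }

    fn nanos(&self) -> i64 {
        ((self.value[1] as i64) << 32) | self.value[0] as i64
    }
}

impl Ord for Int96 {
    fn cmp(&self, other: &Self) -> Ordering {
        // Compare the day first, then the time within the day.
        self.julian_day()
            .cmp(&other.julian_day())
            .then(self.nanos().cmp(&other.nanos()))
    }
}

impl PartialOrd for Int96 {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}

fn main() {
    let earlier = Int96 { value: [5, 0, 1] }; // day 1, 5 ns
    let later = Int96 { value: [0, 0, 2] }; // day 2, 0 ns
    // Chronological comparison: earlier < later.
    println!("chronological: {:?}", earlier.cmp(&later));
    // Word-by-word comparison starts with the low nanos word and gets it backwards.
    println!("lexicographic: {:?}", earlier.value.cmp(&later.value));
}
```

A derived `PartialOrd` compares `value[0]` first, so the least significant nanosecond bits dominate and the day is only consulted on ties.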
What changes are included in this PR?
Not included in this PR:
Are there any user-facing changes?
The int96 min/max statistics will be different and correct.