Skip to content

GH-3242: Emit and Read min/max statistics for int96 timestamp columns #3243

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 8 commits into
base: master
Choose a base branch
from

Conversation

rahulketch
Copy link
Contributor

@rahulketch rahulketch commented Jun 12, 2025

Rationale for this change

Parquet-java does not emit or read stats for int96 timestamp columns. Since int96 is used as the default timestamp in spark, this limits a lot of optimization opportunities. Engines like Photon populate the statistics for the int96 timestamps correctly. So parquet-java can also emit the statistics, and also allow reading these statistics from known good writers.

What changes are included in this PR?

  1. Defining column ordering for int96
  2. Implementation of comparator for int96
  3. A flag parquet.read.int96stats.enabled to control if the stats are read. It is defaulted to true
  4. ValidInt96Stats: Reads stats from known good writers. Currently including:
    a. parquet-mr
    b. photon
  5. Fixing tests which fail due to this change

Are these changes tested?

  1. Added more unit tests for the new behaviour

Are there any user-facing changes?

  1. toParquetStatistics and fromParquetStatistics are no longer static functions. It doesn't look like there is a good reason for these functions to be static.
  2. Signatures of some of the constructors of ParquetMetadataConverter is not different due to an added parameter.

Closes #3242

@emkornfield
Copy link
Contributor

It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this.

bb2.order(java.nio.ByteOrder.LITTLE_ENDIAN);
int jd1 = bb1.getInt(8);
int jd2 = bb2.getInt(8);
if (jd1 != jd2) return Integer.compareUnsigned(jd1, jd2) < 0 ? -1 : 1;
Copy link
Contributor

@emkornfield emkornfield Jun 13, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shouldn't this be signed comparison? Year 0001-01-01 is in the wild, and I think that would be a negative value for days since julien epoch?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind, I think this is actually counting from something pre-year 1.

ByteBuffer bb1 = b1.toByteBuffer();
ByteBuffer bb2 = b2.toByteBuffer();
bb1.order(java.nio.ByteOrder.LITTLE_ENDIAN);
bb2.order(java.nio.ByteOrder.LITTLE_ENDIAN);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this logic should probably be commented.

/*
* This comparator is for comparing two timestamps represented as int96 binary.
*/
static final PrimitiveComparator<Binary> BINARY_AS_INT_96_COMPARATOR = new BinaryComparator() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be AS_INT96_TIMESTAMP? This is not a comparison that would work for int96.

try {
ParsedVersion version = VersionParser.parse(createdBy);
if ("parquet-mr".equals(version.application)) {
return true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't this cause all previous stats to be valid? Do we know that is the case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously parquet-mr does not emit stats for int96 columns. After this PR is merged int96 stats will be emitted so they're safe to read.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Were int96 stats ever emitted? I don't remember.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Emit and Read min/max statistics for int96 timestamp columns
3 participants