-
Notifications
You must be signed in to change notification settings - Fork 1.5k
GH-3242: Emit and Read min/max statistics for int96 timestamp columns #3243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
It looks like Alkis started a discussion on the ML, but we probably want to come to a consensus and update https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079 first before merging this. |
bb2.order(java.nio.ByteOrder.LITTLE_ENDIAN); | ||
int jd1 = bb1.getInt(8); | ||
int jd2 = bb2.getInt(8); | ||
if (jd1 != jd2) return Integer.compareUnsigned(jd1, jd2) < 0 ? -1 : 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this be signed comparison? Year 0001-01-01 is in the wild, and I think that would be a negative value for days since julien epoch?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Never mind, I think this is actually counting from something pre-year 1.
ByteBuffer bb1 = b1.toByteBuffer(); | ||
ByteBuffer bb2 = b2.toByteBuffer(); | ||
bb1.order(java.nio.ByteOrder.LITTLE_ENDIAN); | ||
bb2.order(java.nio.ByteOrder.LITTLE_ENDIAN); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this logic should probably be commented.
/* | ||
* This comparator is for comparing two timestamps represented as int96 binary. | ||
*/ | ||
static final PrimitiveComparator<Binary> BINARY_AS_INT_96_COMPARATOR = new BinaryComparator() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be AS_INT96_TIMESTAMP
? This is not a comparison that would work for int96.
try { | ||
ParsedVersion version = VersionParser.parse(createdBy); | ||
if ("parquet-mr".equals(version.application)) { | ||
return true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this cause all previous stats to be valid? Do we know that is the case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Previously parquet-mr does not emit stats for int96 columns. After this PR is merged int96 stats will be emitted so they're safe to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Were int96 stats ever emitted? I don't remember.
Rationale for this change
Parquet-java does not emit or read stats for int96 timestamp columns. Since int96 is used as the default timestamp in spark, this limits a lot of optimization opportunities. Engines like Photon populate the statistics for the int96 timestamps correctly. So parquet-java can also emit the statistics, and also allow reading these statistics from known good writers.
What changes are included in this PR?
parquet.read.int96stats.enabled
to control if the stats are read. It is defaulted to trueValidInt96Stats
: Reads stats from known good writers. Currently including:a.
parquet-mr
b.
photon
Are these changes tested?
Are there any user-facing changes?
toParquetStatistics
andfromParquetStatistics
are no longer static functions. It doesn't look like there is a good reason for these functions to be static.ParquetMetadataConverter
is not different due to an added parameter.Closes #3242