Skip to content

Improve dictionary null handling in hashing and expand aggregate test coverage for nulls #16466

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 48 commits into
base: main
Choose a base branch
from

Conversation

kosiew
Copy link
Contributor

@kosiew kosiew commented Jun 20, 2025

Which issue does this PR close?

Rationale for this change

This change addresses a bug where combine_hashes was applied even if a dictionary value was null, leading to incorrect hash computations.
This was discovered while investigating #16266
Additionally, this PR extends the test coverage for aggregate functions to better validate behavior with dictionary arrays containing nulls.

What changes are included in this PR?

  • Fixes logic in hash_dictionary to ensure combine_hashes is only applied when the dictionary value is valid.
  • Corrects grammar in error messages for dataset generation expectations.
  • Enables null value generation in fuzz tests for dictionary arrays.
  • Adds comprehensive tests for aggregate functions (COUNT, SUM, MIN, MAX, MEDIAN, FIRST_VALUE, LAST_VALUE) using dictionary arrays with null keys and values.
  • Ensures consistent behavior across single and multi-partition execution.

Are these changes tested?

Yes, extensive new tests are added covering:

  • Aggregates on dictionary columns with null keys/values.
  • Window functions with null handling (IGNORE/RESPECT NULLS).
  • Partitioned vs. unpartitioned execution consistency.

Are there any user-facing changes?

No direct API changes, but query behavior involving dictionary arrays with nulls will now produce correct and consistent results in line with SQL semantics.

kosiew added 30 commits June 17, 2025 20:29
…h GROUP BY on dictionary columns, including null keys and values
…h GROUP BY on dictionary columns, including single and multiple partition scenarios
…s in dictionary columns across various aggregate functions
…c types and dictionary columns containing nulls
@github-actions github-actions bot added core Core DataFusion crate common Related to common crate labels Jun 20, 2025
@kosiew
Copy link
Contributor Author

kosiew commented Jun 20, 2025

image

datafusion/common/src/hash_utils.rs | 50 +-
.../tests/fuzz_cases/aggregation_fuzzer/fuzzer.rs | 8 +-
.../tests/fuzz_cases/record_batch_generator.rs | 6 +-

datafusion/core/tests/sql/aggregates.rs | 1442 ++++++++++++++++++++
4 files changed, 1483 insertions(+), 23 deletions(-)

In working on #16266, there was initially a lot of fuzz failures.
I added a lot of tests to reproduce the failures and the bulk of the changes in this PR are the additional tests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update Fuzz tests to include Dict with null values
1 participant