Improve dictionary null handling in hashing and expand aggregate test coverage for nulls #16458

Closed
wants to merge 7,968 commits into from

Conversation

kosiew
Contributor

@kosiew kosiew commented Jun 19, 2025

Which issue does this PR close?

Rationale for this change

This change fixes a bug in which combine_hashes was applied even when a dictionary value was null, producing incorrect hashes.
The bug was discovered while investigating #16266.
Additionally, this PR extends test coverage for aggregate functions to better validate their behavior on dictionary arrays containing nulls.

What changes are included in this PR?

  • Fixes logic in hash_dictionary to ensure combine_hashes is only applied when the dictionary value is valid.
  • Corrects grammar in error messages for dataset generation expectations.
  • Enables null value generation in fuzz tests for dictionary arrays.
  • Adds comprehensive tests for aggregate functions (COUNT, SUM, MIN, MAX, MEDIAN, FIRST_VALUE, LAST_VALUE) using dictionary arrays with null keys and values.
  • Ensures consistent behavior across single and multi-partition execution.
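
To illustrate the first bullet, here is a minimal, std-only sketch of the guarded combine. It is not the DataFusion implementation: the slice-based signature of `hash_dictionary`, the `hash_str` helper, and the use of `DefaultHasher` are assumptions made for illustration, and `combine_hashes` here is only a stand-in mixing function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Stand-in mixing function (assumption; not necessarily the real `combine_hashes`).
fn combine_hashes(l: u64, r: u64) -> u64 {
    let hash = (17u64 * 37).wrapping_add(l);
    hash.wrapping_mul(37).wrapping_add(r)
}

/// Hypothetical helper: hash a single string value.
fn hash_str(s: &str) -> u64 {
    let mut h = DefaultHasher::new();
    s.hash(&mut h);
    h.finish()
}

/// Simplified model of a dictionary column: `keys[i]` indexes into `values`,
/// and either the key or the referenced value may be null (`None`).
fn hash_dictionary(keys: &[Option<usize>], values: &[Option<&str>], hashes: &mut [u64]) {
    // Hash each dictionary value once; null values have no hash.
    let value_hashes: Vec<Option<u64>> = values.iter().map(|v| v.map(hash_str)).collect();
    for (hash, key) in hashes.iter_mut().zip(keys) {
        // Combine only when the key is non-null AND the value it points to is non-null.
        if let Some(vh) = key.and_then(|k| value_hashes[k]) {
            *hash = combine_hashes(vh, *hash);
        }
        // Otherwise the row's hash is left untouched, so a null key and a key
        // that points at a null value are treated the same way.
    }
}
```

In this model, a row whose key is `None` and a row whose key references a `None` value both keep their incoming hash, which is the property the fix is described as restoring.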

Are these changes tested?

Yes, extensive new tests are added covering:

  • Aggregates on dictionary columns with null keys/values.
  • Window functions with null handling (IGNORE/RESPECT NULLS).
  • Partitioned vs. unpartitioned execution consistency.

Are there any user-facing changes?

No direct API changes, but query behavior involving dictionary arrays with nulls will now produce correct and consistent results in line with SQL semantics.

the0ninjas and others added 30 commits May 12, 2025 13:48
* Updated extending operators documentation

* commented out Rust code to pass doc test

---------

Co-authored-by: Andrew Lamb <[email protected]>
* feat(proto): udf decoding fallback

* add test case for proto udf decode fallback
* Replace MSRV link on main page with Github badge
Bumps [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) from 0.28.1 to 0.28.2.
- [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases)
- [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md)
- [Commits](risinglightdb/sqllogictest-rs@v0.28.1...v0.28.2)

---
updated-dependencies:
- dependency-name: sqllogictest
  dependency-version: 0.28.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add lint rule to enforce string formatting style

* format

* extra

* Update datafusion/ffi/src/tests/async_provider.rs

Co-authored-by: kosiew <[email protected]>

* Update datafusion/functions/src/datetime/to_date.rs

Co-authored-by: kosiew <[email protected]>

---------

Co-authored-by: kosiew <[email protected]>
…ache#16039)

* Docs: Add example of creating a field in `return_field_from_args`

* fmt

* Update datafusion/expr/src/udf.rs

Co-authored-by: Oleks V <[email protected]>

* fmt

---------

Co-authored-by: Oleks V <[email protected]>
* Fix comparisons between lists that contain nulls

* Add support for lists in min/max agg functions

* Add sqllogictests

* Support lists in window frame target type
When aggregating first/last list over a column of lists, the first/last
accumulators hold the necessary scalar value as is, which points to the
list in the original input buffer.

This results in two issues:

1) We prevent the deallocation of the input arrays which might be
significantly larger than the single value we want to hold.

2) During aggregation with groups, many accumulators receive slices of the
same input buffer, resulting in all held values pointing to this buffer.
Then, when calculating the size of all accumulators we count the buffer
multiple times, since each accumulator considers it to be part of its own
allocation.
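
The two issues above can be modeled with a small std-only sketch. `ListView`, `retained_bytes`, and `compacted` are hypothetical names, not Arrow or DataFusion APIs, and copying the value into its own buffer is shown only as one plausible remedy; the commit message itself does not spell out the fix.

```rust
use std::rc::Rc;

/// Hypothetical stand-in for a list value that is a view into a shared input buffer.
struct ListView {
    buffer: Rc<Vec<i64>>,
    offset: usize,
    len: usize,
}

impl ListView {
    /// The elements this view actually refers to.
    fn values(&self) -> &[i64] {
        &self.buffer[self.offset..self.offset + self.len]
    }

    /// Bytes kept alive by this view: the whole backing buffer, not just the slice.
    /// Summing this over many views of the same buffer counts that buffer repeatedly.
    fn retained_bytes(&self) -> usize {
        self.buffer.len() * std::mem::size_of::<i64>()
    }

    /// Copy only the viewed elements into a fresh buffer so the input can be freed.
    fn compacted(&self) -> ListView {
        let owned: Vec<i64> = self.values().to_vec();
        let len = owned.len();
        ListView { buffer: Rc::new(owned), offset: 0, len }
    }
}
```

Under these assumptions, a three-element view into a 1,000-element buffer reports 8,000 retained bytes, while its compacted copy reports 24 and no longer references the input.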
* Improve docs for Exprs and scalar functions

* fix links
* h2o-window benchmark

* Review: clarify h2o-window is an extended benchmark
* draft commit to rolledback changes on function naming and include prepare clause on the infer types tests

* include data types in plan when it is not included in the prepare statement

* fix: prepare statement error

* Update datafusion/sql/src/statement.rs

Co-authored-by: Andrew Lamb <[email protected]>

* remove infer types from prepare statement

the infer data type changes in statement will be introduced in a new PR

* fix to show correct output message

* remove white space

* Restore the original tests too

---------

Co-authored-by: Andrew Lamb <[email protected]>
* style: simplify some strings for readability

* fix: formatting in `datafusion/` directory

* refactor: replace long `format!` string

* refactor: replace `format!` with `assert_eq!`

---------

Co-authored-by: Andrew Lamb <[email protected]>
* support simple lateral joins

Signed-off-by: Alex Chi Z <[email protected]>

* fix explain test

Signed-off-by: Alex Chi Z <[email protected]>

* plan scalar agg correctly

Signed-off-by: Alex Chi Z <[email protected]>

* add uncorrelated query tests

Signed-off-by: Alex Chi Z <[email protected]>

* fix clippy + fmt

Signed-off-by: Alex Chi Z <[email protected]>

* make rule matching faster

Signed-off-by: Alex Chi Z <[email protected]>

* revert build_join visibility

Signed-off-by: Alex Chi Z <[email protected]>

* revert find plan outer column changes

Signed-off-by: Alex Chi Z <[email protected]>

* remove clone

* address comment

---------

Signed-off-by: Alex Chi Z <[email protected]>
Co-authored-by: Alex Chi Z <[email protected]>
* chore(deps): bump the arrow-parquet group with 7 updates

Bumps the arrow-parquet group with 7 updates:

| Package | From | To |
| --- | --- | --- |
| [arrow](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-buffer](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-flight](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-ipc](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-ord](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [arrow-schema](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |
| [parquet](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` |


Updates `arrow` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](apache/arrow-rs@55.0.0...55.1.0)

Updates `arrow-buffer` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](apache/arrow-rs@55.0.0...55.1.0)

Updates `arrow-flight` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](apache/arrow-rs@55.0.0...55.1.0)

Updates `arrow-ipc` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](apache/arrow-rs@55.0.0...55.1.0)

Updates `arrow-ord` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](apache/arrow-rs@55.0.0...55.1.0)

Updates `arrow-schema` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](apache/arrow-rs@55.0.0...55.1.0)

Updates `parquet` from 55.0.0 to 55.1.0
- [Release notes](https://github.com/apache/arrow-rs/releases)
- [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md)
- [Commits](apache/arrow-rs@55.0.0...55.1.0)

---
updated-dependencies:
- dependency-name: arrow
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-buffer
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-flight
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-ipc
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-ord
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: arrow-schema
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
- dependency-name: parquet
  dependency-version: 55.1.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
  dependency-group: arrow-parquet
...

Signed-off-by: dependabot[bot] <[email protected]>

* Update sqllogictest results

---------

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Andrew Lamb <[email protected]>
Bumps [petgraph](https://github.com/petgraph/petgraph) from 0.7.1 to 0.8.1.
- [Release notes](https://github.com/petgraph/petgraph/releases)
- [Changelog](https://github.com/petgraph/petgraph/blob/master/CHANGELOG.md)
- [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.8.1)

---
updated-dependencies:
- dependency-name: petgraph
  dependency-version: 0.8.1
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add Spark-compatible char expression

* Add slt test
* update window function

* prettier fix

* Update window_functions.md
Bumps [substrait](https://github.com/substrait-io/substrait-rs) from 0.55.1 to 0.56.0.
- [Release notes](https://github.com/substrait-io/substrait-rs/releases)
- [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md)
- [Commits](substrait-io/substrait-rs@v0.55.1...v0.56.0)

---
updated-dependencies:
- dependency-name: substrait
  dependency-version: 0.56.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
`TempDir::into_path` "leaks" the temp dir. This updates the `tempfile`
crate to a version where this method is deprecated and fixes all usages.
@github-actions github-actions bot added documentation Improvements or additions to documentation sql SQL Planner development-process Related to development process of DataFusion logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate catalog Related to the catalog crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation datasource Changes to the datasource crate ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark labels Jun 20, 2025
@kosiew
Contributor Author

kosiew commented Jun 20, 2025

Removing a large file (test_data.txt) from the commit history with git filter-repo messed up this branch.
Closing.

@kosiew kosiew closed this Jun 20, 2025
@kosiew
Contributor Author

kosiew commented Jun 20, 2025

Replaced with #16466

Labels
catalog Related to the catalog crate common Related to common crate core Core DataFusion crate datasource Changes to the datasource crate development-process Related to development process of DataFusion documentation Improvements or additions to documentation execution Related to the execution crate ffi Changes to the ffi crate functions Changes to functions implementation logical-expr Logical plan and expressions optimizer Optimizer rules physical-expr Changes to the physical-expr crates physical-plan Changes to the physical-plan crate proto Related to proto crate spark sql SQL Planner sqllogictest SQL Logic Tests (.slt) substrait Changes to the substrait crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update Fuzz tests to include Dict with null values