Improve dictionary null handling in hashing and expand aggregate test coverage for nulls #16458
Closed
* Updated extending operators documentation * commented out Rust code to pass doc test --------- Co-authored-by: Andrew Lamb <[email protected]>
* feat(proto): udf decoding fallback * add test case for proto udf decode fallback
* Replace MSRV link on main page with Github badge
…CsvExec`, `JsonExec` (apache#16034)
Co-authored-by: Andrew Lamb <[email protected]>
Bumps [sqllogictest](https://github.com/risinglightdb/sqllogictest-rs) from 0.28.1 to 0.28.2. - [Release notes](https://github.com/risinglightdb/sqllogictest-rs/releases) - [Changelog](https://github.com/risinglightdb/sqllogictest-rs/blob/main/CHANGELOG.md) - [Commits](risinglightdb/sqllogictest-rs@v0.28.1...v0.28.2) --- updated-dependencies: - dependency-name: sqllogictest dependency-version: 0.28.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add lint rule to enforce string formatting style * format * extra * Update datafusion/ffi/src/tests/async_provider.rs Co-authored-by: kosiew <[email protected]> * Update datafusion/functions/src/datetime/to_date.rs Co-authored-by: kosiew <[email protected]> --------- Co-authored-by: kosiew <[email protected]>
…ache#16039) * Docs: Add example of creating a field in `return_field_from_args` * fmt * Update datafusion/expr/src/udf.rs Co-authored-by: Oleks V <[email protected]> * fmt --------- Co-authored-by: Oleks V <[email protected]>
* Fix comparisons between lists that contain nulls * Add support for lists in min/max agg functions * Add sqllogictests * Support lists in window frame target type
When aggregating first/last over a column of lists, the first/last accumulators hold the necessary scalar value as is, which points to the list in the original input buffer. This results in two issues: 1) We prevent the deallocation of the input arrays, which might be significantly larger than the single value we want to hold. 2) During aggregation with groups, many accumulators receive slices of the same input buffer, resulting in all held values pointing to this buffer. Then, when calculating the size of all accumulators, we count the buffer multiple times, since each accumulator considers it to be part of its own allocation.
* Improve docs for Exprs and scalar functions * fix links
* h2o-window benchmark * Review: clarify h2o-window is an extended benchmark
Signed-off-by: Ruihang Xia <[email protected]>
* draft commit to rolledback changes on function naming and include prepare clause on the infer types tests * include data types in plan when it is not included in the prepare statement * fix: prepare statement error * Update datafusion/sql/src/statement.rs Co-authored-by: Andrew Lamb <[email protected]> * remove infer types from prepare statement the infer data type changes in statement will be introduced in a new PR * fix to show correct output message * remove white space * Restore the original tests too --------- Co-authored-by: Andrew Lamb <[email protected]>
* style: simplify some strings for readability * fix: formatting in `datafusion/` directory * refactor: replace long `format!` string * refactor: replace `format!` with `assert_eq!` --------- Co-authored-by: Andrew Lamb <[email protected]>
* support simple lateral joins Signed-off-by: Alex Chi Z <[email protected]> * fix explain test Signed-off-by: Alex Chi Z <[email protected]> * plan scalar agg correctly Signed-off-by: Alex Chi Z <[email protected]> * add uncorrelated query tests Signed-off-by: Alex Chi Z <[email protected]> * fix clippy + fmt Signed-off-by: Alex Chi Z <[email protected]> * make rule matching faster Signed-off-by: Alex Chi Z <[email protected]> * revert build_join visibility Signed-off-by: Alex Chi Z <[email protected]> * revert find plan outer column changes Signed-off-by: Alex Chi Z <[email protected]> * remove clone * address comment --------- Signed-off-by: Alex Chi Z <[email protected]> Co-authored-by: Alex Chi Z <[email protected]>
* chore(deps): bump the arrow-parquet group with 7 updates Bumps the arrow-parquet group with 7 updates: | Package | From | To | | --- | --- | --- | | [arrow](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` | | [arrow-buffer](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` | | [arrow-flight](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` | | [arrow-ipc](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` | | [arrow-ord](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` | | [arrow-schema](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` | | [parquet](https://github.com/apache/arrow-rs) | `55.0.0` | `55.1.0` | Updates `arrow` from 55.0.0 to 55.1.0 - [Release notes](https://github.com/apache/arrow-rs/releases) - [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md) - [Commits](apache/arrow-rs@55.0.0...55.1.0) Updates `arrow-buffer` from 55.0.0 to 55.1.0 - [Release notes](https://github.com/apache/arrow-rs/releases) - [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md) - [Commits](apache/arrow-rs@55.0.0...55.1.0) Updates `arrow-flight` from 55.0.0 to 55.1.0 - [Release notes](https://github.com/apache/arrow-rs/releases) - [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md) - [Commits](apache/arrow-rs@55.0.0...55.1.0) Updates `arrow-ipc` from 55.0.0 to 55.1.0 - [Release notes](https://github.com/apache/arrow-rs/releases) - [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md) - [Commits](apache/arrow-rs@55.0.0...55.1.0) Updates `arrow-ord` from 55.0.0 to 55.1.0 - [Release notes](https://github.com/apache/arrow-rs/releases) - [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md) - [Commits](apache/arrow-rs@55.0.0...55.1.0) Updates `arrow-schema` from 55.0.0 to 55.1.0 - [Release notes](https://github.com/apache/arrow-rs/releases) - [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md) - 
[Commits](apache/arrow-rs@55.0.0...55.1.0) Updates `parquet` from 55.0.0 to 55.1.0 - [Release notes](https://github.com/apache/arrow-rs/releases) - [Changelog](https://github.com/apache/arrow-rs/blob/main/CHANGELOG-old.md) - [Commits](apache/arrow-rs@55.0.0...55.1.0) --- updated-dependencies: - dependency-name: arrow dependency-version: 55.1.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: arrow-parquet - dependency-name: arrow-buffer dependency-version: 55.1.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: arrow-parquet - dependency-name: arrow-flight dependency-version: 55.1.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: arrow-parquet - dependency-name: arrow-ipc dependency-version: 55.1.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: arrow-parquet - dependency-name: arrow-ord dependency-version: 55.1.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: arrow-parquet - dependency-name: arrow-schema dependency-version: 55.1.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: arrow-parquet - dependency-name: parquet dependency-version: 55.1.0 dependency-type: direct:production update-type: version-update:semver-minor dependency-group: arrow-parquet ... Signed-off-by: dependabot[bot] <[email protected]> * Update sqllogictest results --------- Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Andrew Lamb <[email protected]>
Bumps [petgraph](https://github.com/petgraph/petgraph) from 0.7.1 to 0.8.1. - [Release notes](https://github.com/petgraph/petgraph/releases) - [Changelog](https://github.com/petgraph/petgraph/blob/master/CHANGELOG.md) - [Commits](https://github.com/petgraph/petgraph/compare/[email protected]@v0.8.1) --- updated-dependencies: - dependency-name: petgraph dependency-version: 0.8.1 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add Spark-compatible char expression * Add slt test
* update window function * prettier fix * Update window_functions.md
Bumps [substrait](https://github.com/substrait-io/substrait-rs) from 0.55.1 to 0.56.0. - [Release notes](https://github.com/substrait-io/substrait-rs/releases) - [Changelog](https://github.com/substrait-io/substrait-rs/blob/main/CHANGELOG.md) - [Commits](substrait-io/substrait-rs@v0.55.1...v0.56.0) --- updated-dependencies: - dependency-name: substrait dependency-version: 0.56.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <[email protected]> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
`TempDir::into_path` "leaks" the temp dir. This updates the `tempfile` crate to a version where this method is deprecated and fixes all usages.
…tionary key hashes
Removing a large file from the commit history (test_data.txt) with git filter-repo messed up this branch.
Replaced with #16466
Labels
catalog
Related to the catalog crate
common
Related to common crate
core
Core DataFusion crate
datasource
Changes to the datasource crate
development-process
Related to development process of DataFusion
documentation
Improvements or additions to documentation
execution
Related to the execution crate
ffi
Changes to the ffi crate
functions
Changes to functions implementation
logical-expr
Logical plan and expressions
optimizer
Optimizer rules
physical-expr
Changes to the physical-expr crates
physical-plan
Changes to the physical-plan crate
proto
Related to proto crate
spark
sql
SQL Planner
sqllogictest
SQL Logic Tests (.slt)
substrait
Changes to the substrait crate
Which issue does this PR close?
Rationale for this change
This change addresses a bug where `combine_hashes` was applied even if a dictionary value was null, leading to incorrect hash computations. This was discovered while investigating #16266.
Additionally, this PR extends the test coverage for aggregate functions to better validate behavior with dictionary arrays containing nulls.
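To illustrate the validity check, here is a minimal std-only sketch. It is not the DataFusion code: `DictColumn`, `hash_value`, and the mixing formula inside `combine_hashes` are hypothetical stand-ins, and the real `hash_dictionary` operates on Arrow dictionary arrays. The point it shows is that a row is folded into the running hash only when both its key and the value that key points to are non-null, so the two kinds of null leave the hash in the same state.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a dictionary-encoded column: each row holds an
/// optional key into `values`, and each dictionary value may itself be null.
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<Option<&'static str>>,
}

/// Illustrative mixer only; the real `combine_hashes` may be defined differently.
fn combine_hashes(l: u64, r: u64) -> u64 {
    (17u64 * 37).wrapping_add(l).wrapping_mul(37).wrapping_add(r)
}

/// Hash a single non-null dictionary value.
fn hash_value(v: &str) -> u64 {
    let mut h = DefaultHasher::new();
    v.hash(&mut h);
    h.finish()
}

/// Fold the column into `hashes`, one entry per row, skipping logical nulls.
fn hash_dictionary(col: &DictColumn, hashes: &mut [u64]) {
    // Hash each dictionary value once; null values get no hash at all.
    let value_hashes: Vec<Option<u64>> =
        col.values.iter().map(|v| v.map(hash_value)).collect();

    for (hash, key) in hashes.iter_mut().zip(&col.keys) {
        // `None` if the key is null OR the value it points to is null.
        if let Some(vh) = key.and_then(|k| value_hashes[k]) {
            *hash = combine_hashes(vh, *hash);
        }
    }
}

fn main() {
    let col = DictColumn {
        keys: vec![Some(0), Some(1), None, Some(0)],
        values: vec![Some("a"), None],
    };
    let mut hashes = vec![0u64; col.keys.len()];
    hash_dictionary(&col, &mut hashes);

    // Row 1 (valid key, null value) and row 2 (null key) are both untouched.
    assert_eq!(hashes[1], hashes[2]);
    // Rows 0 and 3 reference the same non-null value and hash identically.
    assert_eq!(hashes[0], hashes[3]);
    println!("{hashes:?}");
}
```

In this sketch, checking only the key would index into a null value slot and mix a meaningless hash into row 1, which is the kind of inconsistency the description above refers to.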
What changes are included in this PR?
- `hash_dictionary`: ensure `combine_hashes` is only applied when the dictionary value is valid.
- Aggregate function tests (`COUNT`, `SUM`, `MIN`, `MAX`, `MEDIAN`, `FIRST_VALUE`, `LAST_VALUE`) using dictionary arrays with null keys and values.
Are these changes tested?
Yes, extensive new tests are added covering:
Are there any user-facing changes?
No direct API changes, but query behavior involving dictionary arrays with nulls will now produce correct and consistent results in line with SQL semantics.
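As a rough picture of what "in line with SQL semantics" means for such columns, the sketch below evaluates a few aggregates over a dictionary-encoded column that has both null keys and null values. The `DictColumn` type and `valid_rows` helper are hypothetical and std-only, not the arrow-rs or DataFusion API. Rows whose key is null, or whose key points at a null value, are treated as NULL and ignored by the aggregates.

```rust
/// Hypothetical dictionary-encoded integer column (not the arrow-rs type).
struct DictColumn {
    keys: Vec<Option<usize>>,
    values: Vec<Option<i64>>,
}

impl DictColumn {
    /// Iterate over the non-null logical values, row by row.
    fn valid_rows(&self) -> impl Iterator<Item = i64> + '_ {
        self.keys
            .iter()
            .filter_map(move |k| k.and_then(|i| self.values[i]))
    }
}

fn main() {
    let col = DictColumn {
        // Row 1 has a null key; row 2 has a valid key pointing at a null value.
        keys: vec![Some(0), None, Some(1), Some(2), Some(0)],
        values: vec![Some(10), None, Some(3)],
    };

    // COUNT(col) counts only non-null rows.
    let count = col.valid_rows().count();
    // SUM, MIN, and MAX skip nulls; `None` models SQL NULL for all-null input.
    let sum = col.valid_rows().reduce(|a, b| a + b);
    let min = col.valid_rows().min();
    let max = col.valid_rows().max();

    assert_eq!(count, 3);
    assert_eq!(sum, Some(23));
    assert_eq!(min, Some(3));
    assert_eq!(max, Some(10));
    println!("count={count} sum={sum:?} min={min:?} max={max:?}");
}
```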