Fix duplicate field name error in Join::try_new_with_project_input during physical planning #16454

Draft: wants to merge 1 commit into base: main

LiaCastaneda
Contributor

@LiaCastaneda LiaCastaneda commented Jun 19, 2025

Which issue does this PR close?

  • Closes #.

Rationale for this change

There have been occasions where Substrait queries fail during logical planning because a node's resulting schema contains duplicate field names. Neither Arrow nor DataFusion allows this (DataFusion verifies it in check_names whenever a schema is created). It occurred most commonly when constructing join schemas: in inner joins where both the left and right inputs have a field with the same name, the output schema ends up with duplicate fields.
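
To make the failure mode concrete, here is a minimal sketch of how concatenating two join inputs can trip a name-uniqueness check. It is a toy model, not DataFusion's API: the `Field` struct, this `check_names`, and `join_schema` are hypothetical stand-ins, and the real DFSchema::check_names may apply additional rules.

```rust
use std::collections::HashSet;

/// Toy stand-in for a schema field: an optional table qualifier plus a name.
/// (Hypothetical type, not DataFusion's field representation.)
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct Field {
    pub qualifier: Option<String>,
    pub name: String,
}

impl Field {
    pub fn new(qualifier: Option<&str>, name: &str) -> Self {
        Field {
            qualifier: qualifier.map(str::to_string),
            name: name.to_string(),
        }
    }
}

/// Simplified uniqueness check: reject two fields that share the same
/// (qualifier, name) pair. This is an assumption-level model of the check.
pub fn check_names(fields: &[Field]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for f in fields {
        if !seen.insert((f.qualifier.as_deref(), f.name.as_str())) {
            return Err(format!("duplicate field name: {}", f.name));
        }
    }
    Ok(())
}

/// In this model, an inner join's output schema is the left fields followed
/// by the right fields, validated by `check_names`.
pub fn join_schema(left: &[Field], right: &[Field]) -> Result<Vec<Field>, String> {
    let fields: Vec<Field> = left.iter().chain(right.iter()).cloned().collect();
    check_names(&fields)?;
    Ok(fields)
}
```

In this sketch, two unqualified `id` fields (one per side) make `join_schema` return an error, while the same names under distinct qualifiers pass.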

This was previously addressed in the Substrait consumer path in Fix duplicate unqualified Field name (schema error) on join queries, and there is an existing function that handles this kind of situation for join rels: requalify_sides_if_needed.

We are now encountering the same error again, but this time during physical planning while mutating the logical plan, rather than in the logical planning phase itself.

Specifically, it arises on this line when the join key is a non-Column expression, since the planner has to create a Projection with a Column expr on top (by calling wrap_projection_for_join_if_necessary) to ensure correct execution, IIUC. The Join logical node is then updated to use this new column. The issue occurs when the new logical node is built using Join::try_new_with_project_input, because the function that constructs the join schema also checks that names are unique.
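
The key-wrapping step can be sketched roughly as follows. This is a simplified model under stated assumptions, not the actual wrap_projection_for_join_if_necessary: the `Expr` enum and `wrap_keys_sketch` are hypothetical, and naming the new column after the expression text is an assumption of this sketch.

```rust
/// Toy expression type (hypothetical, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    /// A plain column reference.
    Column(String),
    /// Any non-column expression, kept as its display text, e.g. "a + 1".
    Other(String),
}

/// Returns (projected column names, join keys as column names, need_project).
/// Non-column keys are appended to the projection so that the join can refer
/// to every key as a plain column.
pub fn wrap_keys_sketch(
    input_columns: &[&str],
    keys: &[Expr],
) -> (Vec<String>, Vec<String>, bool) {
    let mut projected: Vec<String> = input_columns.iter().map(|c| c.to_string()).collect();
    let mut key_columns = Vec::new();
    let mut need_project = false;
    for key in keys {
        match key {
            // Column keys are used as-is; no extra projection is required.
            Expr::Column(name) => key_columns.push(name.clone()),
            // Non-column keys become a new projected column.
            Expr::Other(text) => {
                need_project = true;
                projected.push(text.clone());
                key_columns.push(text.clone());
            }
        }
    }
    (projected, key_columns, need_project)
}
```

Because the projection changes the input that the Join node sits on, the join has to be rebuilt on top of it, which is the point where the schema uniqueness check runs again.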

What changes are included in this PR?

My proposed workaround is to qualify the sides of the join in try_new_with_project_input by calling the same function the Substrait consumer uses, requalify_sides_if_needed, but I'm open to suggestions 🙇‍♀️
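
As a rough illustration of the idea, the sketch below requalifies both sides only when they collide. It is not the real requalify_sides_if_needed: the `Field` alias and `requalify_sides_sketch` are hypothetical, and the "left"/"right" alias names are an assumption. The returned flag plays the role of the `requalified` value used in the diff.

```rust
/// Toy field: (optional qualifier, name). Hypothetical, not DataFusion's schema type.
pub type Field = (Option<String>, String);

/// Replace every qualifier on one side with the given alias.
fn alias_side(fields: Vec<Field>, alias: &str) -> Vec<Field> {
    fields
        .into_iter()
        .map(|(_, name)| (Some(alias.to_string()), name))
        .collect()
}

/// Returns both sides plus a flag saying whether they were requalified.
/// Sides are left untouched unless some (qualifier, name) pair occurs on both.
pub fn requalify_sides_sketch(
    left: Vec<Field>,
    right: Vec<Field>,
) -> (Vec<Field>, Vec<Field>, bool) {
    let conflict = right.iter().any(|r| left.iter().any(|l| l == r));
    if !conflict {
        return (left, right, false);
    }
    // Assumed alias names; the real helper may choose differently.
    (alias_side(left, "left"), alias_side(right, "right"), true)
}
```

Requalifying only on conflict keeps existing qualifiers intact for the common case where the join inputs are already distinguishable.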

Are these changes tested?

Are there any user-facing changes?

No.

@github-actions github-actions bot added logical-expr Logical plan and expressions core Core DataFusion crate substrait Changes to the substrait crate common Related to common crate labels Jun 19, 2025
@LiaCastaneda LiaCastaneda changed the title Fix duplicates on Join creation during physcial planning Fix duplicates on logical Join creation during physical planning Jun 19, 2025
@LiaCastaneda LiaCastaneda changed the title Fix duplicates on logical Join creation during physical planning Fix duplicate field name error in Join::try_new_with_project_input during physical planning Jun 20, 2025
@LiaCastaneda LiaCastaneda force-pushed the lia/fix-duplicate-unqualified-from-physcial-planning-error branch 2 times, most recently from 97caf0f to fb9d758 Compare June 20, 2025 11:00
/// This is especially useful for queries that come as Substrait, since Substrait doesn't currently allow specifying
/// aliases, neither for columns nor for tables. DataFusion requires columns to be uniquely identifiable, in some
/// places (see e.g. DFSchema::check_names).
pub fn requalify_sides_if_needed(
Contributor Author

I moved this helper function to the logical plan builder module, since it is now used not only by the Substrait consumer but also by plan.rs.

@LiaCastaneda LiaCastaneda force-pushed the lia/fix-duplicate-unqualified-from-physcial-planning-error branch from fb9d758 to 6244060 Compare June 20, 2025 12:09
Comment on lines +975 to +987
// Re-qualify the join schema only if the inputs were previously requalified in
// `try_new_with_project_input`. This ensures that when building the Projection
// it can correctly resolve field nullability and data types
// by disambiguating fields from the left and right sides of the join.
let qualified_join_schema = if requalified {
    Arc::new(qualify_join_schema_sides(
        join_schema,
        original_left,
        original_right,
    )?)
} else {
    Arc::clone(join_schema)
};
Contributor Author

@LiaCastaneda LiaCastaneda Jun 20, 2025

The rationale for qualifying the schema is that the logical Projection built afterwards derives its fields from the expression names in exprlist_to_fields: it looks in new_join.schema() and tries to match each expr to a field in the schema. If an expr::Column has no qualifier and multiple candidate fields could correspond to it, we get an ambiguity error. Qualifying the schema prevents this.
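
The ambiguity can be illustrated with a small lookup sketch. It is a toy model under assumptions, not the lookup DataFusion performs: `Field` and `resolve_sketch` are hypothetical names, and the error strings are illustrative only.

```rust
/// Toy field: (optional qualifier, name). Hypothetical, not DataFusion's schema type.
pub type Field = (Option<String>, String);

/// Resolve a column reference to a field index.
/// An unqualified reference matches on name alone, so it is ambiguous when
/// several fields share that name; a qualified reference must match both.
pub fn resolve_sketch(
    schema: &[Field],
    qualifier: Option<&str>,
    name: &str,
) -> Result<usize, String> {
    let mut found = None;
    for (i, (q, n)) in schema.iter().enumerate() {
        let qualifier_ok = qualifier.is_none() || q.as_deref() == qualifier;
        if n.as_str() == name && qualifier_ok {
            if found.is_some() {
                return Err(format!("ambiguous reference: {}", name));
            }
            found = Some(i);
        }
    }
    found.ok_or_else(|| format!("field not found: {}", name))
}
```

With two fields named `id`, an unqualified lookup fails as ambiguous in this model, while a lookup qualified with one side's alias resolves to exactly one field.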
