[SPARK-52975][SQL] Simplify field names in pushdown join sql #51686

Open
wants to merge 6 commits into
base: master

dengziming
Member

@dengziming dengziming commented Jul 28, 2025

What changes were proposed in this pull request?

When pushing down join SQL, we generated aliases for duplicated names, but the aliases are too long to read.

Before this change:

SELECT "ID_bf822dc6_e06d_492c_a489_1e92a6fe84a0","AMOUNT_c9f3fc67_62f8_4ec6_9c3f_b7ee7bafcb5a","ADDRESS_d937a313_3e09_4b97_b91f_b2a47ef5e31d","ID","AMOUNT","ADDRESS" FROM xxxx     

RelationV2[ID_bf822dc6_e06d_492c_a489_1e92a6fe84a0#18, AMOUNT_c9f3fc67_62f8_4ec6_9c3f_b7ee7bafcb5a#19, ADDRESS_d937a313_3e09_4b97_b91f_b2a47ef5e31d#20, ID#21, AMOUNT#22, ADDRESS#23] join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1

After this change:

SELECT "ID","AMOUNT","ADDRESS","ID_1","AMOUNT_1","ADDRESS_1" FROM xxx   

RelationV2[ID#18, AMOUNT#19, ADDRESS#20, ID_1#21, AMOUNT_1#22, ADDRESS_1#23] join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1

Why are the changes needed?

Make code-generated JDBC SQL clearer and deterministic.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests can ensure no side effects are introduced.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Trae.

@github-actions github-actions bot added the SQL label Jul 28, 2025
@dengziming
Member Author

cc @PetarVasiljevic-DB

Contributor

@PetarVasiljevic-DB PetarVasiljevic-DB left a comment

@dengziming with this approach, if the left side has column COL and the right side has columns COL and COL_0, the alias generator will generate COL_0, which conflicts with COL_0 from the right side.

@dengziming
Member Author

> @dengziming with this approach, if the left side has column COL and the right side has columns COL and COL_0, the alias generator will generate COL_0, which conflicts with COL_0 from the right side.

Good catch @PetarVasiljevic-DB, let me think of another way.
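
The conflict described above can be reproduced with a small sketch. The `naiveAliases` helper below is hypothetical and is not the code from this PR; it assumes a scheme in which each repeat of a name gets a counter suffix without checking whether that alias already exists as a real column name.

```scala
import scala.collection.mutable

object NaiveAliasCollision {
  // Hypothetical naive scheme (an assumption, not the PR's code): the first
  // occurrence keeps its name, the k-th repeat becomes name_(k-1), and no
  // check is made against names that already exist on either side.
  def naiveAliases(left: Seq[String], right: Seq[String]): Seq[String] = {
    val seen = mutable.HashMap.empty[String, Int]
    (left ++ right).map { name =>
      val count = seen.getOrElse(name, 0)
      seen(name) = count + 1
      if (count == 0) name else s"${name}_${count - 1}"
    }
  }

  def main(args: Array[String]): Unit = {
    // Left side: COL. Right side: COL, COL_0.
    val aliases = naiveAliases(Seq("COL"), Seq("COL", "COL_0"))
    // The generated alias COL_0 clashes with the real right-side column COL_0.
    println(aliases.mkString(", "))
    println(s"unique: ${aliases.distinct.size == aliases.size}")
  }
}
```

Any scheme that derives aliases from a short suffix has to check each candidate against every name already in use, which is what the later revisions in this thread do.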

// suffix for more attempts.
var candidate = name
// Ensure candidate alias is unique by checking against existing names.
while (allClaimedAliases.contains(candidate)) {
Contributor

This can become too expensive, no? The complexity is O(columnCount^2) in the worst case, and I have seen users with 1000s of columns in a table. So for the following worst-case scenario:
col, col_0, col_1, col_2, ..., col_999 join col, col_0, col_1, col_2, ..., col_999, this would take about a million operations.

What are your thoughts?
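
One way quadratic cost can arise is sketched below. This is an illustration under assumptions, not the PR's code and not necessarily the exact scenario in the comment: the hypothetical `countProbes` helper counts set lookups when every duplicate restarts its suffix search from 1, using a repeated column name as input.

```scala
import scala.collection.mutable

object RestartProbingCost {
  // Counts how many lookups against the claimed set are needed when each
  // name restarts its suffix search from 1 (hypothetical naive loop).
  def countProbes(names: Seq[String]): Long = {
    val claimed = mutable.HashSet.empty[String]
    var probes = 0L
    names.foreach { name =>
      var candidate = name
      var suffix = 1
      probes += 1 // lookup for the unmodified name
      while (claimed.contains(candidate)) {
        candidate = s"${name}_$suffix"
        suffix += 1
        probes += 1 // lookup for the next suffixed candidate
      }
      claimed += candidate
    }
    probes
  }

  def main(args: Array[String]): Unit = {
    // 1000 columns that all share one name: the k-th one needs k lookups.
    println(countProbes(Seq.fill(1000)("col")))
  }
}
```

With 1000 identical names the k-th occurrence needs k lookups, so the total is 1000 * 1001 / 2 = 500500, which grows quadratically with the column count.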

Member Author

@PetarVasiljevic-DB This is a very subtle case. I considered it at first, but I thought it was O(n), so I ignored it. I have now added an aliasSuffixIndex to avoid the O(n^2) case. PTAL again.
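
A minimal sketch of the suffix-index idea follows. It is an assumption about how such an index could work, not the code merged in this PR: `uniqueAliases` and `nextSuffix` are hypothetical names, with `nextSuffix` standing in for the role that aliasSuffixIndex is described as playing.

```scala
import scala.collection.mutable

object AliasSuffixIndexSketch {
  def uniqueAliases(names: Seq[String]): Seq[String] = {
    // Every alias handed out so far.
    val claimed = mutable.HashSet.empty[String]
    // Per base name, the next suffix worth trying, so that later duplicates
    // do not rescan suffixes that are already known to be taken.
    val nextSuffix = mutable.HashMap.empty[String, Int]
    names.map { name =>
      var candidate = name // first attempt: keep the original name
      var idx = nextSuffix.getOrElse(name, 1)
      while (claimed.contains(candidate)) {
        candidate = s"${name}_$idx"
        idx += 1
      }
      nextSuffix(name) = idx
      claimed += candidate
      candidate
    }
  }

  def main(args: Array[String]): Unit = {
    val cols = Seq("ID", "AMOUNT", "ADDRESS", "ID", "AMOUNT", "ADDRESS")
    println(uniqueAliases(cols).mkString(", "))
  }
}
```

Because the index for a base name only moves forward, each suffixed candidate is tried at most once per base name, and the result depends only on the input order, which keeps the generated SQL deterministic.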

Contributor

@PetarVasiljevic-DB PetarVasiljevic-DB left a comment

LGTM, the generated text is much clearer, and more importantly, it is deterministic now. Thanks for the change!

By the way, could we move generateColumnAliasesForDuplicatedName below pushdownJoin? Or above it, it doesn't really matter; I just find it too big to have as a nested method.


// Generate candidate alias: use original name for the first attempt, then append
// suffix for more attempts.
var candidate = if (attempt == 0) name else s"${name}_$attempt"

Can't we remove this check and make it simpler, like
var candidate = name?

Member Author

I did this so that we keep more names unchanged: if we have col, col, col, the result can be col, col_1, col_2, and the first col doesn't need an alias. I made a small improvement in the latest commit, PTAL.

LGTM.

Member Author

@dengziming dengziming Jul 29, 2025

I think I misunderstood your idea. I tried it locally and the result was unexpected, as shown in the picture below (it begins with b_2 instead of b_1). We could make some adjustments based on it, but the gain in LOC is small.

image
