[SPARK-52975][SQL] Simplify field names in pushdown join sql #51686

dengziming · 2025-07-28T12:16:59Z

What changes were proposed in this pull request?

When pushing down join SQL, we generate aliases for duplicated column names, but the generated aliases are too long to read.

Before this change:

SELECT "ID_bf822dc6_e06d_492c_a489_1e92a6fe84a0","AMOUNT_c9f3fc67_62f8_4ec6_9c3f_b7ee7bafcb5a","ADDRESS_d937a313_3e09_4b97_b91f_b2a47ef5e31d","ID","AMOUNT","ADDRESS" FROM xxxx     

RelationV2[ID_bf822dc6_e06d_492c_a489_1e92a6fe84a0#18, AMOUNT_c9f3fc67_62f8_4ec6_9c3f_b7ee7bafcb5a#19, ADDRESS_d937a313_3e09_4b97_b91f_b2a47ef5e31d#20, ID#21, AMOUNT#22, ADDRESS#23] join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1

After this change:

SELECT "ID","AMOUNT","ADDRESS","ID_1","AMOUNT_1","ADDRESS_1" FROM xxx   

RelationV2[ID#18, AMOUNT#19, ADDRESS#20, ID_1#21, AMOUNT_1#22, ADDRESS_1#23] join_pushdown_catalog.JOIN_SCHEMA.JOIN_TABLE_1

Why are the changes needed?

Make code-generated JDBC SQL clearer and deterministic.
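
The "Before" aliases above look like a column name followed by a random UUID with its dashes replaced by underscores. As a rough illustration of why that shape is neither readable nor reproducible between runs, here is a minimal sketch. `uuidAlias` and `suffixAlias` are hypothetical helpers written for this comparison, not the functions used in Spark, and the UUID origin of the old suffix is an assumption based only on the output shown.

```scala
import java.util.UUID

object AliasStyles {
  // Hypothetical: mimics the shape of the old aliases (name + random UUID).
  // Assumption: the old suffix came from a UUID; the PR text only shows the output.
  def uuidAlias(name: String): String =
    s"${name}_${UUID.randomUUID().toString.replace('-', '_')}"

  // Hypothetical: mimics the shape of the new aliases (name + small integer).
  def suffixAlias(name: String, index: Int): String = s"${name}_$index"
}
```

Calling `AliasStyles.uuidAlias("ID")` twice yields two different strings, so the generated SQL text changes from run to run, whereas `AliasStyles.suffixAlias("ID", 1)` always yields `ID_1`.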

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests can ensure no side effects are introduced.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Trae.

dengziming · 2025-07-28T12:22:18Z

cc @PetarVasiljevic-DB

PetarVasiljevic-DB

@dengziming this way, if you have a left-side column COL and right-side columns COL, COL_0, the alias generator will generate COL_0, which would conflict with COL_0 from the right side.
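
To make the collision concrete, here is a minimal sketch of a naive scheme that appends a fixed suffix without checking the other column names. The exact scheme in the first commit is not shown in this thread, so `naiveRightAliases` is a hypothetical stand-in that only reproduces the failure mode described above.

```scala
object NaiveAliasCollision {
  // Hypothetical naive scheme: rename every right-side column that clashes with a
  // left-side name to s"${name}_0", without checking the other right-side names.
  def naiveRightAliases(left: Seq[String], right: Seq[String]): Seq[String] = {
    val leftNames = left.toSet
    right.map(name => if (leftNames.contains(name)) s"${name}_0" else name)
  }

  // True when the combined output of the join would contain a repeated column name.
  def hasDuplicates(names: Seq[String]): Boolean = names.distinct.size != names.size
}
```

With `left = Seq("COL")` and `right = Seq("COL", "COL_0")`, the right side becomes `COL_0, COL_0`, so the pushed-down query would select two columns with the same name.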

dengziming · 2025-07-28T12:47:18Z

> @dengziming this way, if you have a left-side column COL and right-side columns COL, COL_0, the alias generator will generate COL_0, which would conflict with COL_0 from the right side.

Good catch @PetarVasiljevic-DB, let me think of another way.

PetarVasiljevic-DB · 2025-07-29T08:22:30Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+            // suffix for more attempts.
+            var candidate = name
+            // Ensure candidate alias is unique by checking against existing names.
+            while (allClaimedAliases.contains(candidate)) {


This can become too expensive, no? The complexity is O(columnCount^2) in the worst case, and I have seen users with 1000s of columns in their tables. So for the following worst-case scenario:
col, col_0, col_1, col_2, ..., col_999 join col, col_0, col_1, col_2, ..., col_999
this would take on the order of a million operations.

What are your thoughts?

@PetarVasiljevic-DB This is a very subtle case. I considered it at first, but I thought it was O(n), so I ignored it. I have now added an aliasSuffixIndex to avoid the O(n^2) case. PTAL again.
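
The idea of remembering the next suffix to try can be sketched as follows. This is not the code from the PR: `dedupe`, `reserved`, `emitted` and `nextSuffix` are hypothetical names, and the sketch assumes the suffix index is a per-name counter, which may differ from how `aliasSuffixIndex` is actually used.

```scala
import scala.collection.mutable

object SuffixIndexAliases {
  // Returns one unique output name per input column. The first occurrence of each
  // name is kept unchanged; later occurrences get _1, _2, ... appended.
  def dedupe(names: Seq[String]): Seq[String] = {
    // Every original name is reserved up front so a generated alias never shadows it.
    val reserved = mutable.HashSet[String](names: _*)
    // Original names that have already been emitted once.
    val emitted = mutable.HashSet.empty[String]
    // Assumption: one counter per original name, holding the next suffix to try.
    val nextSuffix = mutable.HashMap.empty[String, Int]
    names.map { name =>
      if (emitted.add(name)) {
        name // first occurrence keeps its name
      } else {
        var i = nextSuffix.getOrElse(name, 1)
        while (reserved.contains(s"${name}_$i")) i += 1
        val alias = s"${name}_$i"
        reserved += alias
        nextSuffix(name) = i + 1 // never re-probe suffixes that already failed
        alias
      }
    }
  }
}
```

Under these assumptions, `dedupe(Seq("col", "col", "col"))` gives `col, col_1, col_2`, and `dedupe(Seq("COL", "COL", "COL_1"))` gives `COL, COL_2, COL_1`, leaving the original `COL_1` untouched. Because each failed probe advances the stored counter, a repeated name does not restart its search from the beginning.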

PetarVasiljevic-DB

LGTM, the generated text is much clearer, and more importantly, it is deterministic now. Thanks for the change!

By the way, could we move generateColumnAliasesForDuplicatedName below pushdownJoin? Or above it, it doesn't really matter; I just find it too big to have as a nested method.

abhiips07 · 2025-07-29T10:40:47Z

...re/src/main/scala/org/apache/spark/sql/execution/datasources/v2/V2ScanRelationPushDown.scala

+
+        // Generate candidate alias: use original name for the first attempt, then append
+        // suffix for more attempts.
+        var candidate = if (attempt == 0) name else s"${name}_$attempt"


Can't we remove this check and make it simpler, like
var candidate = name?

I did this so that we keep as many names unchanged as possible: if we have col, col, col, the result can be col, col_1, col_2, and the first col doesn't need an alias. I made a small improvement in the latest commit, PTAL.

I think I misunderstood your idea at first. I tried it locally, and the result was unexpected, as shown in the picture below (it begins with b_2 instead of b_1). We could make some adjustments based on it, but the LOC gain is not great.

[SPARK-52975][SQL] Simplify field names in pushdown join sql

f4c0d73

github-actions bot added the SQL label Jul 28, 2025

PetarVasiljevic-DB suggested changes Jul 28, 2025

dengziming added 2 commits July 29, 2025 11:48

More deterministic way

1a63e31

More deterministic way

eee3692

PetarVasiljevic-DB reviewed Jul 29, 2025

Avoid o(n^2) worst case

6cfea82

PetarVasiljevic-DB approved these changes Jul 29, 2025

refactor: move big method out.

3241b26

dengziming force-pushed the SPARK-52975 branch from c0dd5f0 to 3241b26 Compare July 29, 2025 10:04

PetarVasiljevic-DB approved these changes Jul 29, 2025

abhiips07 reviewed Jul 29, 2025

more improvement

5de7386
