[SPARK-52975][SQL] Simplify field names in pushdown join sql #51686
@dengziming this way, if you have left side column COL and right side columns COL, COL_0, the alias generator will generate COL_0, which would conflict with COL_0 from the right side.
Good catch @PetarVasiljevic-DB, let me think of another way.
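To make the conflict above concrete, here is a minimal sketch. It is not the PR's code: `naive` and `reserved` are hypothetical helpers, and the sketch assumes names are unique within each side. `naive` checks generated aliases against the left side only, while `reserved` claims every original name from both sides before generating anything.

```scala
import scala.collection.mutable

object AliasCollisionSketch {
  // Hypothetical naive generator: a generated alias is only checked
  // against the left-side names, so it can shadow a right-side column.
  def naive(left: Seq[String], right: Seq[String]): Seq[String] = {
    val leftNames = left.toSet
    right.map { name =>
      if (!leftNames.contains(name)) name
      else Iterator.from(0).map(i => s"${name}_$i")
        .find(c => !leftNames.contains(c)).get
    }
  }

  // Hypothetical safer variant: all original names on both sides are
  // claimed up front, and each generated alias is claimed as well.
  def reserved(left: Seq[String], right: Seq[String]): Seq[String] = {
    val leftNames = left.toSet
    val claimed = mutable.HashSet[String]()
    claimed ++= left
    claimed ++= right
    right.map { name =>
      if (!leftNames.contains(name)) name
      else {
        val alias = Iterator.from(0).map(i => s"${name}_$i")
          .find(c => !claimed.contains(c)).get
        claimed += alias
        alias
      }
    }
  }
}
```

With left `COL` and right `COL, COL_0`, `naive` returns `COL_0, COL_0` (a duplicate), while `reserved` returns `COL_1, COL_0`.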
// suffix for more attempts.
var candidate = name
// Ensure candidate alias is unique by checking against existing names.
while (allClaimedAliases.contains(candidate)) {
This can become too expensive, no? The complexity is O(columnCount^2) in the worst case, and I have seen users with 1000s of columns in their tables. Consider the following worst-case scenario:
col, col_0, col_1, col_2, ..., col_999
join col, col_0, col_1, col_2, ..., col_999
This would take about a million operations. What are your thoughts?
@PetarVasiljevic-DB This is a very subtle case. I considered it at first, but I thought it was O(n), so I ignored it. I have now added an aliasSuffixIndex to avoid the O(n^2) case. PTAL again.
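As a rough illustration of why remembering a suffix index helps, the sketch below counts set lookups for inputs where one base name recurs many times. It is an assumption about the general idea only: `dedupe`, `nextSuffix`, and `rememberSuffix` are hypothetical names, and the PR's actual aliasSuffixIndex may work differently.

```scala
import scala.collection.mutable

object SuffixIndexSketch {
  // Returns the chosen aliases plus the number of `claimed` lookups made.
  // With rememberSuffix = false every name restarts its search at suffix 0;
  // with rememberSuffix = true the search resumes where it last stopped.
  def dedupe(names: Seq[String], rememberSuffix: Boolean): (Seq[String], Int) = {
    val claimed = mutable.HashSet[String]()
    val nextSuffix = mutable.HashMap[String, Int]() // hypothetical per-name index
    var probes = 0
    val aliases = names.map { name =>
      var candidate = name
      var suffix = if (rememberSuffix) nextSuffix.getOrElse(name, 0) else 0
      probes += 1 // lookup for the unsuffixed name
      while (claimed.contains(candidate)) {
        candidate = s"${name}_$suffix"
        suffix += 1
        probes += 1 // lookup for the next suffixed candidate
      }
      nextSuffix(name) = suffix
      claimed += candidate
      candidate
    }
    (aliases, probes)
  }
}
```

For 1000 copies of `col`, restarting from zero performs 500500 lookups, while resuming from the remembered suffix performs 1999; both variants produce the same aliases.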
LGTM, the generated text is much clearer and, more importantly, it is now deterministic. Thanks for the change!
By the way, could we move generateColumnAliasesForDuplicatedName under pushdownJoin? Or above it, it doesn't really matter; I just find it too big to have as a nested method.
// Generate candidate alias: use original name for the first attempt, then append
// suffix for more attempts.
var candidate = if (attempt == 0) name else s"${name}_$attempt"
Can't we remove this check and make it simpler, like var candidate = name?
I did this so we can keep more names unchanged: if we have col, col, col, then the result can be col, col_1, col_2, and the first col doesn't need an alias. I made a small improvement in the latest commit, PTAL.
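The behaviour described here can be sketched with the candidate expression quoted in the diff (`if (attempt == 0) name else s"${name}_$attempt"`). The surrounding `aliases` helper and the `claimed` set are hypothetical scaffolding for illustration, not the PR's actual method.

```scala
import scala.collection.mutable

object FirstNameUnchangedSketch {
  // Attempt 0 tries the original name, so the first occurrence of a
  // name keeps it; later occurrences fall through to name_1, name_2, ...
  def aliases(names: Seq[String]): Seq[String] = {
    val claimed = mutable.HashSet[String]()
    names.map { name =>
      val candidate = Iterator.from(0)
        .map(attempt => if (attempt == 0) name else s"${name}_$attempt")
        .find(c => !claimed.contains(c))
        .get
      claimed += candidate
      candidate
    }
  }
}
```

Calling `FirstNameUnchangedSketch.aliases(Seq("col", "col", "col"))` yields `col, col_1, col_2`, so only the second and third columns need an alias.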
LGTM.
What changes were proposed in this pull request?
When pushing down join SQL, we generate aliases for duplicated column names, but the generated aliases are too long to read easily.
Before this change:
After this change.
Why are the changes needed?
Make code-generated JDBC SQL clearer and deterministic.
Does this PR introduce any user-facing change?
No
How was this patch tested?
Existing tests can ensure no side effects are introduced.
Was this patch authored or co-authored using generative AI tooling?
Generated-by: Trae.