(minor improvement): small refactoring in make_token_replica_map() #513
base: master
Conversation
@mykaul, let me know when you're done; it looks good as is.
I'm not done yet. I'd like to understand the code better; I think it can be slightly improved further.
While trying to look at some random (flaky?) test (scylladb#510 (comment)) I saw some (minor) improvements that can be made to make_token_replica_map():
1. Hoist some redundant len() calls out of the loop(s).
2. Align some variable names: start them with num_.
3. Move the token_offset and host assignments inside the loop closer to where they are used.
4. Only add DCs and hosts that are in the map.
All of these are probably very minor improvements; perhaps in a large cluster it will be noticeable.
Signed-off-by: Yaniv Kaul <[email protected]>
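As a rough illustration of points 1-3 above (not the driver's actual code; pick_hosts, ring, and replicas_wanted are made-up names for the example), hoisting loop-invariant len() calls and using num_-prefixed names looks roughly like this:

    # Illustrative sketch only, a simplified stand-in for the kind of change
    # described above, not the real make_token_replica_map() implementation.
    def pick_hosts(ring, hosts, replicas_wanted):
        num_tokens = len(ring)    # computed once, outside the loop
        num_hosts = len(hosts)    # likewise hoisted out of the loop
        picked = []
        for index in range(num_tokens):
            host = hosts[index % num_hosts]   # lookup kept close to where it is used
            if host not in picked:
                picked.append(host)
            if len(picked) == replicas_wanted:
                break
        return picked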
cd57c4d to 64c0e38
if host.datacenter and host.rack:
    dc_racks[host.datacenter].add(host.rack)
    hosts_per_dc[host.datacenter].add(host)
host_dc = host.datacenter
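For readers outside the driver codebase, the grouping in the snippet above amounts to building per-datacenter sets. A self-contained sketch of that pattern, using a namedtuple as a stand-in for the driver's Host class:

    from collections import defaultdict, namedtuple

    # Stand-in Host type for illustration; the real driver defines its own Host class.
    Host = namedtuple("Host", ["address", "datacenter", "rack"])

    hosts = [Host("10.0.0.1", "dc1", "r1"),
             Host("10.0.0.2", "dc1", "r2"),
             Host("10.0.0.3", "dc2", "r1")]

    dc_racks = defaultdict(set)
    hosts_per_dc = defaultdict(set)
    for host in hosts:
        if host.datacenter and host.rack:
            dc_racks[host.datacenter].add(host.rack)
            hosts_per_dc[host.datacenter].add(host)

    # dc_racks now maps 'dc1' to {'r1', 'r2'} and 'dc2' to {'r1'};
    # hosts_per_dc holds two hosts under 'dc1' and one under 'dc2'.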
@dkropachev - I've added this small change.
Pull Request Overview
This PR refactors the make_token_replica_map() method in cassandra/metadata.py to improve performance and code clarity through several optimizations:
- Pre-filter hosts to only process those in datacenters with replication factors
- Move length calculations outside loops to avoid redundant computations
- Improve variable naming consistency with num_ prefixes
- Reorganize code structure for better locality of use
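A minimal sketch of the pre-filtering idea in the first bullet, assuming host objects with a datacenter attribute (the helper function and host list here are illustrative; dc_rf_map mirrors the datacenter-to-replication-factor mapping that appears in the diff below):

    # Illustration only: keep just the hosts whose DC has a replication factor
    # configured, so later replica-selection loops never touch irrelevant DCs.
    def hosts_in_replicated_dcs(hosts, dc_rf_map):
        return [host for host in hosts if host.datacenter in dc_rf_map]

    # e.g. with dc_rf_map = {"dc1": 3, "dc2": 2}, any host in "dc3" is dropped.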
if token_offset_index >= len(token_offsets):
    token_offset_index = token_offset_index - len(token_offsets)
if replicas_remaining == 0 or num_replicas_this_dc == num_hosts_this_dc:
    break
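For readers unfamiliar with the pattern, the wrap-around in the snippet above is a standard ring walk. A generic, standalone sketch (not the driver's code; names are illustrative):

    # Manual wrap-around, equivalent to taking the index modulo num_items each step.
    def walk_ring(ring, start, steps):
        num_items = len(ring)
        index = start
        visited = []
        for _ in range(steps):
            visited.append(ring[index])
            index += 1
            if index >= num_items:   # wrap back to the start of the ring
                index -= num_items
        return visited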
Moving this check before the token_offset calculation creates a logic error. The original code checked this condition after getting the host, which ensures we don't break the loop prematurely when we haven't actually found enough unique hosts yet. This change could cause the algorithm to stop searching for replicas too early.
Huh?
I don't understand this. The whole point is to optionally avoid a useless assignment to token_offset_index when it isn't needed in a given cycle - and I assume in some cycle it won't be.
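To make the disagreement concrete, here is a toy version of the two orderings being debated (purely illustrative names; this is not the driver's code):

    # Toy illustration of the ordering question discussed above.
    def take_then_check(candidates, needed):
        picked = []
        for item in candidates:
            picked.append(item)
            if len(picked) == needed:    # condition checked after taking the item
                break
        return picked

    def check_then_take(candidates, needed):
        picked = []
        for item in candidates:
            if len(picked) == needed:    # condition checked before taking the item
                break
            picked.append(item)
        return picked

    # Both return the same result for a plain list of candidates; the difference is
    # only where the stop condition is evaluated relative to consuming a candidate,
    # which is exactly the subtlety the review comment is pointing at.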
    dc_racks[host.datacenter].add(host.rack)
    hosts_per_dc[host.datacenter].add(host)
host_dc = host.datacenter
if host_dc in dc_rf_map:
This filtering logic is good for performance, but it changes the behavior of the original code. The original code populated dc_to_token_offset for all datacenters, then filtered later. This early filtering means dc_to_token_offset will only contain DCs that are in dc_rf_map, which could affect downstream logic that expects all DCs to be present.
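A small, generic illustration of the filter-early versus filter-late trade-off described here (the dictionaries are made up for the example):

    # Illustration only: filtering while building a map vs. filtering at the point of use.
    all_dcs = {"dc1": [10, 20], "dc2": [30], "dc3": [40]}
    wanted = {"dc1", "dc2"}

    # Filter late: build the full map, skip unwanted DCs when consuming it.
    full_map = dict(all_dcs)
    late = {dc: offsets for dc, offsets in full_map.items() if dc in wanted}

    # Filter early: the map never contains the unwanted DCs at all.
    early = {dc: offsets for dc, offsets in all_dcs.items() if dc in wanted}

    # Same end result for this consumer, but any other code reading full_map would
    # still see "dc3", whereas the early-filtered map omits it, which is the concern
    # raised above.
    assert late == early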
Huh?
Copilot means that your change may affect other places that assume dc_to_token_offset isn't filtered.
I guess it can't verify that by itself as part of the review.
The whole point is to skip adding what would be skipped later anyway (see the deleted lines 647-648 below).
@scylladb/python-driver-maint - please re-review. I hope I did not mess things up; Copilot is being silly here, I suspect. I'm not sure why the whole CI is broken - it doesn't seem to be caused by my changes.
Pre-review checklist
- I have adjusted the documentation in ./docs/source/.
- I added appropriate Fixes: annotations to the PR description.