Skip to content

[ENH]: Propagate cancellations to NAC queue #5129

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: graphite-base/5129
Choose a base branch
from

Conversation

tanujnay112
Copy link
Contributor

@tanujnay112 tanujnay112 commented Jul 21, 2025

Description of changes

Summarize the changes made by this PR.

  • Improvements & Bug fixes
    • When a future running a NAC request gets dropped, it "removes" its priority from the pending request if it hasn't been sent out to object storage yet.
    • The priority data structure for in-flight requests has been updated from a binary value of priorities to a set of counters, denoting how many tasks requested each priority level.
  • New functionality
    • Ideally, the above changes should make it such that high priority NAC requests get lowered in priority when the calling requests are cancelled/timed out.

Test plan

How are these changes tested?

  • Tests pass locally with pytest for python, yarn test for js, cargo test for rust

Migration plan

N/A

Observability plan

It would be a good idea to add a metric that tracks the number of total waiters on the NAC layer.

Documentation Changes

Are all docstrings for user-facing APIs updated if required? Do we need to make documentation changes in the docs section?

Copy link
Contributor Author

tanujnay112 commented Jul 21, 2025

Copy link

Reviewer Checklist

Please leverage this checklist to ensure your code review is thorough before approving

Testing, Bugs, Errors, Logs, Documentation

  • Can you think of any use case in which the code does not behave as intended? Have they been tested?
  • Can you think of any inputs or external events that could break the code? Is user input validated and safe? Have they been tested?
  • If appropriate, are there adequate property based tests?
  • If appropriate, are there adequate unit tests?
  • Should any logging, debugging, tracing information be added or removed?
  • Are error messages user-friendly?
  • Have all documentation changes needed been made?
  • Have all non-obvious changes been commented?

System Compatibility

  • Are there any potential impacts on other parts of the system or backward compatibility?
  • Does this change intersect with any items on our roadmap, and if so, is there a plan for fitting them together?

Quality

  • Is this code of a unexpectedly high quality (Readability, Modularity, Intuitiveness)

@tanujnay112 tanujnay112 marked this pull request as ready for review July 21, 2025 04:04
Copy link
Contributor

propel-code-bot bot commented Jul 21, 2025

Propagate Cancellations to NAC Queue and Improve Priority Management

This PR introduces substantial changes to the admission control (NAC) layer for S3-backed storage. Cancellations of in-flight NAC requests are now properly propagated, so when a future is dropped (i.e., caller cancels or times out), its associated priority is removed from the pending request's priority counters. The handling of priorities has been overhauled: rather than a single binary priority flag, a thread-safe set of per-level counters (PriorityHolder) now tracks how many outstanding tasks require each priority level. This allows the effective priority of a request to be dynamically lowered if all high-priority waiters are cancelled. The refactor introduces several new types (e.g., PriorityHolder, RollbackPriorityOnDrop), modifies all relevant internal APIs to use them, and consistently updates the codebase to maintain correct lock and notification semantics.

Key Changes

• Replaced atomic priority flag with a PriorityHolder struct using atomic counters for each priority level.
• Added logic to remove (rollback) a request's priority upon cancellation or future drop, ensuring correct request de-prioritization.
• Refactored in-flight request map to store Arc to facilitate shared state for cancellation management.
• Updated APIs, including parallel reads, plain reads, writes, and semaphore-based rate limiting, to utilize the new priority management constructs.
• Implemented RollbackPriorityOnDrop RAII pattern for automatic cleanup of priority claims and notification of waiting tasks.
• Updated all usages of rate limiter admission and related APIs to operate on the new priority structures.

Affected Areas

• rust/storage/src/admissioncontrolleds3.rs
• Request deduplication and coalescing logic
• Priority and rate limiting for S3 operations
• Internal APIs and data structures for priority propagation and cleanup

This summary was automatically generated by @propel-code-bot

@tanujnay112 tanujnay112 force-pushed the nac_cancel branch 2 times, most recently from 074d5a5 to fa0aa59 Compare July 24, 2025 02:41
@tanujnay112 tanujnay112 requested a review from sanketkedia July 25, 2025 06:23
@tanujnay112 tanujnay112 changed the base branch from task_cancellation to graphite-base/5129 July 25, 2025 17:59
let current_priority: StorageRequestPriority = current_priority.into();
// Unwrap shouldn't fail here as update_priority should have been called before
// entering this.
let current_priority = priority.get_priority().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[BestPractice]

Inconsistent error handling: The unwrap() call on line 717 could cause a panic if get_priority() returns None. While the comment suggests this shouldn't happen, defensive programming would handle this case gracefully.

Suggested change
let current_priority = priority.get_priority().unwrap();
let current_priority = match priority.get_priority() {
Some(pri) => pri,
None => {
tracing::error!("Priority holder has no active priorities during acquire");
return self.remaining_tokens[StorageRequestPriority::lowest().as_usize()]
.acquire()
.await
.expect("Failed to acquire semaphore token");
}
};

Committable suggestion

Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants