
RFC: Delayed cancellation of timer #7384


Open
ADD-SP opened this issue Jun 4, 2025 · 1 comment
Labels
A-tokio (Area: The main tokio crate) · C-proposal (Category: a proposal and request for comments) · M-time (Module: tokio/time)

Comments

ADD-SP (Contributor) commented Jun 4, 2025

Summary

Please check out #6504 for initial context.

This design reduces lock contention in the time module, which improves scalability for async servers that read/write sockets with timeouts.

  • Use a local wheel for each worker thread instead of a global wheel.
  • Do not register the timer into the wheel on creation.
  • Register the timer into the local wheel on the first poll.
  • If a timer is canceled by its local thread, cancel it by removing it from the local wheel directly.
  • If a timer is canceled by a remote thread, cancel it by sending the intrusive list entry to the owning worker thread using std::sync::mpsc.
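To make the ownership boundaries concrete, here is a minimal sketch of the data layout this summary implies. All names (Core, LocalWheel, CancelRequest, TimerHandle, ...) are hypothetical placeholders, not tokio's actual types.

```rust
use std::sync::mpsc;

// Hypothetical placeholder types; the real implementation uses an intrusive
// wheel and different names.
struct LocalWheel {
    /* per-worker timer wheel, only touched by the owning worker thread */
}

struct CancelRequest {
    /* handle to the intrusive-list entry of the timer being canceled */
}

// Each worker's Core owns its wheel plus the receiving end of a channel
// for cancellations that arrive from other threads.
struct Core {
    wheel: LocalWheel,
    remote_cancels: mpsc::Receiver<CancelRequest>,
}

// Each timer keeps a Sender to the worker that owns it, so a remote thread
// can hand the cancellation back instead of locking a shared wheel.
struct TimerHandle {
    owner: mpsc::Sender<CancelRequest>,
}
```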

Current Implementation

You can skip it if you are already familiar with the codebase.

There is a global wheel protected by a std::sync::Mutex, so we have to acquire the Mutex before registering or cancelling a timer.

Resetting a timer doesn't acquire the Mutex as it just updates an AtomicU64.
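For readers less familiar with the code, a rough, simplified sketch of this locking pattern (illustrative names and layout, not tokio's actual internals):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

// Simplified sketch of the current design: a single wheel behind one Mutex.
struct Wheel {
    /* buckets of timers, omitted */
}

struct TimeDriver {
    // Every registration and cancellation takes this lock, from whichever
    // worker thread the timer future happens to run on.
    wheel: Mutex<Wheel>,
}

struct TimerState {
    // Packed state/deadline word; resetting a timer only stores into this
    // atomic, so it can skip the Mutex.
    state: AtomicU64,
}

impl TimerState {
    fn reset(&self, new_deadline_ticks: u64) {
        self.state.store(new_deadline_ticks, Ordering::Release);
    }
}
```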

What's the problem?

Worker threads can concurrently process the sockets they own, which is scalable.

However, since timers are heavily used for socket timeouts in async servers and clients, reading or writing an async socket with a timeout introduces a lot of lock contention.

What have we tried?

commit: time: use sharding for timer implementation

This commit splits the global wheel into several wheels (indexed by thread_id), each wrapped in a Mutex in the Runtime:

  • To calculate the earliest timer, it locks all wheels one-by-one while parking the driver.
  • To register/cancel a timer, it locks the specific wheel and inserts/removes the timer.

Unfortunately, this was reverted by 1ae9434 because the overhead of locking multiple Mutexes in hot paths turned out to be much more expensive than the global lock contention.
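As a rough illustration of why parking got expensive, the sharded design had to do something like the following when computing the earliest deadline (hypothetical names; the real code differs):

```rust
use std::sync::Mutex;
use std::time::Instant;

// Illustrative sketch of the reverted sharded design: one wheel per shard,
// each behind its own Mutex.
struct Shard {
    next_expiration: Option<Instant>,
    /* timers, omitted */
}

struct ShardedDriver {
    shards: Vec<Mutex<Shard>>,
}

impl ShardedDriver {
    // Parking the driver needs the globally earliest deadline, so it locks
    // every shard one-by-one. This repeated locking on a hot path turned
    // out to be more expensive than contending on a single global Mutex.
    fn next_expiration(&self) -> Option<Instant> {
        self.shards
            .iter()
            .filter_map(|shard| shard.lock().unwrap().next_expiration)
            .min()
    }
}
```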

How does the kernel handle this case?

I'm not a kernel developer, please correct me if I'm wrong.

For non-hrtimers, all timers are registered into a per-CPU timer wheel, and the kernel also locks the per-CPU wheel when registering/canceling a timer.

This looks just like what we did in time: use sharding for timer implementation (which has already been reverted).

However, I think the big difference is that the kernel uses a spinlock for the per-CPU wheel, which is much more efficient than a user-space Mutex since the kernel controls the scheduler and interrupts.

With the benefits of spinlocks in kernel context, the lock contention should be much lower than with a user-space Mutex.

So we cannot reference the kernel's design directly.

How does Golang handle this case?

I'm not a Golang expert, please correct me if I'm wrong.

Golang uses a per-GOPROC 4-ary heap to register all timers.

Golang marks a timer as deleted when canceling it, and marked timers are removed while registering a new timer if they sit at the top of the heap.

Because removal of timers from the per-GOPROC heap is deferred, registering a new timer doesn't need to acquire the lock.

The cost of this design is that if too many (>= 1/4) marked timers are not at the top of the heap, the GOPROC must be stopped to scan the entire 4-ary heap, which triggers an O(n log n) operation at an arbitrary point in time and could lead to task starvation.

This might not be acceptable in tokio.
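For concreteness, here is a rough, self-contained illustration of the lazy-deletion idea described above. This is not Go's actual runtime code; a binary heap stands in for the 4-ary heap and the 1/4 threshold is applied directly.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

// Canceled timers are only marked; they are removed when they reach the top
// of the heap, or by a full rebuild once too many marked entries pile up.
struct LazyTimerHeap {
    heap: BinaryHeap<Reverse<(u64, usize)>>, // (deadline, timer id), min-heap
    canceled: Vec<bool>,                     // canceled[id] => marked as deleted
    marked: usize,
}

impl LazyTimerHeap {
    // O(1): cancellation only sets a flag, no heap reshuffling.
    fn cancel(&mut self, id: usize) {
        if !self.canceled[id] {
            self.canceled[id] = true;
            self.marked += 1;
        }
    }

    // Pop the next expired, still-live timer; marked timers at the top are
    // dropped lazily on the way.
    fn pop_expired(&mut self, now: u64) -> Option<usize> {
        while let Some(&Reverse((deadline, id))) = self.heap.peek() {
            if self.canceled[id] {
                self.heap.pop();
                self.marked -= 1;
            } else if deadline <= now {
                self.heap.pop();
                return Some(id);
            } else {
                break;
            }
        }
        None
    }

    // If too many marked timers linger deep in the heap, rebuild it. This is
    // the O(n log n) scan that can stall everything else at an arbitrary time.
    fn maybe_cleanup(&mut self) {
        if !self.heap.is_empty() && self.marked * 4 >= self.heap.len() {
            let canceled = &self.canceled;
            let live: Vec<_> = self
                .heap
                .drain()
                .filter(|&Reverse((_, id))| !canceled[id])
                .collect();
            self.heap = live.into_iter().collect();
            self.marked = 0;
        }
    }
}
```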

Proposed design

The initial idea came from @carllerche.

What does the new TimerEntry look like?

```rust
struct TimerEntry {
    // -- omitted --
    deadline: Instant,
    node: Option<NonNull<TimerShared>>, // stored in the intrusive list
    // -- omitted --
}
```
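A plausible (purely hypothetical) shape for the node behind that NonNull<TimerShared>, consistent with the AtomicU64 marking and AtomicWaker wake-up described below; the real layout may differ. AtomicWaker here is the one from the futures crate, assumed as a stand-in for tokio's internal equivalent.

```rust
use std::ptr::NonNull;
use std::sync::atomic::AtomicU64;

use futures::task::AtomicWaker; // stand-in for tokio's internal AtomicWaker

// Hypothetical sketch of the intrusive node; the real layout may differ.
struct TimerShared {
    // State word, used e.g. to mark the timer as canceled.
    state: AtomicU64,
    // Waker registered on first poll and fired when the timer expires.
    waker: AtomicWaker,
    // Intrusive linked-list pointers used by the local wheel.
    prev: Option<NonNull<TimerShared>>,
    next: Option<NonNull<TimerShared>>,
}
```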

Where is the local wheel stored?

It is stored in the Core, which also stores the queue::Local.

How to register a new timer?

  1. Retrieve the runtime handle using the thread-local CONTEXT.
    1. Panic on error (such as no context).
  2. Retrieve the local worker Context.
    1. Register the timer into the local wheel if the Core is present.
    2. Otherwise, push it into the inject timer list; this list is drained while maintaining the local core.
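A minimal, self-contained sketch of that decision flow. All names (CONTEXT, Core, inject_timers, ...) are placeholders for illustration, not tokio's actual internals, and a plain Vec stands in for the wheel.

```rust
use std::cell::RefCell;
use std::sync::{Arc, Mutex};

type TimerNode = u64; // stand-in for the intrusive-list node

#[derive(Default)]
struct Wheel {
    timers: Vec<TimerNode>,
}

#[derive(Default)]
struct Core {
    wheel: Wheel,
}

#[derive(Default)]
struct Handle {
    // Timers registered from a thread that doesn't currently hold the Core;
    // drained when the owning worker maintains its core.
    inject_timers: Mutex<Vec<TimerNode>>,
}

thread_local! {
    // (runtime handle, worker core if this thread currently owns one)
    static CONTEXT: RefCell<Option<(Arc<Handle>, Option<Core>)>> =
        RefCell::new(None);
}

fn register_timer(node: TimerNode) {
    CONTEXT.with(|ctx| {
        let mut ctx = ctx.borrow_mut();
        // 1. Panic if there is no runtime context, like other time APIs.
        let (handle, core) = ctx.as_mut().expect("no runtime context");
        match core {
            // 2.1 The Core is on this thread: insert into its local wheel.
            Some(core) => core.wheel.timers.push(node),
            // 2.2 Otherwise, push into the inject timer list for later draining.
            None => handle.inject_timers.lock().unwrap().push(node),
        }
    });
}
```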

How to cancel a timer?

  1. Mark the timer as deleted in an AtomicU64.
  2. Send the timer using a std::sync::mpsc::Sender; the Receiver is owned by the Core, and the Sender is stored in the TimerEntry when it is created.
  3. The core drains the receiver and removes the canceled timers from the local wheel.
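A self-contained sketch of those three steps. For safety and brevity it uses Arc instead of the intrusive NonNull node, and the names and state encoding are illustrative.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::{mpsc, Arc};

const STATE_CANCELED: u64 = 1; // illustrative encoding

struct TimerShared {
    state: AtomicU64,
}

struct TimerEntry {
    shared: Arc<TimerShared>,
    // Sender to the owning worker, stored when the timer is created.
    owner: mpsc::Sender<Arc<TimerShared>>,
}

impl TimerEntry {
    fn cancel(&self) {
        // 1. Mark the timer as deleted so it will never fire.
        self.shared.state.store(STATE_CANCELED, Ordering::Release);
        // 2. Hand the entry back to the owning worker; if that worker is
        //    gone, the send fails and the timer is simply dropped.
        let _ = self.owner.send(self.shared.clone());
    }
}

// 3. Run by the owning worker while maintaining its core: drain the
//    receiver and remove each canceled timer from the local wheel.
fn drain_cancellations(
    rx: &mpsc::Receiver<Arc<TimerShared>>,
    mut remove_from_wheel: impl FnMut(&TimerShared),
) {
    for shared in rx.try_iter() {
        remove_from_wheel(&shared);
    }
}
```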

How to wake up a timer?

Timers registered in the current worker thread's local wheel are polled while maintaining the local core, and an AtomicWaker is used to wake the waiting task.
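A simplified sketch of that wake-up pass; a plain Vec stands in for the wheel, and AtomicWaker is assumed to be the one from the futures crate (tokio has an internal equivalent).

```rust
use std::time::Instant;

use futures::task::AtomicWaker; // stand-in for tokio's internal AtomicWaker

struct LocalTimer {
    deadline: Instant,
    waker: AtomicWaker, // registered by the timer future on first poll
}

// Called by the worker while maintaining its core: fire and drop every
// timer in the local wheel whose deadline has passed.
fn process_expired(now: Instant, local_wheel: &mut Vec<LocalTimer>) {
    local_wheel.retain(|timer| {
        if timer.deadline <= now {
            // Wake the task awaiting this timer; its next poll observes
            // that the deadline has elapsed.
            timer.waker.wake();
            false // remove the expired entry from the wheel
        } else {
            true
        }
    });
}
```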

Overhead of the extra memory allocation

Compared to the current design, this design requires an extra memory allocation for the node that is stored in the intrusive list.

This should not be a big concern: modern allocators scale well across cores, so the extra allocation should be cheaper than the global lock contention.

FAQ

Will be added as the discussion progresses.

@Darksonn added the A-tokio (Area: The main tokio crate), C-proposal (Category: a proposal and request for comments), and M-time (Module: tokio/time) labels on Jun 4, 2025
Darksonn (Contributor) commented Jun 6, 2025

I think the biggest question here is how wakeups are carried out. Right now, the single worker thread that sleeps on epoll does so with a timeout equal to the globally smallest timer, which meant that whenever the thread goes to sleep on epoll, it has to lock every shard to determine the smallest timer. The question is ... what is your proposal here? Do we change it so that each worker thread sleeps with a timeout equal to the smallest timer on its own local queue?

Currently, the runtime continues working okay if one of your worker threads gets blocked, and people might rely on that. Yes, there are some caveats to this such as the LIFO slot, but I suspect people still rely on it working.
