scx_p2dq: Prefer dispatching into local CPU DSQs #2307
base: main
Conversation
Benchmarks
full saturation: [benchmark chart] vs [benchmark chart]
Still lagging a bit at saturation compared to eevdf, and performance is still slightly lower than when there were DSQs per slice. I may bring those DSQs back, as I think they provided better load balancing.
Makes total sense!
Left a comment but overall LGTM.
u64 last_run_started;
u64 last_run_at;
u64 llc_runs; /* how many runs on the current LLC */
int last_dsq_index;
s32 last_cpu;
Are you saving last_cpu in the task context for efficiency reasons? Why not use scx_bpf_task_cpu(p) directly?
This was roughly my thinking: if a task finds a different CPU in select that isn't idle, it won't get direct dispatched and will then hit the enqueue path. However, if the task's old CPU is idle, then it's probably the best CPU to use. Does scx_bpf_task_cpu(p) return the previous CPU in enqueue if a CPU has been selected in select?
Nope, it returns the CPU selected in ops.select_cpu(). But in your case it probably doesn't make any difference, because if the task is not directly dispatched you return prev_cpu from ops.select_cpu(), which is the previously running CPU.
However, if ops.select_cpu() can return a different CPU without doing direct dispatch, you're going to ignore that hint and use the prev running CPU anyway.
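For illustration, a minimal sketch of those semantics (not p2dq's actual code; it uses the upstream scx_bpf_select_cpu_dfl() and scx_bpf_dsq_insert() kfuncs and a made-up ops name):

s32 BPF_STRUCT_OPS(sketch_select_cpu, struct task_struct *p,
                   s32 prev_cpu, u64 wake_flags)
{
        bool is_idle = false;
        s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

        if (is_idle) {
                /* Direct dispatch: the task skips ops.enqueue() entirely. */
                scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
                return cpu;
        }

        /*
         * No direct dispatch: returning prev_cpu means that in
         * ops.enqueue() scx_bpf_task_cpu(p) == prev_cpu, so a cached
         * last_cpu carries no extra information on this path. Only if
         * select returns a third CPU here would the two diverge.
         */
        return prev_cpu;
}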
// If the last CPU is idle just re-enqueue to the CPU's local DSQ.
// This should reduce the number of migrations.
if (scx_bpf_dsq_nr_queued(taskc->last_cpu_dsq_id) == 0 &&
    scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON|taskc->last_cpu) == 0) {
I wonder if this check is necessary, or if it should do direct dispatch in this case instead.
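As a rough sketch of that alternative (hypothetical, not a concrete proposal; p and taskc come from the surrounding enqueue path, and slice_ns/enq_flags stand in for whatever it already uses), the two emptiness checks could become an idle test plus a direct insert:

/* Hypothetical variant: dispatch straight to the last CPU's local DSQ
 * whenever that CPU is verifiably idle, instead of checking queue depths. */
if (scx_bpf_test_and_clear_cpu_idle(taskc->last_cpu)) {
        scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | taskc->last_cpu,
                           slice_ns, enq_flags);
        /* Wake the CPU in case it is sitting in the idle loop. */
        scx_bpf_kick_cpu(taskc->last_cpu, SCX_KICK_IDLE);
        return;
}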
Overall, the logic makes sense to me. One thing that might need to be considered: I guess the fast path to the local DSQ might incur a priority-inversion-like situation, running a less-critical task from the local DSQ first while there is a more critical task on the shared DSQ.
Yeah, that makes sense; I didn't think/test that much. Maybe it's best to put this behind a CLI flag and add some stats.
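A sketch of that gating, using the common scx pattern of a const volatile rodata knob set by the userspace loader before attach (local_fastpath and nr_local_fastpath are made-up names here, not fields in this PR):

/* Hypothetical CLI-controlled knob, written by userspace before load. */
const volatile bool local_fastpath = false;

/* Hypothetical stat counter, read periodically by the userspace side. */
u64 nr_local_fastpath;

/* ... in the enqueue path ... */
if (local_fastpath &&
    scx_bpf_dsq_nr_queued(taskc->last_cpu_dsq_id) == 0 &&
    scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | taskc->last_cpu) == 0) {
        __sync_fetch_and_add(&nr_local_fastpath, 1);
        /* ... take the local-DSQ fast path ... */
}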
Force-pushed from d6d44b4 to 344b69e
It feels like a hack, but could you address this by truncating the slice of the less critical task in ops.tick()? You would have to sample the head of the DSQ to decide, but it could help bound the delay in case you hit this situation.
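A sketch of that idea, assuming a single shared DSQ id and no peek kfunc (so a non-empty shared DSQ is used as a stand-in for "a more critical task is waiting"):

void BPF_STRUCT_OPS(sketch_tick, struct task_struct *p)
{
        u64 shared_dsq = 0; /* placeholder for the real shared DSQ id */

        /*
         * Without a DSQ peek we can't compare criticality against the
         * head task, so approximate by truncating the running task's
         * slice whenever the shared DSQ is non-empty. The task then
         * gets scheduled out at the next opportunity, bounding the
         * inversion to roughly one tick.
         */
        if (scx_bpf_dsq_nr_queued(shared_dsq) > 0)
                p->scx.slice = 0;
}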
I might try this when the DSQ peek is implemented or the migration to arena DSQs is done.
If we do move to local ATQs, would a vtime adjustment call be enough to avoid this issue? I'm thinking we'd basically keep per-CPU local ATQs, from which dispatch pulls to the local DSQ at the last minute. We won't be able to do direct dispatch, though.
When the local CPU is available, prefer dispatching into the per-CPU local DSQ. This gives slightly better locality and reduces the number of CPU migrations. Signed-off-by: Daniel Hodges <[email protected]>
Force-pushed from 344b69e to d4cbb30