
scx_p2dq: Prefer dispatching into local CPU DSQs #2307


Open: hodgesds wants to merge 1 commit into main from p2dq-local-cpu-enq

Conversation

@hodgesds (Contributor) commented Jul 1, 2025

When the local CPU is available, prefer dispatching into the per-CPU local DSQ. This gives slightly better locality and reduces the number of CPU migrations.
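
For context, a minimal sketch of the idea (not the actual patch; lookup_task_ctx() and shared_dsq_id() are hypothetical helpers, the rest are standard sched_ext kfuncs):

#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(p2dq_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Hypothetical helper returning the per-task context. */
	struct task_ctx *taskc = lookup_task_ctx(p);
	s32 cpu = taskc ? taskc->last_cpu : scx_bpf_task_cpu(p);

	/* If the CPU the task last ran on is idle, keep the task local. */
	if (scx_bpf_test_and_clear_cpu_idle(cpu)) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
				   SCX_SLICE_DFL, enq_flags);
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
		return;
	}

	/* Otherwise fall back to the shared DSQ (hypothetical helper). */
	scx_bpf_dsq_insert(p, shared_dsq_id(cpu), SCX_SLICE_DFL, enq_flags);
}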

@hodgesds (Author) commented Jul 1, 2025

Benchmarks

schbench at partial saturation (scx_p2dq with this change):

Wakeup Latencies percentiles (usec) runtime 30 (s) (388965 total samples)
          50.0th: 7          (101994 samples)
          90.0th: 11         (133322 samples)
        * 99.0th: 15         (24378 samples)
          99.9th: 38         (2372 samples)
          min=1, max=1674
Request Latencies percentiles (usec) runtime 30 (s) (389566 total samples)
          50.0th: 6984       (118384 samples)
          90.0th: 10896      (152793 samples)
        * 99.0th: 12848      (34926 samples)
          99.9th: 13552      (3427 samples)
          min=6041, max=19590
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 12912      (7 samples)
        * 50.0th: 12976      (9 samples)
          90.0th: 13040      (12 samples)
          min=12782, max=13086
average rps: 12985.53

eevdf:

Wakeup Latencies percentiles (usec) runtime 30 (s) (383243 total samples)
          50.0th: 5          (151966 samples)
          90.0th: 7          (129109 samples)
        * 99.0th: 10         (18750 samples)
          99.9th: 21         (2377 samples)
          min=1, max=4062
Request Latencies percentiles (usec) runtime 30 (s) (384063 total samples)
          50.0th: 6936       (120482 samples)
          90.0th: 12368      (146940 samples)
        * 99.0th: 13232      (34532 samples)
          99.9th: 13776      (3350 samples)
          min=5871, max=31208
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 12688      (7 samples)
        * 50.0th: 12880      (10 samples)
          90.0th: 13008      (12 samples)
          min=11721, max=13048
average rps: 12802.10

full saturation (scx_p2dq with this change):

Wakeup Latencies percentiles (usec) runtime 30 (s) (192812 total samples)
          50.0th: 21         (54740 samples)
          90.0th: 125        (77161 samples)
        * 99.0th: 999        (17297 samples)
          99.9th: 1458       (1735 samples)
          min=1, max=5341
Request Latencies percentiles (usec) runtime 30 (s) (193131 total samples)
          50.0th: 13136      (57944 samples)
          90.0th: 14640      (76456 samples)
        * 99.0th: 24544      (17182 samples)
          99.9th: 37696      (1731 samples)
          min=6539, max=70594
RPS percentiles (requests) runtime 30 (s) (15 total samples)
          20.0th: 12784      (6 samples)
        * 50.0th: 12816      (3 samples)
          90.0th: 12912      (6 samples)
          min=12605, max=12910
average rps: 12797.00

vs eevdf:

Wakeup Latencies percentiles (usec) runtime 30 (s) (389464 total samples)
          50.0th: 7          (74153 samples)
          90.0th: 43         (155890 samples)
        * 99.0th: 1942       (34208 samples)
          99.9th: 3948       (3497 samples)
          min=1, max=16199
Request Latencies percentiles (usec) runtime 30 (s) (390265 total samples)
          50.0th: 13136      (110352 samples)
          90.0th: 14384      (154805 samples)
        * 99.0th: 26464      (35065 samples)
          99.9th: 41024      (3481 samples)
          min=6284, max=149716
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 12912      (7 samples)
        * 50.0th: 13008      (11 samples)
          90.0th: 13136      (13 samples)
          min=12676, max=13148
average rps: 13008.83

Still lagging a bit at saturation compared to eevdf, and performance is still slightly lower than when there were DSQs per slice. I may bring those DSQs back, as I think they provided better load balancing.

@etsal (Contributor) left a comment:

Makes total sense!

@arighi (Contributor) left a comment:

Left a comment but overall LGTM.

u64 last_run_started;
u64 last_run_at;
u64 llc_runs; /* how many runs on the current LLC */
int last_dsq_index;
s32 last_cpu;
@arighi (Contributor) commented:

Are you saving last_cpu in the task context for efficiency reasons? Why not use scx_bpf_task_cpu(p) directly?

@hodgesds (Author) replied:

This was roughly my thinking: if a task finds a different CPU in ops.select_cpu() that isn't idle, it won't get direct dispatched and will then hit the enqueue path. However, if the task's old CPU is idle, then it's probably the best CPU to use. Does scx_bpf_task_cpu(p) return the previous CPU in enqueue if a CPU has been selected in ops.select_cpu()?

@arighi (Contributor) replied:

Nope, it returns the CPU selected in ops.select_cpu(). But it probably doesn't make any difference in your case, because if the task is not directly dispatched you return prev_cpu from ops.select_cpu(), which is the previously running CPU.

However, if ops.select_cpu() can return a different CPU without doing direct dispatch, you're going to ignore that hint and use the previously running CPU anyway.
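
For illustration only (a hedged sketch of the flow described above, using the default idle-CPU picker; not the patch itself): when ops.select_cpu() does not direct dispatch and returns prev_cpu, a later scx_bpf_task_cpu(p) in ops.enqueue() already reports that same previous CPU, which is why caching last_cpu only matters if select_cpu can return a different CPU without dispatching.

s32 BPF_STRUCT_OPS(p2dq_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

	if (is_idle) {
		/* Direct dispatch: ops.enqueue() is skipped for this wakeup. */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return cpu;
	}

	/*
	 * No direct dispatch: returning prev_cpu means scx_bpf_task_cpu(p)
	 * in ops.enqueue() will report prev_cpu as well.
	 */
	return prev_cpu;
}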

// If the last CPU is idle, just re-enqueue to that CPU's local DSQ.
// This should reduce the number of migrations.
if (scx_bpf_dsq_nr_queued(taskc->last_cpu_dsq_id) == 0 &&
    scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | taskc->last_cpu) == 0) {
@hodgesds (Author) commented:

I wonder if this check is necessary, or if it should do direct dispatch in this case instead.

@multics69 (Contributor) left a comment:

Overall, the logic makes sense to me. One thing that might need to be considered: the fast path to the local DSQ might incur a priority-inversion-like situation, running a less-critical task from the local DSQ first while a more critical task sits on the shared DSQ.

@hodgesds (Author) commented Jul 2, 2025

> One thing that might need to be considered: the fast path to the local DSQ might incur a priority-inversion-like situation, running a less-critical task from the local DSQ first while a more critical task sits on the shared DSQ.

Yeah, that makes sense; I didn't think about or test that case much. Maybe it's best to put this behind a CLI flag and add some stats.
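
A sketch of what gating this behind a knob could look like on the BPF side (names are hypothetical and the userspace flag/stat plumbing is omitted; const volatile globals set via the skeleton's rodata before load are the usual mechanism):

/* Hypothetical tunable, set from the CLI before the skel is loaded. */
const volatile bool local_cpu_enq_enabled = false;

/* Hypothetical counter, exposed through the scheduler's existing stats. */
static u64 local_cpu_enq_cnt;

static __always_inline bool try_local_cpu_enqueue(struct task_struct *p,
						  s32 cpu, u64 enq_flags)
{
	if (!local_cpu_enq_enabled || !scx_bpf_test_and_clear_cpu_idle(cpu))
		return false;

	__sync_fetch_and_add(&local_cpu_enq_cnt, 1);
	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, enq_flags);
	scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
	return true;
}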

@hodgesds force-pushed the p2dq-local-cpu-enq branch 3 times, most recently from d6d44b4 to 344b69e on July 7, 2025 15:27
@kkdwivedi (Contributor) commented:

> Overall, the logic makes sense to me. One thing that might need to be considered: the fast path to the local DSQ might incur a priority-inversion-like situation, running a less-critical task from the local DSQ first while a more critical task sits on the shared DSQ.

It feels like a hack, but could you address this by truncating the slice of the less critical task in ops.tick()? You would have to sample the head of the DSQ to decide, but it could help bound the delay if you do hit this case.
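
A rough sketch of that mitigation (hedged: shared_dsq_id() is a placeholder, and scx_bpf_dsq_nr_queued() is a crude stand-in for actually sampling the head of the DSQ, which isn't available yet as noted below):

void BPF_STRUCT_OPS(p2dq_tick, struct task_struct *p)
{
	s32 cpu = scx_bpf_task_cpu(p);
	u64 shared_dsq = shared_dsq_id(cpu);	/* hypothetical helper */

	/*
	 * If work is waiting on the shared DSQ while a locally fast-pathed
	 * task is running, truncate its slice so it yields at the next
	 * scheduling point, bounding the inversion window.
	 */
	if (scx_bpf_dsq_nr_queued(shared_dsq) > 0)
		p->scx.slice = 0;
}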

@hodgesds (Author) commented Jul 8, 2025

> It feels like a hack, but could you address this by truncating the slice of the less critical task in ops.tick()?

I might try this when the DSQ peek is implemented or the migration to arena DSQs is done.

@etsal (Contributor) commented Jul 8, 2025

> It feels like a hack, but could you address this by truncating the slice of the less critical task in ops.tick()?
>
> I might try this when the DSQ peek is implemented or the migration to arena DSQs is done.

If we do move to local ATQs, would a vtime adjustment call be enough to avoid this issue? I'm thinking we basically keep per-local ATQs, from which dispatch pulls into the local DSQ at the last minute. We won't be able to do direct dispatch, though.

Commit message:

When the local CPU is available, prefer dispatching into the per-CPU
local DSQ. This gives slightly better locality and reduces the number of
CPU migrations.

Signed-off-by: Daniel Hodges <[email protected]>
@hodgesds force-pushed the p2dq-local-cpu-enq branch from 344b69e to d4cbb30 on July 15, 2025 21:28