
scx_p2dq: Prefer dispatching into local CPU DSQs #2307


Open: hodgesds wants to merge 1 commit into main from p2dq-local-cpu-enq

Conversation

@hodgesds (Contributor) commented Jul 1, 2025

When the local CPU is available, prefer dispatching into the per-CPU local DSQ. This gives slightly better locality and reduces the number of CPU migrations.
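
For context, a minimal sketch of the idea (not the actual patch; lookup_task_ctx() and shared_dsq_id() are hypothetical helpers, the rest are standard sched_ext kfuncs):

#include <scx/common.bpf.h>

void BPF_STRUCT_OPS(p2dq_enqueue, struct task_struct *p, u64 enq_flags)
{
	/* Hypothetical helper returning the per-task context. */
	struct task_ctx *taskc = lookup_task_ctx(p);
	s32 cpu = taskc ? taskc->last_cpu : scx_bpf_task_cpu(p);

	/* If the CPU the task last ran on is idle, keep the task local. */
	if (scx_bpf_test_and_clear_cpu_idle(cpu)) {
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu,
				   SCX_SLICE_DFL, enq_flags);
		scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
		return;
	}

	/* Otherwise fall back to the shared DSQ (hypothetical helper). */
	scx_bpf_dsq_insert(p, shared_dsq_id(cpu), SCX_SLICE_DFL, enq_flags);
}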

@hodgesds (Author) commented Jul 1, 2025

Benchmarks

schbench at partial saturation (scx_p2dq with this change):

Wakeup Latencies percentiles (usec) runtime 30 (s) (388965 total samples)
          50.0th: 7          (101994 samples)
          90.0th: 11         (133322 samples)
        * 99.0th: 15         (24378 samples)
          99.9th: 38         (2372 samples)
          min=1, max=1674
Request Latencies percentiles (usec) runtime 30 (s) (389566 total samples)
          50.0th: 6984       (118384 samples)
          90.0th: 10896      (152793 samples)
        * 99.0th: 12848      (34926 samples)
          99.9th: 13552      (3427 samples)
          min=6041, max=19590
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 12912      (7 samples)
        * 50.0th: 12976      (9 samples)
          90.0th: 13040      (12 samples)
          min=12782, max=13086
average rps: 12985.53

eevdf:

Wakeup Latencies percentiles (usec) runtime 30 (s) (383243 total samples)
          50.0th: 5          (151966 samples)
          90.0th: 7          (129109 samples)
        * 99.0th: 10         (18750 samples)
          99.9th: 21         (2377 samples)
          min=1, max=4062
Request Latencies percentiles (usec) runtime 30 (s) (384063 total samples)
          50.0th: 6936       (120482 samples)
          90.0th: 12368      (146940 samples)
        * 99.0th: 13232      (34532 samples)
          99.9th: 13776      (3350 samples)
          min=5871, max=31208
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 12688      (7 samples)
        * 50.0th: 12880      (10 samples)
          90.0th: 13008      (12 samples)
          min=11721, max=13048
average rps: 12802.10

full saturation (scx_p2dq with this change):

Wakeup Latencies percentiles (usec) runtime 30 (s) (192812 total samples)
          50.0th: 21         (54740 samples)
          90.0th: 125        (77161 samples)
        * 99.0th: 999        (17297 samples)
          99.9th: 1458       (1735 samples)
          min=1, max=5341
Request Latencies percentiles (usec) runtime 30 (s) (193131 total samples)
          50.0th: 13136      (57944 samples)
          90.0th: 14640      (76456 samples)
        * 99.0th: 24544      (17182 samples)
          99.9th: 37696      (1731 samples)
          min=6539, max=70594
RPS percentiles (requests) runtime 30 (s) (15 total samples)
          20.0th: 12784      (6 samples)
        * 50.0th: 12816      (3 samples)
          90.0th: 12912      (6 samples)
          min=12605, max=12910
average rps: 12797.00

vs eevdf:

Wakeup Latencies percentiles (usec) runtime 30 (s) (389464 total samples)
          50.0th: 7          (74153 samples)
          90.0th: 43         (155890 samples)
        * 99.0th: 1942       (34208 samples)
          99.9th: 3948       (3497 samples)
          min=1, max=16199
Request Latencies percentiles (usec) runtime 30 (s) (390265 total samples)
          50.0th: 13136      (110352 samples)
          90.0th: 14384      (154805 samples)
        * 99.0th: 26464      (35065 samples)
          99.9th: 41024      (3481 samples)
          min=6284, max=149716
RPS percentiles (requests) runtime 30 (s) (31 total samples)
          20.0th: 12912      (7 samples)
        * 50.0th: 13008      (11 samples)
          90.0th: 13136      (13 samples)
          min=12676, max=13148
average rps: 13008.83

Still lagging a bit at saturation compared to eevdf, and performance is still slightly lower than when there were DSQs per slice. I may bring those DSQs back, as I think they provided better load balancing.

@etsal (Contributor) left a comment:

Makes total sense!

@arighi (Contributor) left a comment:

Left a comment but overall LGTM.

u64 last_run_started;
u64 last_run_at;
u64 llc_runs; /* how many runs on the current LLC */
int last_dsq_index;
s32 last_cpu;
@arighi (Contributor) commented:

Are you saving last_cpu in the task context for efficiency reasons? Why not use scx_bpf_task_cpu(p) directly?

@hodgesds (Author) replied:

This was roughly my thinking: if a task finds a different CPU in ops.select_cpu() that isn't idle, it won't get direct dispatched and will then hit the enqueue path. However, if the task's old CPU is idle, then it's probably the best CPU to use. Does scx_bpf_task_cpu(p) return the previous CPU in enqueue if a CPU has been selected in ops.select_cpu()?

@arighi (Contributor) replied:

Nope, it returns the CPU selected in ops.select_cpu(). But it probably doesn't make any difference in your case, because if the task is not directly dispatched you return prev_cpu from ops.select_cpu(), which is the previously running CPU.

However, if ops.select_cpu() can return a different CPU without doing direct dispatch, you're going to ignore that hint and use the previously running CPU anyway.
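
For illustration only (a hedged sketch of the flow described above, using the default idle-CPU picker; not the patch itself): when ops.select_cpu() does not direct dispatch and returns prev_cpu, a later scx_bpf_task_cpu(p) in ops.enqueue() already reports that same previous CPU, which is why caching last_cpu only matters if select_cpu can return a different CPU without dispatching.

s32 BPF_STRUCT_OPS(p2dq_select_cpu, struct task_struct *p,
		   s32 prev_cpu, u64 wake_flags)
{
	bool is_idle = false;
	s32 cpu = scx_bpf_select_cpu_dfl(p, prev_cpu, wake_flags, &is_idle);

	if (is_idle) {
		/* Direct dispatch: ops.enqueue() is skipped for this wakeup. */
		scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL, SCX_SLICE_DFL, 0);
		return cpu;
	}

	/*
	 * No direct dispatch: returning prev_cpu means scx_bpf_task_cpu(p)
	 * in ops.enqueue() will report prev_cpu as well.
	 */
	return prev_cpu;
}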

// If the last CPU is idle, just re-enqueue to that CPU's local DSQ.
// This should reduce the number of migrations.
if (scx_bpf_dsq_nr_queued(taskc->last_cpu_dsq_id) == 0 &&
    scx_bpf_dsq_nr_queued(SCX_DSQ_LOCAL_ON | taskc->last_cpu) == 0) {
@hodgesds (Author) commented:

I wonder if this check is necessary, or if it should do direct dispatch in this case instead.

@multics69 (Contributor) left a comment:

Overall, the logic makes sense to me. One thing that might need to be considered: the fast path to the local DSQ might incur a priority-inversion-like situation, running a less-critical task from the local DSQ first while a more critical task sits on the shared DSQ.

@hodgesds (Author) commented Jul 2, 2025

> One thing that might need to be considered: the fast path to the local DSQ might incur a priority-inversion-like situation, running a less-critical task from the local DSQ first while a more critical task sits on the shared DSQ.

Yeah, that makes sense; I didn't think about or test that case much. Maybe it's best to put this behind a CLI flag and add some stats.
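
A sketch of what gating this behind a knob could look like on the BPF side (names are hypothetical and the userspace flag/stat plumbing is omitted; const volatile globals set via the skeleton's rodata before load are the usual mechanism):

/* Hypothetical tunable, set from the CLI before the skel is loaded. */
const volatile bool local_cpu_enq_enabled = false;

/* Hypothetical counter, exposed through the scheduler's existing stats. */
static u64 local_cpu_enq_cnt;

static __always_inline bool try_local_cpu_enqueue(struct task_struct *p,
						  s32 cpu, u64 enq_flags)
{
	if (!local_cpu_enq_enabled || !scx_bpf_test_and_clear_cpu_idle(cpu))
		return false;

	__sync_fetch_and_add(&local_cpu_enq_cnt, 1);
	scx_bpf_dsq_insert(p, SCX_DSQ_LOCAL_ON | cpu, SCX_SLICE_DFL, enq_flags);
	scx_bpf_kick_cpu(cpu, SCX_KICK_IDLE);
	return true;
}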

@hodgesds force-pushed the p2dq-local-cpu-enq branch 3 times, most recently from d6d44b4 to 344b69e on July 7, 2025 15:27
@kkdwivedi (Contributor) commented:

> Overall, the logic makes sense to me. One thing that might need to be considered: the fast path to the local DSQ might incur a priority-inversion-like situation, running a less-critical task from the local DSQ first while a more critical task sits on the shared DSQ.

It feels like a hack, but could you address this by truncating the slice of the less critical task in ops.tick()? You would have to sample the head of the DSQ to decide, but it could help bound the delay if you do hit this case.
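
A rough sketch of that mitigation (hedged: shared_dsq_id() is a placeholder, and scx_bpf_dsq_nr_queued() is a crude stand-in for actually sampling the head of the DSQ, which isn't available yet as noted below):

void BPF_STRUCT_OPS(p2dq_tick, struct task_struct *p)
{
	s32 cpu = scx_bpf_task_cpu(p);
	u64 shared_dsq = shared_dsq_id(cpu);	/* hypothetical helper */

	/*
	 * If work is waiting on the shared DSQ while a locally fast-pathed
	 * task is running, truncate its slice so it yields at the next
	 * scheduling point, bounding the inversion window.
	 */
	if (scx_bpf_dsq_nr_queued(shared_dsq) > 0)
		p->scx.slice = 0;
}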

@hodgesds (Author) commented Jul 8, 2025

> It feels like a hack, but could you address this by truncating the slice of the less critical task in ops.tick()?

I might try this when the DSQ peek is implemented or the migration to arena DSQs is done.

@etsal (Contributor) commented Jul 8, 2025

> It feels like a hack, but could you address this by truncating the slice of the less critical task in ops.tick()?
>
> I might try this when the DSQ peek is implemented or the migration to arena DSQs is done.

If we do move to local ATQs, would a vtime adjustment call be enough to avoid this issue? I'm thinking we basically keep per-local ATQs, from which dispatch pulls into the local DSQ at the last minute. We won't be able to do direct dispatch, though.

Commit message:

When the local CPU is available, prefer dispatching into the per-CPU
local DSQ. This gives slightly better locality and reduces the number of
CPU migrations.

Signed-off-by: Daniel Hodges <[email protected]>
@hodgesds force-pushed the p2dq-local-cpu-enq branch from 344b69e to d4cbb30 on July 15, 2025 21:28