Use Tokio's task budget consistently, better APIs to support task cancellation #16398

pepijnve · 2025-06-13T12:01:42Z

Which issue does this PR close?

Closes [Epic] Pipeline breaking cancellation support and improvement #16353.

Rationale for this change

RecordBatchStreamReceiver supports cooperative scheduling implicitly by using Tokio's task budget. YieldStream currently uses a custom mechanism. It would be better to use a single mechanism consistently.
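To make the idea of a task budget concrete, here is a minimal, std-only sketch of budget-based cooperative yielding. It is not Tokio's implementation and not the code in this PR: `BudgetedStream`, `BUDGET`, `CountingWaker`, and `drain` are hypothetical names, and the budget here is kept per wrapper, whereas Tokio tracks its budget per task. The wrapper hands out items until the budget is spent, then wakes itself and returns `Poll::Pending` so an executor gets a chance to run other tasks or notice cancellation.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical per-wrapper budget (an assumption for illustration only).
const BUDGET: usize = 3;

/// Wraps an always-ready source and forces a yield once the budget is spent.
struct BudgetedStream<I> {
    inner: I,
    remaining: usize,
}

impl<I: Iterator> BudgetedStream<I> {
    fn new(inner: I) -> Self {
        Self { inner, remaining: BUDGET }
    }

    fn poll_next(&mut self, cx: &mut Context<'_>) -> Poll<Option<I::Item>> {
        if self.remaining == 0 {
            // Budget exhausted: refill, ask to be polled again, and yield.
            self.remaining = BUDGET;
            cx.waker().wake_by_ref();
            return Poll::Pending;
        }
        self.remaining -= 1;
        Poll::Ready(self.inner.next())
    }
}

/// Waker that only counts how often it was woken.
struct CountingWaker(AtomicUsize);

impl Wake for CountingWaker {
    fn wake(self: Arc<Self>) {
        self.0.fetch_add(1, Ordering::SeqCst);
    }
}

/// Drives the stream to completion; returns (items produced, yields observed).
fn drain(n: u32) -> (usize, usize) {
    let counter = Arc::new(CountingWaker(AtomicUsize::new(0)));
    let waker = Waker::from(counter.clone());
    let mut cx = Context::from_waker(&waker);
    let mut stream = BudgetedStream::new(0..n);
    let mut items = 0;
    loop {
        match stream.poll_next(&mut cx) {
            Poll::Ready(Some(_)) => items += 1,
            Poll::Ready(None) => break,
            Poll::Pending => {} // a real executor would run other tasks here
        }
    }
    (items, counter.0.load(Ordering::SeqCst))
}
```

With a budget of 3, draining 7 items yields twice: once after the third item and once after the sixth.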

What changes are included in this PR?

- Renamed YieldStream and related types to CooperativeStream
- Removed configuration option which is no longer applicable
- Enabled cooperative scheduling in spill manager

Note that the implementation of CooperativeStream in this PR is suboptimal. The final implementation requires tokio-rs/tokio#7405 which I'm trying to move along as best I can.

Are these changes tested?

Covered by the ~~infinite_cancel~~ coop test.

Are there any user-facing changes?

Yes, the datafusion.optimizer.yield_period configuration option is removed, but at the time of writing this has not been released yet.

ozankabak · 2025-06-13T14:00:13Z

Thanks for the draft -- this is in line with my understanding from your description. I think it will inch us closer to a good, lasting solution (especially after your upstream tokio PR also merges). Feel free to ping me for a more detailed review once you are done with it.

pepijnve · 2025-06-13T15:20:33Z

@ozankabak I've pushed the optimizer rule changes I had in mind. This introduces two new execution plan properties that capture the evaluation type (how children are evaluated: eager vs lazy) and the scheduling type (how poll_next will behave wrt scheduling: blocking vs cooperative).

With those two combined, the tree can be rewritten in a bottom-up fashion. Every leaf that is not cooperative gets wrapped as before. Additionally, any eagerly evaluating nodes (i.e. exchanges) that are not cooperative are wrapped. This should ensure the entire plan participates in cooperative scheduling.
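The rewrite described above can be sketched on a toy plan tree. This is only an illustration under stated assumptions: `Plan`, `EvaluationType`, `SchedulingType`, `node`, `wrap`, `render`, and `sample_plan` are hypothetical stand-ins, not DataFusion's actual types or signatures; only the name `ensure_coop` is taken from the commit messages in this PR.

```rust
#[derive(Clone, Copy, Debug, PartialEq)]
enum EvaluationType {
    Lazy,  // produces output on demand
    Eager, // drives its children itself (e.g. an exchange)
}

#[derive(Clone, Copy, Debug, PartialEq)]
enum SchedulingType {
    Blocking,
    Cooperative,
}

/// Toy plan node (hypothetical; not DataFusion's ExecutionPlan).
#[derive(Debug)]
struct Plan {
    name: String,
    evaluation: EvaluationType,
    scheduling: SchedulingType,
    children: Vec<Plan>,
}

fn node(name: &str, evaluation: EvaluationType, scheduling: SchedulingType, children: Vec<Plan>) -> Plan {
    Plan { name: name.to_string(), evaluation, scheduling, children }
}

/// Wraps a node in a cooperative adapter node.
fn wrap(plan: Plan) -> Plan {
    node("Coop", EvaluationType::Lazy, SchedulingType::Cooperative, vec![plan])
}

/// Bottom-up rewrite: wrap blocking leaves and blocking eager nodes.
fn ensure_coop(mut plan: Plan) -> Plan {
    // Rewrite the children first so the tree is processed bottom-up.
    let children = std::mem::take(&mut plan.children);
    plan.children = children.into_iter().map(ensure_coop).collect();
    let is_leaf = plan.children.is_empty();
    let is_eager = plan.evaluation == EvaluationType::Eager;
    if plan.scheduling == SchedulingType::Blocking && (is_leaf || is_eager) {
        wrap(plan)
    } else {
        plan
    }
}

/// Renders the tree as `Name(child, child)` for inspection.
fn render(plan: &Plan) -> String {
    if plan.children.is_empty() {
        return plan.name.clone();
    }
    let kids: Vec<String> = plan.children.iter().map(render).collect();
    format!("{}({})", plan.name, kids.join(", "))
}

fn sample_plan() -> Plan {
    use EvaluationType::*;
    use SchedulingType::*;
    let scan = node("Scan", Lazy, Blocking, vec![]);
    let exchange = node("Exchange", Eager, Blocking, vec![scan]);
    let coop_scan = node("CoopScan", Lazy, Cooperative, vec![]);
    node("Join", Lazy, Blocking, vec![exchange, coop_scan])
}
```

For the sample plan, `render(&ensure_coop(sample_plan()))` gives `Join(Coop(Exchange(Coop(Scan))), CoopScan)`: the blocking leaf and the blocking exchange are wrapped, while the lazy join is left alone because it only pulls from children that already cooperate.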

The only caveat that remains is dynamic stream creation. Operators that do that need to take the necessary precautions themselves. I already updated the spill manager for this in the previous commit.

While I was writing this I started wondering if evaluation type should be a per child thing. In my spawn experiment branch for instance hash join is eager for the build side, but lazy for the probe side. Perhaps it would be best to leave room for that.

ozankabak · 2025-06-14T11:19:04Z

> While I was writing this I started wondering if evaluation type should be a per child thing. In my spawn experiment branch for instance hash join is eager for the build side, but lazy for the probe side. Perhaps it would be best to leave room for that.

This is in alignment with what I was thinking; let's do it that way.

pepijnve · 2025-06-14T19:29:15Z

Thinking about it some more: the evaluation type is intended to describe how the operator computes record batches itself, either lazily on demand or by driving things itself. I'm kind of trying to refer to the terminology from the Volcano paper, which talks about demand-driven and data-driven operators. I had first called this 'drive type' with values 'demand' and 'data', but that felt a bit awkward. Since this is actually a property of how the operator prepares its output, one value per operator is probably fine after all.

What I'm trying to do with this is find the exchanges in the plan. The current set that's present in DataFusion is all fine, but if you were to implement one using std::sync::mpsc::channel instead of the one from tokio, explicit cooperation with the scheduler would be necessary again.

pepijnve · 2025-06-14T19:29:44Z

Open to suggestions on better names for these properties.

alamb · 2025-06-17T20:00:48Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing task_budget (3b33f79) to 1429c92 diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

alamb · 2025-06-17T20:40:01Z

🤖: Benchmark completed

Details

Comparing HEAD and task_budget
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ task_budget ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1949.96 ms │  1930.13 ms │    no change │
│ QQuery 1     │   675.49 ms │   729.71 ms │ 1.08x slower │
│ QQuery 2     │  1340.52 ms │  1378.26 ms │    no change │
│ QQuery 3     │   662.34 ms │   658.36 ms │    no change │
│ QQuery 4     │  1373.39 ms │  1343.97 ms │    no change │
│ QQuery 5     │ 15044.29 ms │ 14931.43 ms │    no change │
│ QQuery 6     │  2031.52 ms │  2100.06 ms │    no change │
│ QQuery 7     │  1949.16 ms │  1911.84 ms │    no change │
│ QQuery 8     │   790.17 ms │   811.49 ms │    no change │
└──────────────┴─────────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 25816.84ms │
│ Total Time (task_budget)   │ 25795.26ms │
│ Average Time (HEAD)        │  2868.54ms │
│ Average Time (task_budget) │  2866.14ms │
│ Queries Faster             │          0 │
│ Queries Slower             │          1 │
│ Queries with No Change     │          8 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ task_budget ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │    15.07 ms │    15.52 ms │    no change │
│ QQuery 1     │    32.39 ms │    33.64 ms │    no change │
│ QQuery 2     │    81.66 ms │    81.29 ms │    no change │
│ QQuery 3     │    97.11 ms │    97.56 ms │    no change │
│ QQuery 4     │   591.09 ms │   593.49 ms │    no change │
│ QQuery 5     │   821.63 ms │   872.05 ms │ 1.06x slower │
│ QQuery 6     │    22.34 ms │    22.99 ms │    no change │
│ QQuery 7     │    36.75 ms │    36.65 ms │    no change │
│ QQuery 8     │   844.36 ms │   865.78 ms │    no change │
│ QQuery 9     │  1143.96 ms │  1148.17 ms │    no change │
│ QQuery 10    │   253.88 ms │   253.51 ms │    no change │
│ QQuery 11    │   277.72 ms │   288.38 ms │    no change │
│ QQuery 12    │   851.39 ms │   871.27 ms │    no change │
│ QQuery 13    │  1121.16 ms │  1237.87 ms │ 1.10x slower │
│ QQuery 14    │   807.20 ms │   810.33 ms │    no change │
│ QQuery 15    │   740.12 ms │   757.64 ms │    no change │
│ QQuery 16    │  1601.02 ms │  1610.53 ms │    no change │
│ QQuery 17    │  1610.29 ms │  1607.44 ms │    no change │
│ QQuery 18    │  2862.22 ms │  2887.82 ms │    no change │
│ QQuery 19    │    81.40 ms │    85.86 ms │ 1.05x slower │
│ QQuery 20    │  1137.87 ms │  1173.96 ms │    no change │
│ QQuery 21    │  1276.87 ms │  1312.58 ms │    no change │
│ QQuery 22    │  2134.80 ms │  2210.32 ms │    no change │
│ QQuery 23    │  7399.34 ms │  7585.73 ms │    no change │
│ QQuery 24    │   434.38 ms │   454.37 ms │    no change │
│ QQuery 25    │   304.06 ms │   312.91 ms │    no change │
│ QQuery 26    │   434.86 ms │   458.44 ms │ 1.05x slower │
│ QQuery 27    │  1525.86 ms │  1576.08 ms │    no change │
│ QQuery 28    │ 11727.04 ms │ 12045.71 ms │    no change │
│ QQuery 29    │   524.27 ms │   516.39 ms │    no change │
│ QQuery 30    │   770.12 ms │   793.40 ms │    no change │
│ QQuery 31    │   801.85 ms │   824.47 ms │    no change │
│ QQuery 32    │  2409.23 ms │  2419.15 ms │    no change │
│ QQuery 33    │  3090.45 ms │  3143.57 ms │    no change │
│ QQuery 34    │  3152.12 ms │  3158.17 ms │    no change │
│ QQuery 35    │  1183.10 ms │  1238.98 ms │    no change │
│ QQuery 36    │   129.34 ms │   125.39 ms │    no change │
│ QQuery 37    │    54.74 ms │    57.12 ms │    no change │
│ QQuery 38    │   122.62 ms │   125.25 ms │    no change │
│ QQuery 39    │   199.54 ms │   195.83 ms │    no change │
│ QQuery 40    │    47.79 ms │    49.30 ms │    no change │
│ QQuery 41    │    44.60 ms │    44.50 ms │    no change │
│ QQuery 42    │    39.35 ms │    38.99 ms │    no change │
└──────────────┴─────────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 52836.93ms │
│ Total Time (task_budget)   │ 54038.40ms │
│ Average Time (HEAD)        │  1228.77ms │
│ Average Time (task_budget) │  1256.71ms │
│ Queries Faster             │          0 │
│ Queries Slower             │          4 │
│ Queries with No Change     │         39 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ task_budget ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 102.55 ms │    99.10 ms │ no change │
│ QQuery 2     │  20.94 ms │    20.73 ms │ no change │
│ QQuery 3     │  32.32 ms │    32.03 ms │ no change │
│ QQuery 4     │  18.53 ms │    18.75 ms │ no change │
│ QQuery 5     │  48.91 ms │    48.40 ms │ no change │
│ QQuery 6     │  11.91 ms │    11.84 ms │ no change │
│ QQuery 7     │  87.18 ms │    87.96 ms │ no change │
│ QQuery 8     │  25.12 ms │    24.58 ms │ no change │
│ QQuery 9     │  53.78 ms │    55.14 ms │ no change │
│ QQuery 10    │  43.11 ms │    42.77 ms │ no change │
│ QQuery 11    │  11.14 ms │    11.34 ms │ no change │
│ QQuery 12    │  34.92 ms │    34.46 ms │ no change │
│ QQuery 13    │  25.68 ms │    25.61 ms │ no change │
│ QQuery 14    │   9.52 ms │     9.68 ms │ no change │
│ QQuery 15    │  19.06 ms │    18.70 ms │ no change │
│ QQuery 16    │  18.70 ms │    18.64 ms │ no change │
│ QQuery 17    │  93.82 ms │    96.06 ms │ no change │
│ QQuery 18    │ 194.66 ms │   194.28 ms │ no change │
│ QQuery 19    │  25.27 ms │    25.00 ms │ no change │
│ QQuery 20    │  32.21 ms │    31.37 ms │ no change │
│ QQuery 21    │ 144.09 ms │   146.38 ms │ no change │
│ QQuery 22    │  15.76 ms │    15.12 ms │ no change │
└──────────────┴───────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 1069.18ms │
│ Total Time (task_budget)   │ 1067.95ms │
│ Average Time (HEAD)        │   48.60ms │
│ Average Time (task_budget) │   48.54ms │
│ Queries Faster             │         0 │
│ Queries Slower             │         0 │
│ Queries with No Change     │        22 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

alamb · 2025-06-17T20:40:03Z

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing task_budget (3b33f79) to 1429c92 diff
Benchmarks: clickbench_1
Results will be posted here when complete

alamb · 2025-06-17T20:49:42Z

🤖: Benchmark completed

Details

Comparing HEAD and task_budget
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ task_budget ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    45.55 ms │    47.02 ms │     no change │
│ QQuery 1     │    70.90 ms │    74.64 ms │  1.05x slower │
│ QQuery 2     │   108.60 ms │   108.42 ms │     no change │
│ QQuery 3     │   118.38 ms │   117.48 ms │     no change │
│ QQuery 4     │   614.68 ms │   702.11 ms │  1.14x slower │
│ QQuery 5     │   843.56 ms │   868.27 ms │     no change │
│ QQuery 6     │    53.99 ms │    54.51 ms │     no change │
│ QQuery 7     │    78.46 ms │    78.61 ms │     no change │
│ QQuery 8     │   887.11 ms │   904.12 ms │     no change │
│ QQuery 9     │  1173.21 ms │  1225.36 ms │     no change │
│ QQuery 10    │   288.90 ms │   287.56 ms │     no change │
│ QQuery 11    │   313.83 ms │   324.76 ms │     no change │
│ QQuery 12    │   855.16 ms │   892.54 ms │     no change │
│ QQuery 13    │  1219.67 ms │  1250.80 ms │     no change │
│ QQuery 14    │   814.84 ms │   816.85 ms │     no change │
│ QQuery 15    │   795.60 ms │   809.07 ms │     no change │
│ QQuery 16    │  1603.15 ms │  1642.79 ms │     no change │
│ QQuery 17    │  1611.43 ms │  1620.07 ms │     no change │
│ QQuery 18    │  2884.61 ms │  2906.00 ms │     no change │
│ QQuery 19    │   121.20 ms │   121.38 ms │     no change │
│ QQuery 20    │  1180.69 ms │  1189.69 ms │     no change │
│ QQuery 21    │  1371.20 ms │  1394.64 ms │     no change │
│ QQuery 22    │  2379.76 ms │  2412.27 ms │     no change │
│ QQuery 23    │  7945.62 ms │  8090.22 ms │     no change │
│ QQuery 24    │   460.30 ms │   470.69 ms │     no change │
│ QQuery 25    │   346.40 ms │   347.84 ms │     no change │
│ QQuery 26    │   451.87 ms │   477.27 ms │  1.06x slower │
│ QQuery 27    │  1650.72 ms │  1684.69 ms │     no change │
│ QQuery 28    │ 12400.61 ms │ 12486.22 ms │     no change │
│ QQuery 29    │   568.75 ms │   554.71 ms │     no change │
│ QQuery 30    │   797.26 ms │   806.62 ms │     no change │
│ QQuery 31    │   822.99 ms │   854.98 ms │     no change │
│ QQuery 32    │  2427.54 ms │  2465.81 ms │     no change │
│ QQuery 33    │  3173.31 ms │  3268.38 ms │     no change │
│ QQuery 34    │  3262.48 ms │  3273.74 ms │     no change │
│ QQuery 35    │  1259.76 ms │  1276.50 ms │     no change │
│ QQuery 36    │   167.14 ms │   173.11 ms │     no change │
│ QQuery 37    │    98.87 ms │   106.56 ms │  1.08x slower │
│ QQuery 38    │   173.53 ms │   171.48 ms │     no change │
│ QQuery 39    │   250.93 ms │   254.95 ms │     no change │
│ QQuery 40    │    73.28 ms │    80.35 ms │  1.10x slower │
│ QQuery 41    │    83.66 ms │    79.01 ms │ +1.06x faster │
│ QQuery 42    │    76.39 ms │    74.95 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 55925.88ms │
│ Total Time (task_budget)   │ 56847.01ms │
│ Average Time (HEAD)        │  1300.60ms │
│ Average Time (task_budget) │  1322.02ms │
│ Queries Faster             │          1 │
│ Queries Slower             │          5 │
│ Queries with No Change     │         37 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘

alamb · 2025-06-18T21:15:50Z

I took the liberty of merging up from main to resolve a logical conflict

Dandandan · 2025-06-18T21:53:53Z

hmm there seems to be some regressions there...

zhuqi-lucas · 2025-06-19T03:24:28Z

> hmm there seems to be some regressions there...

Yeah, the clickbench benchmark is a little slower, and it seems to be reproducible: the total time is about 1000 ms slower. I am not sure whether it's noise.
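For reference, the size of the slowdown can be read off the "Total Time" rows of the benchmark summaries above. A small sketch of that arithmetic follows; the helper name `change` is made up for illustration, and the inputs are the totals reported by the benchmark bot.

```rust
/// Returns (absolute difference in ms, relative change in percent).
fn change(head_ms: f64, branch_ms: f64) -> (f64, f64) {
    let diff = branch_ms - head_ms;
    (diff, diff / head_ms * 100.0)
}

/// Totals (HEAD, task_budget) copied from the benchmark summaries.
fn totals() -> [(&'static str, f64, f64); 2] {
    [
        ("clickbench_partitioned", 52836.93, 54038.40),
        ("clickbench_1", 55925.88, 56847.01),
    ]
}
```

Applied to these totals, the difference is roughly 1201 ms (about 2.3%) for clickbench_partitioned and roughly 921 ms (about 1.6%) for clickbench_1.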

pepijnve force-pushed the task_budget branch 8 times, most recently from 75fd648 to 935db91 Compare June 13, 2025 13:57

alamb mentioned this pull request Jun 13, 2025

[Epic] Pipeline breaking cancellation support and improvement #16353

Open

6 tasks

pepijnve force-pushed the task_budget branch from f45bec4 to 2f870b5 Compare June 13, 2025 15:16

pepijnve force-pushed the task_budget branch 6 times, most recently from b593bfa to c648c0c Compare June 13, 2025 19:07

pepijnve mentioned this pull request Jun 14, 2025

Improve ability to cancel queries quickly #16301

Closed

pepijnve added 13 commits June 17, 2025 21:52

Rework ensure_coop to base itself on evaluation and scheduling properties

e1bb756

Iterating on documentation

a096145

Improve robustness of cooperative yielding test cases

173c17f

Reorganize tests by operator a bit better

828155b

Coop documentation

fd3e40c

More coop documentation

c9a2df2

Avoid Box in temporary CooperativeStream::poll_next implementation

2b86eae

Adapt interleave test cases for range generator

69ea290

Add temporary tokio_coop feature to unblock merging

6c014da

Extract magic number to constant

9e65459

Fix documentation error

a10ad8c

Push scheduling type down from DataSourceExec to DataSource

c8c71e5

Use custom configuration instead of feature to avoid exposing internal cooperation variants

f6f866c

pepijnve force-pushed the task_budget branch from dda6ce0 to 9c5468d Compare June 17, 2025 19:57

pepijnve added 4 commits June 17, 2025 21:58

Use dedicated enum for yield results

15eed9e

Documentation improvements from review

af1592d

More documentation

83fe18f

Change default coop strategy to 'tokio_fallback'

3b33f79

pepijnve force-pushed the task_budget branch from 9c5468d to 3b33f79 Compare June 17, 2025 19:58

pepijnve and others added 4 commits June 17, 2025 22:50

Documentation refinement

c663da3

Re-enable interleave test cases

74ff949

Merge remote-tracking branch 'apache/main' into task_budget

381d3f1

fix logical merge conflict

ff29b16
