Use Tokio's task budget consistently, better APIs to support task cancellation #16398

Open
wants to merge 22 commits into
base: main
from

Conversation

pepijnve
Contributor

@pepijnve pepijnve commented Jun 13, 2025

Which issue does this PR close?

Rationale for this change

RecordBatchStreamReceiver supports cooperative scheduling implicitly by using Tokio's task budget. YieldStream currently uses a custom mechanism. It would be better to use a single mechanism consistently.
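As a rough illustration of what a task budget does, the following std-only sketch models a per-task budget that a future consumes while it works; when the budget runs out, the future returns `Poll::Pending` so other tasks can run. This is not Tokio's API: `BUDGET`, `consume_budget`, `Work`, and `run_counting_yields` are hypothetical names invented for this example, and Tokio's real budget lives inside its runtime.

```rust
use std::cell::Cell;
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

// Hypothetical stand-in for a per-task budget (assumption: not Tokio's implementation).
thread_local! {
    static BUDGET: Cell<u32> = Cell::new(0);
}

/// Consume one unit of budget; returns false once the budget is exhausted.
fn consume_budget() -> bool {
    BUDGET.with(|b| {
        let left = b.get();
        if left == 0 {
            false
        } else {
            b.set(left - 1);
            true
        }
    })
}

/// A future that performs `remaining` units of work and yields when out of budget.
struct Work {
    remaining: u32,
}

impl Future for Work {
    type Output = ();

    fn poll(mut self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        while self.remaining > 0 {
            if !consume_budget() {
                // Out of budget: request another poll and hand control back.
                cx.waker().wake_by_ref();
                return Poll::Pending;
            }
            self.remaining -= 1;
        }
        Poll::Ready(())
    }
}

/// Waker that does nothing; the toy executor below simply polls in a loop.
struct NoopWaker;

impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Toy executor: refills the budget before each poll and counts the yields.
fn run_counting_yields(work: u32, budget_per_poll: u32) -> u32 {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(Work { remaining: work });
    let mut yields = 0;
    loop {
        BUDGET.with(|b| b.set(budget_per_poll));
        match fut.as_mut().poll(&mut cx) {
            Poll::Ready(()) => return yields,
            Poll::Pending => yields += 1,
        }
    }
}
```

With a budget of 4 per poll, `run_counting_yields(10, 4)` reports 2 yields: the future does 4 + 4 + 2 units of work across three polls.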

What changes are included in this PR?

  • Renamed YieldStream and related types to CooperativeStream
  • Removed configuration option which is no longer applicable
  • Enabled cooperative scheduling in spill manager

Note that the implementation of CooperativeStream in this PR is suboptimal. The final implementation requires tokio-rs/tokio#7405 which I'm trying to move along as best I can.

Are these changes tested?

Covered by infinite_cancelcoop test.

Are there any user-facing changes?

Yes, the datafusion.optimizer.yield_period configuration option is removed, but at the time of writing this has not been released yet.

@github-actions github-actions bot added the documentation, optimizer, core, sqllogictest, common, proto, datasource, and physical-plan labels Jun 13, 2025
@pepijnve pepijnve force-pushed the task_budget branch 8 times, most recently from 75fd648 to 935db91 Compare June 13, 2025 13:57
@ozankabak
Contributor

Thanks for the draft -- this is in line with my understanding from your description. I think it will inch us closer to a good, lasting solution (especially after your upstream tokio PR also merges). Feel free to ping me for a more detailed review once you are done with it.

@pepijnve
Contributor Author

pepijnve commented Jun 13, 2025

@ozankabak I've pushed the optimizer rule changes I had in mind. This introduces two new execution plan properties that capture the evaluation type (how children are evaluated: eager vs lazy) and the scheduling type (how poll_next will behave wrt scheduling: blocking vs cooperative).

With those two combined the tree can be rewritten in a bottom up fashion. Every leaf that is not cooperative gets wrapped as before. Additionally, any eager evaluating nodes (i.e. exchanges) that are not cooperative are wrapped. This should ensure the entire plan participates in cooperative scheduling.
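The bottom-up rewrite described above can be sketched roughly as follows. This is a simplified model, not the code in this PR: `Plan`, `EvaluationType`, `SchedulingType`, `node`, `wrap`, `make_cooperative`, and `render` are hypothetical names, and a real execution plan carries far more state than a name and two flags.

```rust
/// How an operator produces its output (hypothetical names mirroring the discussion).
#[derive(Clone, Copy, Debug, PartialEq)]
enum EvaluationType {
    Lazy,
    Eager,
}

/// How the operator's poll_next behaves with respect to scheduling.
#[derive(Clone, Copy, Debug, PartialEq)]
enum SchedulingType {
    Blocking,
    Cooperative,
}

#[derive(Debug, PartialEq)]
struct Plan {
    name: String,
    evaluation: EvaluationType,
    scheduling: SchedulingType,
    children: Vec<Plan>,
}

/// Convenience constructor for the toy plan nodes.
fn node(name: &str, evaluation: EvaluationType, scheduling: SchedulingType, children: Vec<Plan>) -> Plan {
    Plan { name: name.to_string(), evaluation, scheduling, children }
}

/// Wrap a plan in a node that participates in cooperative scheduling.
fn wrap(plan: Plan) -> Plan {
    node("Cooperative", EvaluationType::Lazy, SchedulingType::Cooperative, vec![plan])
}

/// Rewrite bottom-up: wrap non-cooperative leaves and non-cooperative eager nodes.
fn make_cooperative(mut plan: Plan) -> Plan {
    // Children first, so wrapping decisions are made from the leaves upwards.
    let children = std::mem::take(&mut plan.children);
    plan.children = children.into_iter().map(make_cooperative).collect();

    let is_leaf = plan.children.is_empty();
    let is_eager = plan.evaluation == EvaluationType::Eager;
    if (is_leaf || is_eager) && plan.scheduling != SchedulingType::Cooperative {
        wrap(plan)
    } else {
        plan
    }
}

/// Render the plan as nested names, e.g. `A(B, C)`, to inspect the rewrite.
fn render(plan: &Plan) -> String {
    if plan.children.is_empty() {
        plan.name.clone()
    } else {
        let inner: Vec<String> = plan.children.iter().map(render).collect();
        format!("{}({})", plan.name, inner.join(", "))
    }
}
```

For a plan `Exchange(Filter(Scan))` in which nothing is cooperative and only `Exchange` is eager, `render(&make_cooperative(plan))` gives `Cooperative(Exchange(Filter(Cooperative(Scan))))`.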

The only caveat that remains is dynamic stream creation. Operators that do that need to take the necessary precautions themselves. I already updated the spill manager for this in the previous commit.

While I was writing this I started wondering if evaluation type should be a per child thing. In my spawn experiment branch for instance hash join is eager for the build side, but lazy for the probe side. Perhaps it would be best to leave room for that.

@pepijnve pepijnve force-pushed the task_budget branch 6 times, most recently from b593bfa to c648c0c Compare June 13, 2025 19:07
@ozankabak
Contributor

While I was writing this I started wondering if evaluation type should be a per child thing. In my spawn experiment branch for instance hash join is eager for the build side, but lazy for the probe side. Perhaps it would be best to leave room for that.

This is in alignment with what I was thinking, let's do it that way

@pepijnve
Contributor Author

pepijnve commented Jun 14, 2025

Thinking about it some more. The evaluation type is intended to describe how the operator computes record batches itself: lazily on demand, or by driving things itself. I’m kind of trying to refer to the terminology from the Volcano paper, which talks about demand-driven and data-driven operators. I had first called this 'drive type' with values 'demand' and 'data', but that felt a bit awkward. Since this is actually a property of how the operator prepares its output, one value per operator is probably fine after all.

What I'm trying to do with this is find the exchanges in the plan. The current set that's present in DataFusion is all fine, but if you were to implement one using std::sync::mpsc::channel instead of the one from tokio, explicit cooperation with the scheduler would be necessary again.
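To make the `std::sync::mpsc::channel` case concrete, here is a minimal sketch of what explicit cooperation could look like for a consumer of a std channel. It is an assumption-laden illustration, not code from DataFusion: `Step` and `drain_step` are hypothetical names, and the per-step item limit stands in for whatever yield policy an operator would actually use.

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};

/// Outcome of one poll-like step over the channel (hypothetical type for this sketch).
#[derive(Debug, PartialEq)]
enum Step {
    /// The per-step limit was reached; the caller should yield and call again.
    Yielded(Vec<u32>),
    /// No data is available right now, but senders are still alive.
    Empty(Vec<u32>),
    /// All senders were dropped and the channel is drained.
    Done(Vec<u32>),
}

/// Drain at most `max_items` from the channel without blocking the thread.
/// The std channel knows nothing about the async scheduler, so the limit
/// has to be enforced explicitly by the consumer.
fn drain_step(rx: &Receiver<u32>, max_items: usize) -> Step {
    let mut out = Vec::new();
    loop {
        if out.len() == max_items {
            return Step::Yielded(out);
        }
        match rx.try_recv() {
            Ok(v) => out.push(v),
            Err(TryRecvError::Empty) => return Step::Empty(out),
            Err(TryRecvError::Disconnected) => return Step::Done(out),
        }
    }
}
```

Without the `max_items` check, a consumer whose producer keeps the channel full would keep looping and never give other tasks a chance to run.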

@pepijnve
Contributor Author

Open to suggestions on better names for these properties.

@alamb
Contributor

alamb commented Jun 17, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing task_budget (3b33f79) to 1429c92 diff
Benchmarks: tpch_mem clickbench_partitioned clickbench_extended
Results will be posted here when complete

@alamb
Contributor

alamb commented Jun 17, 2025

🤖: Benchmark completed


Comparing HEAD and task_budget
--------------------
Benchmark clickbench_extended.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ task_budget ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │  1949.96 ms │  1930.13 ms │    no change │
│ QQuery 1     │   675.49 ms │   729.71 ms │ 1.08x slower │
│ QQuery 2     │  1340.52 ms │  1378.26 ms │    no change │
│ QQuery 3     │   662.34 ms │   658.36 ms │    no change │
│ QQuery 4     │  1373.39 ms │  1343.97 ms │    no change │
│ QQuery 5     │ 15044.29 ms │ 14931.43 ms │    no change │
│ QQuery 6     │  2031.52 ms │  2100.06 ms │    no change │
│ QQuery 7     │  1949.16 ms │  1911.84 ms │    no change │
│ QQuery 8     │   790.17 ms │   811.49 ms │    no change │
└──────────────┴─────────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 25816.84ms │
│ Total Time (task_budget)   │ 25795.26ms │
│ Average Time (HEAD)        │  2868.54ms │
│ Average Time (task_budget) │  2866.14ms │
│ Queries Faster             │          0 │
│ Queries Slower             │          1 │
│ Queries with No Change     │          8 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ task_budget ┃       Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ QQuery 0     │    15.07 ms │    15.52 ms │    no change │
│ QQuery 1     │    32.39 ms │    33.64 ms │    no change │
│ QQuery 2     │    81.66 ms │    81.29 ms │    no change │
│ QQuery 3     │    97.11 ms │    97.56 ms │    no change │
│ QQuery 4     │   591.09 ms │   593.49 ms │    no change │
│ QQuery 5     │   821.63 ms │   872.05 ms │ 1.06x slower │
│ QQuery 6     │    22.34 ms │    22.99 ms │    no change │
│ QQuery 7     │    36.75 ms │    36.65 ms │    no change │
│ QQuery 8     │   844.36 ms │   865.78 ms │    no change │
│ QQuery 9     │  1143.96 ms │  1148.17 ms │    no change │
│ QQuery 10    │   253.88 ms │   253.51 ms │    no change │
│ QQuery 11    │   277.72 ms │   288.38 ms │    no change │
│ QQuery 12    │   851.39 ms │   871.27 ms │    no change │
│ QQuery 13    │  1121.16 ms │  1237.87 ms │ 1.10x slower │
│ QQuery 14    │   807.20 ms │   810.33 ms │    no change │
│ QQuery 15    │   740.12 ms │   757.64 ms │    no change │
│ QQuery 16    │  1601.02 ms │  1610.53 ms │    no change │
│ QQuery 17    │  1610.29 ms │  1607.44 ms │    no change │
│ QQuery 18    │  2862.22 ms │  2887.82 ms │    no change │
│ QQuery 19    │    81.40 ms │    85.86 ms │ 1.05x slower │
│ QQuery 20    │  1137.87 ms │  1173.96 ms │    no change │
│ QQuery 21    │  1276.87 ms │  1312.58 ms │    no change │
│ QQuery 22    │  2134.80 ms │  2210.32 ms │    no change │
│ QQuery 23    │  7399.34 ms │  7585.73 ms │    no change │
│ QQuery 24    │   434.38 ms │   454.37 ms │    no change │
│ QQuery 25    │   304.06 ms │   312.91 ms │    no change │
│ QQuery 26    │   434.86 ms │   458.44 ms │ 1.05x slower │
│ QQuery 27    │  1525.86 ms │  1576.08 ms │    no change │
│ QQuery 28    │ 11727.04 ms │ 12045.71 ms │    no change │
│ QQuery 29    │   524.27 ms │   516.39 ms │    no change │
│ QQuery 30    │   770.12 ms │   793.40 ms │    no change │
│ QQuery 31    │   801.85 ms │   824.47 ms │    no change │
│ QQuery 32    │  2409.23 ms │  2419.15 ms │    no change │
│ QQuery 33    │  3090.45 ms │  3143.57 ms │    no change │
│ QQuery 34    │  3152.12 ms │  3158.17 ms │    no change │
│ QQuery 35    │  1183.10 ms │  1238.98 ms │    no change │
│ QQuery 36    │   129.34 ms │   125.39 ms │    no change │
│ QQuery 37    │    54.74 ms │    57.12 ms │    no change │
│ QQuery 38    │   122.62 ms │   125.25 ms │    no change │
│ QQuery 39    │   199.54 ms │   195.83 ms │    no change │
│ QQuery 40    │    47.79 ms │    49.30 ms │    no change │
│ QQuery 41    │    44.60 ms │    44.50 ms │    no change │
│ QQuery 42    │    39.35 ms │    38.99 ms │    no change │
└──────────────┴─────────────┴─────────────┴──────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 52836.93ms │
│ Total Time (task_budget)   │ 54038.40ms │
│ Average Time (HEAD)        │  1228.77ms │
│ Average Time (task_budget) │  1256.71ms │
│ Queries Faster             │          0 │
│ Queries Slower             │          4 │
│ Queries with No Change     │         39 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘
--------------------
Benchmark tpch_mem_sf1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Query        ┃      HEAD ┃ task_budget ┃    Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ QQuery 1     │ 102.55 ms │    99.10 ms │ no change │
│ QQuery 2     │  20.94 ms │    20.73 ms │ no change │
│ QQuery 3     │  32.32 ms │    32.03 ms │ no change │
│ QQuery 4     │  18.53 ms │    18.75 ms │ no change │
│ QQuery 5     │  48.91 ms │    48.40 ms │ no change │
│ QQuery 6     │  11.91 ms │    11.84 ms │ no change │
│ QQuery 7     │  87.18 ms │    87.96 ms │ no change │
│ QQuery 8     │  25.12 ms │    24.58 ms │ no change │
│ QQuery 9     │  53.78 ms │    55.14 ms │ no change │
│ QQuery 10    │  43.11 ms │    42.77 ms │ no change │
│ QQuery 11    │  11.14 ms │    11.34 ms │ no change │
│ QQuery 12    │  34.92 ms │    34.46 ms │ no change │
│ QQuery 13    │  25.68 ms │    25.61 ms │ no change │
│ QQuery 14    │   9.52 ms │     9.68 ms │ no change │
│ QQuery 15    │  19.06 ms │    18.70 ms │ no change │
│ QQuery 16    │  18.70 ms │    18.64 ms │ no change │
│ QQuery 17    │  93.82 ms │    96.06 ms │ no change │
│ QQuery 18    │ 194.66 ms │   194.28 ms │ no change │
│ QQuery 19    │  25.27 ms │    25.00 ms │ no change │
│ QQuery 20    │  32.21 ms │    31.37 ms │ no change │
│ QQuery 21    │ 144.09 ms │   146.38 ms │ no change │
│ QQuery 22    │  15.76 ms │    15.12 ms │ no change │
└──────────────┴───────────┴─────────────┴───────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┓
┃ Benchmark Summary          ┃           ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 1069.18ms │
│ Total Time (task_budget)   │ 1067.95ms │
│ Average Time (HEAD)        │   48.60ms │
│ Average Time (task_budget) │   48.54ms │
│ Queries Faster             │         0 │
│ Queries Slower             │         0 │
│ Queries with No Change     │        22 │
│ Queries with Failure       │         0 │
└────────────────────────────┴───────────┘

@alamb
Contributor

alamb commented Jun 17, 2025

🤖 ./gh_compare_branch.sh Benchmark Script Running
Linux aal-dev 6.11.0-1015-gcp #15~24.04.1-Ubuntu SMP Thu Apr 24 20:41:05 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Comparing task_budget (3b33f79) to 1429c92 diff
Benchmarks: clickbench_1
Results will be posted here when complete

@alamb
Contributor

alamb commented Jun 17, 2025

🤖: Benchmark completed


Comparing HEAD and task_budget
--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃        HEAD ┃ task_budget ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │    45.55 ms │    47.02 ms │     no change │
│ QQuery 1     │    70.90 ms │    74.64 ms │  1.05x slower │
│ QQuery 2     │   108.60 ms │   108.42 ms │     no change │
│ QQuery 3     │   118.38 ms │   117.48 ms │     no change │
│ QQuery 4     │   614.68 ms │   702.11 ms │  1.14x slower │
│ QQuery 5     │   843.56 ms │   868.27 ms │     no change │
│ QQuery 6     │    53.99 ms │    54.51 ms │     no change │
│ QQuery 7     │    78.46 ms │    78.61 ms │     no change │
│ QQuery 8     │   887.11 ms │   904.12 ms │     no change │
│ QQuery 9     │  1173.21 ms │  1225.36 ms │     no change │
│ QQuery 10    │   288.90 ms │   287.56 ms │     no change │
│ QQuery 11    │   313.83 ms │   324.76 ms │     no change │
│ QQuery 12    │   855.16 ms │   892.54 ms │     no change │
│ QQuery 13    │  1219.67 ms │  1250.80 ms │     no change │
│ QQuery 14    │   814.84 ms │   816.85 ms │     no change │
│ QQuery 15    │   795.60 ms │   809.07 ms │     no change │
│ QQuery 16    │  1603.15 ms │  1642.79 ms │     no change │
│ QQuery 17    │  1611.43 ms │  1620.07 ms │     no change │
│ QQuery 18    │  2884.61 ms │  2906.00 ms │     no change │
│ QQuery 19    │   121.20 ms │   121.38 ms │     no change │
│ QQuery 20    │  1180.69 ms │  1189.69 ms │     no change │
│ QQuery 21    │  1371.20 ms │  1394.64 ms │     no change │
│ QQuery 22    │  2379.76 ms │  2412.27 ms │     no change │
│ QQuery 23    │  7945.62 ms │  8090.22 ms │     no change │
│ QQuery 24    │   460.30 ms │   470.69 ms │     no change │
│ QQuery 25    │   346.40 ms │   347.84 ms │     no change │
│ QQuery 26    │   451.87 ms │   477.27 ms │  1.06x slower │
│ QQuery 27    │  1650.72 ms │  1684.69 ms │     no change │
│ QQuery 28    │ 12400.61 ms │ 12486.22 ms │     no change │
│ QQuery 29    │   568.75 ms │   554.71 ms │     no change │
│ QQuery 30    │   797.26 ms │   806.62 ms │     no change │
│ QQuery 31    │   822.99 ms │   854.98 ms │     no change │
│ QQuery 32    │  2427.54 ms │  2465.81 ms │     no change │
│ QQuery 33    │  3173.31 ms │  3268.38 ms │     no change │
│ QQuery 34    │  3262.48 ms │  3273.74 ms │     no change │
│ QQuery 35    │  1259.76 ms │  1276.50 ms │     no change │
│ QQuery 36    │   167.14 ms │   173.11 ms │     no change │
│ QQuery 37    │    98.87 ms │   106.56 ms │  1.08x slower │
│ QQuery 38    │   173.53 ms │   171.48 ms │     no change │
│ QQuery 39    │   250.93 ms │   254.95 ms │     no change │
│ QQuery 40    │    73.28 ms │    80.35 ms │  1.10x slower │
│ QQuery 41    │    83.66 ms │    79.01 ms │ +1.06x faster │
│ QQuery 42    │    76.39 ms │    74.95 ms │     no change │
└──────────────┴─────────────┴─────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary          ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (HEAD)          │ 55925.88ms │
│ Total Time (task_budget)   │ 56847.01ms │
│ Average Time (HEAD)        │  1300.60ms │
│ Average Time (task_budget) │  1322.02ms │
│ Queries Faster             │          1 │
│ Queries Slower             │          5 │
│ Queries with No Change     │         37 │
│ Queries with Failure       │          0 │
└────────────────────────────┴────────────┘

@alamb
Contributor

alamb commented Jun 18, 2025

I took the liberty of merging up from main to resolve a logical conflict

@Dandandan
Contributor

hmm there seems to be some regressions there...

@zhuqi-lucas
Contributor

zhuqi-lucas commented Jun 19, 2025

Yeah, the clickbench benchmark is a little slower. It seems to be reproducible: the total time is about 1000 ms slower. I am not sure if it's noise.

hmm there seems to be some regressions there...
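For reference, the size of the difference can be read off the total times in the benchmark summaries posted earlier in this thread. The helper names below (`change`, `totals`, `report`) are made up for this calculation; only the numbers come from the tables.

```rust
/// Absolute (ms) and relative (%) change between two total run times.
fn change(head_ms: f64, branch_ms: f64) -> (f64, f64) {
    let delta = branch_ms - head_ms;
    (delta, delta / head_ms * 100.0)
}

/// Total times (HEAD, task_budget) copied from the benchmark summaries in this thread.
fn totals() -> [(&'static str, f64, f64); 4] {
    [
        ("clickbench_extended", 25816.84, 25795.26),
        ("clickbench_partitioned", 52836.93, 54038.40),
        ("tpch_mem_sf1", 1069.18, 1067.95),
        ("clickbench_1", 55925.88, 56847.01),
    ]
}

/// One line per benchmark with the signed delta and percentage.
fn report() -> Vec<String> {
    totals()
        .iter()
        .map(|(name, head, branch)| {
            let (delta, pct) = change(*head, *branch);
            format!("{name}: {delta:+.2} ms ({pct:+.2}%)")
        })
        .collect()
}
```

On these totals, clickbench_partitioned is about 1201 ms slower (roughly 2.3%) and clickbench_1 about 921 ms slower (roughly 1.6%), while clickbench_extended and tpch_mem_sf1 are marginally faster.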

Labels
common, core, datasource, documentation, optimizer, physical-plan, proto, sqllogictest
Development

Successfully merging this pull request may close these issues.

[Epic] Pipeline breaking cancellation support and improvement
5 participants