Fix bugs and improve performance of ProxyStream #5703


Open
wants to merge 34 commits into base: main

Conversation

@arturaz (Collaborator) commented Aug 14, 2025

Noticed while working on #5710.

The ProxyStream protocol used a very small (126-byte) buffer, so there was a lot of byte shuffling going on, especially across the user <-> kernel space boundary. The protocol has been changed to allow chunk sizes larger than 126 bytes (up to Int.MaxValue). Additionally, ProxyStream used to truncate exit codes to a single byte; this has been fixed.
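To make the framing change concrete, here is a minimal sketch assuming a made-up wire format and made-up marker values (this is not Mill's actual ProxyStream code): with a single signed header byte the chunk length has to share space with the stream marker, capping chunks at roughly 126 bytes, whereas a dedicated 4-byte length field allows chunks up to Int.MaxValue, and a 4-byte exit code avoids truncation.

import java.io.{DataInputStream, DataOutputStream}

object FramingSketch {
  // Hypothetical stream markers; the real ProxyStream constants may differ.
  val Out = 1
  val Err = 2
  val End = 0

  // New-style frame: 1 byte for the stream marker, then a 4-byte chunk length.
  def writeChunk(out: DataOutputStream, stream: Int, chunk: Array[Byte]): Unit = {
    out.writeByte(stream)
    out.writeInt(chunk.length) // up to Int.MaxValue, vs. <= 126 with a single header byte
    out.write(chunk)
  }

  // The exit code is written as a full Int instead of being truncated to one byte.
  def writeExit(out: DataOutputStream, exitCode: Int): Unit = {
    out.writeByte(End)
    out.writeInt(exitCode)
  }

  def readChunk(in: DataInputStream): (Int, Array[Byte]) = {
    val stream = in.readByte().toInt
    val length = in.readInt()
    val buffer = new Array[Byte](length)
    in.readFully(buffer)
    (stream, buffer)
  }
}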

As part of the debugging effort, ProxyStream was refactored to be more readable.

This has been tested manually with:

def shout() = Task.Command {
  println("x" * (10 * 1024 * 1024))
}

The main branch takes:

real	0m0,540s
user	0m0,399s
sys	0m0,289s

This branch takes:

real	0m0,529s
user	0m0,228s
sys	0m0,094s
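The drop in sys time is consistent with far fewer user/kernel crossings once chunks are no longer capped at 126 bytes. A standalone back-of-the-envelope sketch (not Mill code; the 64 KiB figure is just an example of a larger chunk size) of how many write calls a 10 MiB payload needs:

object ChunkCount {
  // Ceiling division: how many writes are needed to push totalBytes in chunkSize pieces.
  def writesNeeded(totalBytes: Long, chunkSize: Long): Long =
    (totalBytes + chunkSize - 1) / chunkSize

  def main(args: Array[String]): Unit = {
    val total = 10L * 1024 * 1024 // the 10 MiB of "x" from the benchmark above
    println(s"126-byte chunks: ${writesNeeded(total, 126)} writes")       // 83221
    println(s"64 KiB chunks:   ${writesNeeded(total, 64 * 1024)} writes") // 160
  }
}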

@arturaz requested a review from lihaoyi on August 14, 2025 at 16:35
@lihaoyi (Member) commented Aug 14, 2025

  • Can you update the PR description to say how this was tested? Manually is fine, e.g. if you have CLI logs showing that the time to compile a hello-world Java module with jvmId set is no longer unreasonable, and the time to ./mill clean && ./mill __.compile on the Netty example build with a custom jvmId is down to close to the baseline without a custom jvmId (+1-2 seconds due to the ZincWorkerMain startup overhead)

  • Let's split out the ProxyStreams protocol change into a separate PR. Since we hope that will improve performance separately from the Piped*Stream changes, I'd like to see some benchmarks demonstrating that, e.g. def foo = Task{ println("x"*1000000) } streaming faster than it does with the prior protocol. But that's involved enough that we should do it in a follow-up PR

  • mill.util.Timed should be private[mill] so we can evolve it in the future

@arturaz (Collaborator, Author) commented Aug 17, 2025

  • Let's split out the ProxyStreams protocol change into a separate PR. Since we hope that will improve performance separately from the Piped*Stream changes, I'd like to see some benchmarks demonstrating that, e.g. def foo = Task{ println("x"*1000000) } streaming faster than it does with the prior protocol. But that's involved enough that we should do it in a follow-up PR

Created #5710

@arturaz changed the title from "Fix Zinc worker performance regression" to "Fix Zinc worker performance regression + ProxyStream refactoring" on Aug 17, 2025
@lihaoyi (Member) commented Aug 17, 2025

@arturaz could you run a final end-to-end test of __.compile on the Netty codebase to make sure the reported timing is similar to the no-custom-jvmId timing?

@lihaoyi (Member) commented Aug 17, 2025

Also please check if the slowdown that @alexarchambault reported in ./mill 'integration.ide[bsp-server].packaged.daemon.testForked' mill.integration.BspServerTests.requestSnapshots is fixed by this PR

@arturaz (Collaborator, Author) commented Aug 18, 2025

Also please check if the slowdown that @alexarchambault reported in ./mill 'integration.ide[bsp-server].packaged.daemon.testForked' mill.integration.BspServerTests.requestSnapshots is fixed by this PR

Without the fix: [7402/7402] ============================ integration.ide[bsp-server].packaged.daemon.testForked mill.integration.BspServerTests.requestSnapshots =========================== 400s and counting; cancelled to save time.
With the fix: [7402] + mill.integration.BspServerTests.requestSnapshots 49859ms

@lihaoyi (Member) commented Aug 18, 2025

Looks good then!

@lihaoyi (Member) commented Aug 18, 2025

I assume #5710 is the one we should merge first, right?

…r-performance-regression-minimized

# Conflicts:
#	integration/feature/startup-shutdown/src/StartupShutdownTests.scala
#	libs/daemon/server/src/mill/server/Server.scala
@arturaz (Collaborator, Author) commented Aug 18, 2025

I assume #5710 is the one we should merge first, right?

Yes, but I'm still running tests for Netty.

@arturaz (Collaborator, Author) commented Aug 18, 2025

@arturaz could you run a final end-to-end test of __.compile on the Netty codebase to make sure the reported timing is similar to the no-custom-jvmId timing?

Without JVM id: real 0m23,308s
With JVM id 1st time: real 0m33,452s
With JVM id 2nd time: real 0m22,015s

arturaz and others added 7 commits August 18, 2025 12:00
…into fix/5693-zinc-worker-performance-regression

# Conflicts:
#	libs/daemon/client/src/mill/client/ServerLauncher.java
#	libs/daemon/server/src/mill/server/ProxyStreamServer.scala
#	libs/daemon/server/src/mill/server/Server.scala
#	libs/javalib/worker/src/mill/javalib/zinc/ZincWorkerMain.scala
#	libs/javalib/worker/src/mill/javalib/zinc/ZincWorkerRpcServer.scala
…into fix/5693-zinc-worker-performance-regression
@lefou (Member) commented Aug 18, 2025

PR descriptions for #5710 and this one look pretty much like copy-and-paste. Can they be made more meaningful before merging?

@arturaz (Collaborator, Author) commented Aug 18, 2025

PR descriptions for #5710 and this one look pretty much like copy-and-paste. Can they be made more meaningful before merging?

They are copy-pasted; I'll edit this one once #5710 is merged.

@arturaz changed the title from "Fix Zinc worker performance regression + ProxyStream refactoring" to "Draft: fix Zinc worker performance regression + ProxyStream refactoring" on Aug 18, 2025
@arturaz marked this pull request as draft on August 18, 2025 at 09:28
@arturaz (Collaborator, Author) commented Aug 18, 2025

I'd like to see some benchmarks demonstrating that, e.g. def foo = Task{ println("x"*1000000) } streaming faster than it does with the prior protocol.

With:

def shout() = Task.Command {
  println("x" * (10 * 1024 * 1024))
}

old:

real	0m0,540s
user	0m0,399s
sys	0m0,289s

new:

real	0m0,529s
user	0m0,228s
sys	0m0,094s

AI overview:

  • Real time: There's a minor improvement, with the total elapsed time decreasing from 0.540 seconds to 0.529 seconds. This means the overall execution time that a user would perceive is slightly faster.

  • User time: This shows a very large improvement, dropping dramatically from 0.399 seconds to 0.228 seconds. This indicates a substantial optimization in the code itself, as it now requires much less CPU time to execute its instructions.

  • System time: This is the most significant change. The time spent on kernel-level operations plummeted from 0.289 seconds to just 0.094 seconds. This suggests that the new version is making far fewer or much more efficient system calls (like for I/O operations).

In summary, while the overall "real time" improvement is modest, the underlying efficiency gains are massive. The new version is significantly better optimized, requiring much less CPU time for both its own code (user) and for system-related tasks (sys). This is a clear and substantial performance enhancement.

Lower sys time means less time is spent in the kernel, and the dramatic drop in user time shows your code's logic has become much more efficient.

The reason real time can stay almost the same, even with those significant improvements, is that real time includes time the CPU isn't actively working on your process.

Here's a breakdown of what these timings mean and why this happens:

  • real (Wall-Clock Time): This is the total time that has passed from the moment you start the process to the moment it finishes, just like timing it with a stopwatch. It includes everything: the time the CPU is actively running your code, and also time spent waiting for other things.

  • user (User CPU Time): This is the amount of time the CPU spent executing your program's own code in user-space. The massive reduction here (from 0.399s to 0.228s) is a clear sign of successful code optimization.

  • sys (System CPU Time): This is the time the CPU spent executing kernel-level code on behalf of your program. This could be for tasks like reading or writing files, memory allocation, or other system calls. The huge drop you saw (from 0.289s to 0.094s) means the new version is much more efficient in how it interacts with the operating system.

Why real Time Didn't Drop as Much

The total CPU time (user + sys) for your process dropped significantly:

  • Old: 0.399s (user) + 0.289s (sys) = 0.688s
  • New: 0.228s (user) + 0.094s (sys) = 0.322s

Even though the active CPU work was more than halved, the real time only saw a minor improvement. This indicates that a large portion of the execution time is spent waiting. Your process is likely "I/O bound," meaning its speed is limited by waiting for input/output operations to complete.

Here are the most common reasons for this discrepancy:

  • Waiting for I/O: The program might be waiting for data from a hard drive, a network connection, or user input. During these waits, the CPU is not actively working on your process, so user and sys time don't increase, but the real time clock keeps ticking.
  • Process Scheduling: The operating system may be giving CPU time to other running processes. While other applications are running, your process is paused, and this waiting time contributes to the real time.
  • Sleeping or Deliberate Pauses: If the code has intentional delays (like a sleep command), the process will be inactive, which only adds to the real time.

In conclusion, your optimizations have made the CPU-intensive parts of your program dramatically faster. However, the overall execution time is still dominated by waiting for external resources, which is why the real time hasn't seen a proportional decrease.

@lihaoyi (Member) commented Aug 18, 2025

Please don't paste me AI overviews; if I wanted to hear what Gemini has to say about the results, I can ask it myself.

…r-performance-regression

# Conflicts:
#	libs/daemon/server/src/mill/server/ProxyStreamServer.scala
@arturaz changed the title from "Draft: fix Zinc worker performance regression + ProxyStream refactoring" to "Draft: fix bugs and improve performance of ProxyStream" on Aug 18, 2025
@arturaz marked this pull request as ready for review on August 18, 2025 at 11:04
@arturaz changed the title from "Draft: fix bugs and improve performance of ProxyStream" to "Fix bugs and improve performance of ProxyStream" on Aug 18, 2025