Fix bugs and improve performance of ProxyStream #5703


Open
wants to merge 34 commits into base: main

Conversation

@arturaz (Collaborator) commented Aug 14, 2025

Noticed while working on #5710.

The ProxyStream protocol used a very small (126-byte) buffer, so there was a lot of byte shuffling going on, especially across the user <-> kernel space boundary. The protocol has been changed to allow chunk sizes larger than 126 bytes (up to Int.MaxValue). Additionally, ProxyStream used to truncate exit codes to a single byte; this has been fixed.
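To make the framing change concrete, here is a minimal sketch assuming a made-up wire format and made-up marker values (this is not Mill's actual ProxyStream code): with a single signed header byte the chunk length has to share space with the stream marker, capping chunks at roughly 126 bytes, whereas a dedicated 4-byte length field allows chunks up to Int.MaxValue, and a 4-byte exit code avoids truncation.

import java.io.{DataInputStream, DataOutputStream}

object FramingSketch {
  // Hypothetical stream markers; the real ProxyStream constants may differ.
  val Out = 1
  val Err = 2
  val End = 0

  // New-style frame: 1 byte for the stream marker, then a 4-byte chunk length.
  def writeChunk(out: DataOutputStream, stream: Int, chunk: Array[Byte]): Unit = {
    out.writeByte(stream)
    out.writeInt(chunk.length) // up to Int.MaxValue, vs. <= 126 with a single header byte
    out.write(chunk)
  }

  // The exit code is written as a full Int instead of being truncated to one byte.
  def writeExit(out: DataOutputStream, exitCode: Int): Unit = {
    out.writeByte(End)
    out.writeInt(exitCode)
  }

  def readChunk(in: DataInputStream): (Int, Array[Byte]) = {
    val stream = in.readByte().toInt
    val length = in.readInt()
    val buffer = new Array[Byte](length)
    in.readFully(buffer)
    (stream, buffer)
  }
}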

As part of the debugging effort, ProxyStream was refactored to be more readable.

This has been tested manually with:

def shout() = Task.Command {
  println("x" * (10 * 1024 * 1024))
}

The main branch takes:

real	0m0,540s
user	0m0,399s
sys	0m0,289s

This branch takes:

real	0m0,529s
user	0m0,228s
sys	0m0,094s
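The drop in sys time is consistent with far fewer user/kernel crossings once chunks are no longer capped at 126 bytes. A standalone back-of-the-envelope sketch (not Mill code; the 64 KiB figure is just an example of a larger chunk size) of how many write calls a 10 MiB payload needs:

object ChunkCount {
  // Ceiling division: how many writes are needed to push totalBytes in chunkSize pieces.
  def writesNeeded(totalBytes: Long, chunkSize: Long): Long =
    (totalBytes + chunkSize - 1) / chunkSize

  def main(args: Array[String]): Unit = {
    val total = 10L * 1024 * 1024 // the 10 MiB of "x" from the benchmark above
    println(s"126-byte chunks: ${writesNeeded(total, 126)} writes")       // 83221
    println(s"64 KiB chunks:   ${writesNeeded(total, 64 * 1024)} writes") // 160
  }
}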

@arturaz requested a review from lihaoyi on August 14, 2025 at 16:35
@lihaoyi (Member) commented Aug 14, 2025

  • Can you update the PR description to say how this was tested? Manually is fine, e.g. if you have CLI logs showing that the time to compile a hello-world Java module with jvmId set is no longer unreasonable, and the time to ./mill clean && ./mill __.compile on the Netty example build with a custom jvmId is down to close to the baseline without a custom jvmId (+1-2 seconds due to the ZincWorkerMain startup overhead)

  • Let's split out the ProxyStreams protocol change into a separate PR. Since we hope that will improve performance separately from the Piped*Stream changes, I'd like to see some benchmarks demonstrating that, e.g. def foo = Task{ println("x"*1000000) } streaming faster than it does with the prior protocol. But that's involved enough that we should do it in a follow-up PR

  • mill.util.Timed should be private[mill] so we can evolve it in the future

@arturaz (Collaborator, Author) commented Aug 17, 2025

  • Let's split out the ProxyStreams protocol change into a separate PR. Since we hope that will improve performance separately from the Piped*Stream changes, I'd like to see some benchmarks demonstrating that, e.g. def foo = Task{ println("x"*1000000) } streaming faster than it does with the prior protocol. But that's involved enough that we should do it in a follow-up PR

Created #5710

@arturaz changed the title from "Fix Zinc worker performance regression" to "Fix Zinc worker performance regression + ProxyStream refactoring" on Aug 17, 2025
@lihaoyi (Member) commented Aug 17, 2025

@arturaz could you run a final end-to-end test of __.compile on the Netty codebase to make sure the reported timing is similar to the no-custom-jvmId timing?

@lihaoyi (Member) commented Aug 17, 2025

Also please check if the slowdown that @alexarchambault reported in ./mill 'integration.ide[bsp-server].packaged.daemon.testForked' mill.integration.BspServerTests.requestSnapshots is fixed by this PR

@arturaz (Collaborator, Author) commented Aug 18, 2025

Also please check if the slowdown that @alexarchambault reported in ./mill 'integration.ide[bsp-server].packaged.daemon.testForked' mill.integration.BspServerTests.requestSnapshots is fixed by this PR

Without the fix: [7402/7402] ============================ integration.ide[bsp-server].packaged.daemon.testForked mill.integration.BspServerTests.requestSnapshots =========================== 400s and counting; cancelled to save time.
With the fix: [7402] + mill.integration.BspServerTests.requestSnapshots 49859ms

@lihaoyi (Member) commented Aug 18, 2025

Looks good then!

@lihaoyi (Member) commented Aug 18, 2025

I assume #5710 is the one we should merge first, right?

…r-performance-regression-minimized

# Conflicts:
#	integration/feature/startup-shutdown/src/StartupShutdownTests.scala
#	libs/daemon/server/src/mill/server/Server.scala
@arturaz (Collaborator, Author) commented Aug 18, 2025

I assume #5710 is the one we should merge first, right?

Yes, but I'm still running tests for Netty.

@arturaz (Collaborator, Author) commented Aug 18, 2025

@arturaz could you run a final end-to-end test of __.compile on the Netty codebase to make sure the reported timing is similar to the no-custom-jvmId timing?

Without JVM id: real 0m23,308s
With JVM id 1st time: real 0m33,452s
With JVM id 2nd time: real 0m22,015s

arturaz and others added 7 commits August 18, 2025 12:00
…into fix/5693-zinc-worker-performance-regression

# Conflicts:
#	libs/daemon/client/src/mill/client/ServerLauncher.java
#	libs/daemon/server/src/mill/server/ProxyStreamServer.scala
#	libs/daemon/server/src/mill/server/Server.scala
#	libs/javalib/worker/src/mill/javalib/zinc/ZincWorkerMain.scala
#	libs/javalib/worker/src/mill/javalib/zinc/ZincWorkerRpcServer.scala
…into fix/5693-zinc-worker-performance-regression
@lefou (Member) commented Aug 18, 2025

PR descriptions for #5710 and this one look pretty much like copy-and-paste. Can they be made more meaningful before merging?

@arturaz (Collaborator, Author) commented Aug 18, 2025

PR descriptions for #5710 and this one look pretty much like copy-and-paste. Can they be made more meaningful before merging?

They are copy-pasted; I'll edit this one once #5710 is merged.

@arturaz changed the title from "Fix Zinc worker performance regression + ProxyStream refactoring" to "Draft: fix Zinc worker performance regression + ProxyStream refactoring" on Aug 18, 2025
@arturaz marked this pull request as draft on August 18, 2025 at 09:28
@arturaz (Collaborator, Author) commented Aug 18, 2025

I'd like to see some benchmarks demonstrating that, e.g. def foo = Task{ println("x"*1000000) } streaming faster than it does with the prior protocol.

With:

def shout() = Task.Command {
  println("x" * (10 * 1024 * 1024))
}

old:

real	0m0,540s
user	0m0,399s
sys	0m0,289s

new:

real	0m0,529s
user	0m0,228s
sys	0m0,094s

AI overview:

  • Real time: There's a minor improvement, with the total elapsed time decreasing from 0.540 seconds to 0.529 seconds. This means the overall execution time that a user would perceive is slightly faster.

  • User time: This shows a very large improvement, dropping dramatically from 0.399 seconds to 0.228 seconds. This indicates a substantial optimization in the code itself, as it now requires much less CPU time to execute its instructions.

  • System time: This is the most significant change. The time spent on kernel-level operations plummeted from 0.289 seconds to just 0.094 seconds. This suggests that the new version is making far fewer or much more efficient system calls (like for I/O operations).

In summary, while the overall "real time" improvement is modest, the underlying efficiency gains are massive. The new version is significantly better optimized, requiring much less CPU time for both its own code (user) and for system-related tasks (sys). This is a clear and substantial performance enhancement.

Lower sys time means less time is spent in the kernel, and the dramatic drop in user time shows your code's logic has become much more efficient.

The reason real time can stay almost the same, even with those significant improvements, is that real time includes time the CPU isn't actively working on your process.

Here's a breakdown of what these timings mean and why this happens:

  • real (Wall-Clock Time): This is the total time that has passed from the moment you start the process to the moment it finishes, just like timing it with a stopwatch. It includes everything: the time the CPU is actively running your code, and also time spent waiting for other things.

  • user (User CPU Time): This is the amount of time the CPU spent executing your program's own code in user-space. The massive reduction here (from 0.399s to 0.228s) is a clear sign of successful code optimization.

  • sys (System CPU Time): This is the time the CPU spent executing kernel-level code on behalf of your program. This could be for tasks like reading or writing files, memory allocation, or other system calls. The huge drop you saw (from 0.289s to 0.094s) means the new version is much more efficient in how it interacts with the operating system.

Why real Time Didn't Drop as Much

The total CPU time (user + sys) for your process dropped significantly:

  • Old: 0.399s (user) + 0.289s (sys) = 0.688s
  • New: 0.228s (user) + 0.094s (sys) = 0.322s

Even though the active CPU work was more than halved, the real time only saw a minor improvement. This indicates that a large portion of the execution time is spent waiting. Your process is likely "I/O bound," meaning its speed is limited by waiting for input/output operations to complete.

Here are the most common reasons for this discrepancy:

  • Waiting for I/O: The program might be waiting for data from a hard drive, a network connection, or user input. During these waits, the CPU is not actively working on your process, so user and sys time don't increase, but the real time clock keeps ticking.
  • Process Scheduling: The operating system may be giving CPU time to other running processes. While other applications are running, your process is paused, and this waiting time contributes to the real time.
  • Sleeping or Deliberate Pauses: If the code has intentional delays (like a sleep command), the process will be inactive, which only adds to the real time.

In conclusion, your optimizations have made the CPU-intensive parts of your program dramatically faster. However, the overall execution time is still dominated by waiting for external resources, which is why the real time hasn't seen a proportional decrease.

@lihaoyi (Member) commented Aug 18, 2025

Please don't paste me AI overviews; if I wanted to hear what Gemini has to say about the results, I can ask it myself.

…r-performance-regression

# Conflicts:
#	libs/daemon/server/src/mill/server/ProxyStreamServer.scala
@arturaz changed the title from "Draft: fix Zinc worker performance regression + ProxyStream refactoring" to "Draft: fix bugs and improve performance of ProxyStream" on Aug 18, 2025
@arturaz marked this pull request as ready for review on August 18, 2025 at 11:04
@arturaz changed the title from "Draft: fix bugs and improve performance of ProxyStream" to "Fix bugs and improve performance of ProxyStream" on Aug 18, 2025