Improve write performance in some cases by managing PatchPoints and containers directly in writer #1101
Conversation
Codecov Report

❌ Patch coverage is …

Additional details and impacted files:

|            | master | #1101 | +/-    |
|------------|--------|-------|--------|
| Coverage   | 67.23% | 67.98% | +0.74% |
| Complexity | 5484   | 5621  | +137   |
| Files      | 159    | 160   | +1     |
| Lines      | 23025  | 23289 | +264   |
| Branches   | 4126   | 4176  | +50    |
| Hits       | 15481  | 15832 | +351   |
| Misses     | 6262   | 6165  | -97    |
| Partials   | 1282   | 1292  | +10    |
A note on the failing write tests: many write regression tests fail here because of apparent regressions in GC allocation rate. It is true that the overall GC allocation rate per second is higher in this PR, but that is just a consequence of the increased speed. All else equal, the same benchmarked operation, if run more times per second, will result in higher memory churn per second. Taking a closer look at the failing write tests, we can see that the normalized GC allocation rate (bytes/operation) is almost constant across all of them, if not slightly lower as a result of this change:

The table was too wide for GitHub's markdown, so I used a scrollable code block. Also notice that the increases in total allocation rate are proportional to the speedups in execution time.
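To make the normalization concrete, here is the relationship with made-up numbers (not taken from the benchmark tables):

```math
\text{bytes/op} = \frac{\text{alloc rate}}{\text{throughput}}, \qquad \frac{1.0\ \text{GB/s}}{1.0\ \text{Mops/s}} = \frac{1.2\ \text{GB/s}}{1.2\ \text{Mops/s}} = 1000\ \text{bytes/op}
```

A 20% throughput gain with unchanged per-operation allocation shows up as a 20% higher allocation rate per second, which is exactly the pattern in the failing tests.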
Issue #, if available:
#1095
Description of changes:
Release 1.11.8 created a regression in writing deeply nested containers. Patch point handling seemed to be the culprit, and profiling revealed that adding missing patch points to ancestors that don't have them, when a child ends up needing one, was a particular trouble spot - see #1095 for more info. The current implementation uses recycling data structures to manage the container stack and patch point queue, and while they are a nice abstraction, using them creates some overhead.
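As a concrete illustration of the data shape involved, here is a minimal sketch using the public `IonWriter` API (the depth and payload size are arbitrary, and this is not code from this PR):

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;

import com.amazon.ion.IonType;
import com.amazon.ion.IonWriter;
import com.amazon.ion.system.IonBinaryWriterBuilder;

public class DeepNestingExample {
    public static void main(String[] args) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (IonWriter writer = IonBinaryWriterBuilder.standard().build(out)) {
            int depth = 100; // arbitrary; deep enough to stress patch point handling
            for (int i = 0; i < depth; i++) {
                writer.stepIn(IonType.LIST);
            }
            // A value large enough that the enclosing containers outgrow their
            // preallocated length bytes, so ancestors can end up needing
            // patch points at stepOut time.
            char[] payload = new char[1 << 16];
            Arrays.fill(payload, 'x');
            writer.writeString(new String(payload));
            for (int i = 0; i < depth; i++) {
                writer.stepOut();
            }
        }
    }
}
```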
At the cost of some code cleanliness, we can manage the container and patch point lists and the object recycling logic directly in the writer to gain a performance boost for select datasets, especially deeply nested containers. This PR adds the following optimizations:
- Swaps the `RecyclingStack` and `RecyclingQueue` data structures for direct `ArrayList`s of `PatchPoint` and `ContainerInfo` instances, eliminating some overhead with accessing and especially iterating them.
- Calls `ArrayList<>.ensureCapacity()` before extending the patch point queue, instead of just pushing all of an ancestor's missing patch points in a loop. This can avoid multiple resizes of the underlying array if there are many nested ancestors missing patch points.
- Delays creation of `PatchPoint` instances until they are actually needed. If ancestors need patch points, give them an index into the queue but leave the actual slot null until they need to set the position and length data. (A sketch of these last two ideas follows this list.)
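To make the last two bullets concrete, here is a minimal sketch of the reserve-then-fill pattern. It is illustrative only, not the actual writer code; `PatchPointQueueSketch` and its members are hypothetical stand-ins for the writer's internals:

```java
import java.util.ArrayList;

// A minimal, illustrative sketch (not the actual writer code) of the
// "reserve an index now, create the PatchPoint later" idea.
final class PatchPointQueueSketch {

    static final class PatchPoint {
        long position;
        long length;
    }

    static final class ContainerInfo {
        int patchIndex = -1; // index into patchPoints, or -1 if none assigned
    }

    private final ArrayList<PatchPoint> patchPoints = new ArrayList<>();
    private final ArrayList<ContainerInfo> containerStack = new ArrayList<>();

    // When a child needs a patch point, every enclosing ancestor without one
    // must also get a slot so the queue stays in document order. Reserving
    // capacity once up front avoids repeated resizes of the backing array
    // when many nested ancestors are missing patch points.
    void reserveAncestorSlots(int firstAncestorWithoutPatch) {
        int missing = containerStack.size() - firstAncestorWithoutPatch;
        patchPoints.ensureCapacity(patchPoints.size() + missing);
        for (int i = firstAncestorWithoutPatch; i < containerStack.size(); i++) {
            containerStack.get(i).patchIndex = patchPoints.size();
            patchPoints.add(null); // slot reserved; no PatchPoint allocated yet
        }
    }

    // Only when a container actually ends and needs its length patched do we
    // materialize the PatchPoint and record its position and length.
    void recordPatch(ContainerInfo container, long position, long length) {
        PatchPoint patch = patchPoints.get(container.patchIndex);
        if (patch == null) {
            patch = new PatchPoint();
            patchPoints.set(container.patchIndex, patch);
        }
        patch.position = position;
        patch.length = length;
    }
}
```

With this shape, a deep chain of ancestors costs one `ensureCapacity()` call and zero `PatchPoint` allocations until a container actually needs its length patched; unused null slots can simply be skipped when the patches are applied.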
Below are some benchmarks comparing the current performance (using revision `1991e5b8544fd20c46f4e4ff9c6dec4d6e34e19f`) with the performance of this PR. All benchmarks below were run with ion-java-benchmark-cli on the read-tests-writevalues branch (a quick patch to make the `read` command test `IonWriter.writeValues(IonReader)` with stream copy optimization disabled, since there is no existing option in the benchmark CLI to do so). All benchmarks used 10 warmups, 10 iterations, and 3 forks. Example command used:

PDF of these tables, since they are a bit compressed:
PatchPoint rewrite benchmarks.pdf
TL;DR: 5%–12% reduction in average time for writing deeply nested containers and ~5% reduction in normalized alloc rate; ~2% reduction in average time for other datasets.
Dataset descriptions:
Environment: Amazon Corretto JDK 21, x64 architecture, Alpine Linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory
Environment: Amazon Corretto JDK 17, x64 architecture, Alpine Linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory
Environment: Amazon Corretto JDK 8, x64 architecture, Alpine Linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory
Room for improvement
There is still some potential room for further optimizations here. Some additional ideas:
- `addPatchPoint` currently scans to find the closest ancestor with a patch point; we can keep track of this index and update it whenever we add new patch points. This eliminates some array scanning, but was detrimental to datasets that didn't need it in local testing.
- Manage the backing arrays directly instead of through `ArrayList<>`. This performed even better for single deeply nested structs when I tried it, but caused huge hits in performance on other datasets because of array resizing.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.