@austnwil austnwil commented Oct 2, 2025

Issue #, if available:

#1095

Description of changes:
Release 1.11.8 introduced a regression in writing deeply nested containers. Patch point handling appeared to be the culprit, and profiling revealed that adding missing patch points to ancestors that don't have them, when a child ends up needing one, was a particular trouble spot - see #1095 for more info. The current implementation uses recycling data structures to manage the container stack and patch point queue; while a nice abstraction, using them creates some overhead.

At the cost of some code cleanliness, we can manage the container and patch point lists and object recycling logic directly in the writer to gain a performance boost for select datasets, especially deeply-nested containers. This PR adds the following optimizations:

  • Drop the RecyclingStack and RecyclingQueue data structures for direct ArrayLists of PatchPoint and ContainerInfo instances, eliminating some overhead with accessing and especially iterating them.
  • Call ArrayList<>.ensureCapacity() before extending the patch point queue instead of just pushing all the ancestors' missing patch points in a loop. This can avoid multiple resizes of the underlying array when many nested ancestors are missing patch points.
  • Defer construction of PatchPoint instances until they are actually needed. If ancestors need patch points, give them an index into the queue but leave the actual slot null until they need to set the position and length data.
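The combination of the last two optimizations can be illustrated with a minimal sketch (hypothetical names and structure, not the actual `IonRawBinaryWriter` code): ancestors receive an index into the patch list up front, the backing array is grown at most once via `ensureCapacity()`, and the `PatchPoint` object itself is only allocated when its position and length data are actually written.

```java
import java.util.ArrayList;

// Sketch of deferred PatchPoint construction over a plain ArrayList.
// Class and method names here are illustrative, not from the PR.
public class PatchSketch {
    static final class PatchPoint {
        final long position;
        final long length;
        PatchPoint(long position, long length) {
            this.position = position;
            this.length = length;
        }
    }

    // Slots stay null until the corresponding container actually
    // needs to record its position and length.
    private final ArrayList<PatchPoint> patchPoints = new ArrayList<>();

    // Reserve `count` slots for ancestors in one shot, growing the
    // backing array at most once instead of once per push.
    int reserveSlots(int count) {
        int firstIndex = patchPoints.size();
        patchPoints.ensureCapacity(firstIndex + count);
        for (int i = 0; i < count; i++) {
            patchPoints.add(null); // deferred: no PatchPoint allocated yet
        }
        return firstIndex;
    }

    // Fill a reserved slot only once the data is known.
    void setPatch(int index, long position, long length) {
        patchPoints.set(index, new PatchPoint(position, length));
    }

    long allocatedCount() {
        return patchPoints.stream().filter(p -> p != null).count();
    }

    public static void main(String[] args) {
        PatchSketch sketch = new PatchSketch();
        int first = sketch.reserveSlots(3); // three ancestors need slots
        sketch.setPatch(first + 2, 100, 8); // only one actually writes data
        System.out.println(sketch.allocatedCount()); // prints 1
    }
}
```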

Below are some benchmarks comparing the current performance (using revision 1991e5b8544fd20c46f4e4ff9c6dec4d6e34e19f) with the performance of this PR. All benchmarks below were run with ion-java-benchmark-cli on the read-tests-writevalues branch (a quick patch to make the read command test IonWriter.writeValues(IonReader) with stream copy optimization disabled, since there is no existing option in the benchmark CLI to do so). All benchmarks used 10 warmups, 10 iterations, and 3 forks. Example command used:

java -jar ./bm-baseline.jar read --mode AverageTime --time-unit microseconds --iterations 10 --warmups 10 --forks 3 --ion-reader non_incremental  --io-type buffer /testfiles/deeply_nested_structs_utf.10n

PDF of these tables since they are a bit compressed:
PatchPoint rewrite benchmarks.pdf

TLDR: 5% - 12% reduction in average time for writing deeply nested containers, ~5% reduction in normalized alloc rate. ~2% reduction in average time for other datasets.

Dataset descriptions:

| Dataset | Size | Description |
| --- | --- | --- |
| Deeply nested | 289K | A single string within 60000 nested structs. Only one top-level value |
| Real world 1 | 989K | Data matching a "real world" schema generated from https://github.com/amazon-ion/ion-data-generator |
| Real world 2 | 1.0M | Data matching a "real world" schema generated from https://github.com/amazon-ion/ion-data-generator |
| Real world 3 | 987K | Data matching a "real world" schema generated from https://github.com/amazon-ion/ion-data-generator |
| Pres2020 no annotations | 1.0M | A few thousand top-level structs, most with a couple levels of container nesting within. Most non-container fields are strings. Converted from a GEDCOM file documenting ancestry of American presidents up to 2020 |
| Pres2020 annotations | 1.1M | Same as previous, but with particular fields common across many structs represented as annotations instead. Most structs have one or two annotations |
| Basic strings | 670B | A few basic top-level ASCII and Unicode strings |

Environment: Amazon Corretto JDK 21, x64 architecture, Alpine linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory

| Dataset | Current (us/op) | This PR (us/op) | % Change | Variance (worst of two) | Norm. alloc rate, current (B/op) | Norm. alloc rate, this PR (B/op) | % Change | Variance (worst of two) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deeply nested | 10621.793 ± 112.881 | 10088.611 ± 80.708 | -5.01% | 1.06% | 11084767.241 ± 110.615 | 10533190.647 ± 84.077 | -4.97% | 0.00% |
| Real world 1 | 20586.392 ± 1180.709 | 19054.038 ± 1570.736 | -7.44% | 8.24% | 13899990.722 ± 69157.126 | 13396635.182 ± 519748.699 | -3.62% | 3.88% |
| Real world 2 | 24480.632 ± 203.719 | 23874.206 ± 101.660 | -2.47% | 0.83% | 15495084.190 ± 18013.479 | 15457438.352 ± 40.052 | -0.24% | 0.12% |
| Real world 3 | 24413.800 ± 924.370 | 23441.727 ± 708.197 | -3.98% | 3.79% | 14415924.878 ± 3380.797 | 14413980.866 ± 2629.584 | -0.01% | 0.02% |
| Pres2020 no annotations | 15833.413 ± 83.767 | 15146.047 ± 81.000 | -4.34% | 0.53% | 10533416.794 ± 31.708 | 10533257.758 ± 12.070 | 0.00% | 0.00% |
| Pres2020 annotations | 18245.335 ± 447.373 | 17771.301 ± 320.160 | -2.59% | 2.45% | 11285762.098 ± 199057.399 | 10960218.230 ± 199051.693 | -2.88% | 1.82% |
| Basic strings | 1.844 ± 0.011 | 1.807 ± 0.005 | -2.00% | 0.60% | 9144.609 ± 0.019 | 8933.948 ± 10.251 | -2.30% | 0.11% |

Environment: Amazon Corretto JDK 17, x64 architecture, Alpine linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory

| Dataset | Current (us/op) | This PR (us/op) | % Change | Variance (worst of two) | Norm. alloc rate, current (B/op) | Norm. alloc rate, this PR (B/op) | % Change | Variance (worst of two) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deeply nested | 11000.954 ± 37.886 | 10262.905 ± 17.737 | -6.70% | 0.34% | 11084640.504 ± 111.947 | 10533035.390 ± 43.513 | -4.97% | 0.00% |
| Real world 1 | 17876.805 ± 341.296 | 17426.210 ± 196.961 | -2.52% | 1.91% | 13460266.200 ± 546734.423 | 12891122.188 ± 546751.313 | -4.22% | 4.24% |
| Real world 2 | 27805.232 ± 224.197 | 26221.252 ± 329.584 | -5.69% | 1.26% | 15974108.776 ± 281970.374 | 14904709.751 ± 233734.061 | -6.69% | 1.77% |
| Real world 3 | 25149.352 ± 221.853 | 24811.630 ± 766.999 | -1.34% | 3.09% | 15746212.497 ± 780308.960 | 13813367.949 ± 287918.900 | -12.27% | 4.96% |
| Pres2020 no annotations | 16263.373 ± 259.098 | 16239.790 ± 96.380 | -0.14% | 1.59% | 10036784.757 ± 477421.269 | 10533338.009 ± 19.498 | 4.94% | 4.76% |
| Pres2020 annotations | 17961.988 ± 38.234 | 17823.818 ± 232.562 | -0.76% | 1.30% | 10589863.247 ± 78665.450 | 10589552.437 ± 78647.245 | 0.00% | 0.74% |
| Basic strings | 1.867 ± 0.011 | 1.831 ± 0.006 | -1.92% | 0.59% | 9160.703 ± 23.065 | 8936.694 ± 0.019 | -2.44% | 0.25% |

Environment: Amazon Corretto JDK 8, x64 architecture, Alpine linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory

| Dataset | Current (us/op) | This PR (us/op) | % Change | Variance (worst of two) | Norm. alloc rate, current (B/op) | Norm. alloc rate, this PR (B/op) | % Change | Variance (worst of two) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deeply nested | 8729.400 ± 88.527 | 7662.881 ± 70.752 | -12.21% | 1.01% | 11083273.433 ± 2.139 | 10531697.260 ± 1.883 | -4.97% | 0.00% |
| Real world 1 | 16211.947 ± 97.143 | 16327.516 ± 272.161 | 0.71% | 1.67% | 12956962.981 ± 3.958 | 13525626.997 ± 546720.684 | 4.38% | 4.04% |
| Real world 2 | 22379.949 ± 456.448 | 22278.247 ± 100.963 | -0.45% | 2.04% | 15724645.990 ± 18021.327 | 15491693.924 ± 205616.897 | -1.48% | 1.33% |
| Real world 3 | 22022.018 ± 125.408 | 21365.285 ± 250.677 | -2.98% | 1.17% | 14812749.674 ± 538.362 | 14751036.573 ± 538.783 | -0.41% | 0.00% |
| Pres2020 no annotations | 14501.792 ± 80.212 | 14071.542 ± 116.019 | -2.96% | 0.82% | 11165154.384 ± 3.564 | 11164914.314 ± 3.459 | 0.00% | 0.00% |
| Pres2020 annotations | 16795.880 ± 101.376 | 16384.975 ± 157.863 | -2.44% | 0.96% | 11385480.438 ± 297.317 | 11221267.028 ± 78641.134 | -1.44% | 0.70% |
| Basic strings | 1.697 ± 0.007 | 1.687 ± 0.006 | -0.58% | 0.41% | 9181.334 ± 10.251 | 9000.000 ± 0.001 | -1.97% | 0.11% |

Room for improvement

There is still some potential room for further optimizations here. Some additional ideas:

  • It is not actually necessary to walk down the container stack in addPatchPoint to find the closest ancestor with a patch point; we can keep track of this index and update it whenever we add new patch points. This eliminates some array scanning but was detrimental to datasets that didn't need it in local testing.
  • It might be possible to craft a more efficient implementation with some direct array management over ArrayList<>. This performed even better for single deeply nested structs when I tried it but caused huge hits in performance to other datasets because of array resizing.
  • Tweaking the initial size of the patch point list may be fruitful.
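The first idea above could be sketched roughly as follows (hypothetical code, not what the PR implements, since it hurt datasets that didn't need it): instead of scanning down the container stack in `addPatchPoint` for the nearest ancestor that already has a patch point, the writer would maintain the deepest already-patched depth as a single field.

```java
// Illustrative sketch: track the deepest container depth that already has a
// patch point, so finding ancestors that still need one is O(1) bookkeeping
// rather than a scan over the container stack.
public class PatchIndexTracker {
    private int depth = 0;          // number of currently open containers
    private int deepestPatched = 0; // containers at depths 1..deepestPatched have patch points

    void stepIn() {
        depth++;
    }

    void stepOut() {
        depth--;
        // A patched container that just closed no longer counts.
        if (deepestPatched > depth) {
            deepestPatched = depth;
        }
    }

    // Ancestors (including the current container) still missing patch points.
    // Without the tracked index, computing this requires scanning the stack.
    int missingPatchPoints() {
        return depth - deepestPatched;
    }

    // Give every container from deepestPatched+1 .. depth a patch point.
    void addPatchPoints() {
        deepestPatched = depth;
    }
}
```

For example, after stepping into three containers, `missingPatchPoints()` would return 3; after `addPatchPoints()` and one more `stepIn()`, it would return 1.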

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 92.30769% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.98%. Comparing base (3c1b6b1) to head (c831f05).
⚠️ Report is 127 commits behind head on master.

Files with missing lines Patch % Lines
...va/com/amazon/ion/impl/bin/IonRawBinaryWriter.java 92.30% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1101      +/-   ##
============================================
+ Coverage     67.23%   67.98%   +0.74%     
- Complexity     5484     5621     +137     
============================================
  Files           159      160       +1     
  Lines         23025    23289     +264     
  Branches       4126     4176      +50     
============================================
+ Hits          15481    15832     +351     
+ Misses         6262     6165      -97     
- Partials       1282     1292      +10     

austnwil commented Oct 3, 2025

A note on the failing write tests:

Many write regression tests fail here because of apparent regressions in GC alloc rate. It is true that the overall GC alloc rate per second is higher in this PR, but that is simply a consequence of the speedups: all else equal, the same benchmarked operation, if run more times per second, will result in higher memory churn per second.
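That relationship can be made concrete with a quick calculation (illustrative numbers, not taken from the benchmark runs): total alloc rate in bytes/second is just bytes/operation times operations/second, so a constant normalized rate still yields a higher total rate when the operation gets faster.

```java
// Demonstrates that total alloc rate (B/s) rises with ops/s even when the
// normalized alloc rate (B/op) is unchanged. Numbers are made up.
public class AllocRateDemo {
    public static void main(String[] args) {
        double bytesPerOp = 100_000.0; // normalized alloc rate, unchanged by the change under test
        double oldUsPerOp = 10.0;      // old average time per operation, microseconds
        double newUsPerOp = 9.0;       // new (faster) time per operation

        double oldOpsPerSec = 1_000_000.0 / oldUsPerOp;
        double newOpsPerSec = 1_000_000.0 / newUsPerOp;

        // Bytes/sec grows purely because ops/sec grew.
        double oldBytesPerSec = bytesPerOp * oldOpsPerSec;
        double newBytesPerSec = bytesPerOp * newOpsPerSec;

        // The ratio of total alloc rates equals the ratio of speeds (10/9 here).
        System.out.printf("old: %.0f B/s, new: %.0f B/s%n", oldBytesPerSec, newBytesPerSec);
    }
}
```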

Taking a closer look at the failing write tests, we can see that normalized gc alloc rate (bytes/operation) is almost constant across all of them, if not slightly lower as a result of this change:

+------------------------------------------------------------------------------------------------+------------------+------------------+----------+------------------+------------------+----------+-----------------------+----------------------+----------+
|                                              Test                                              | Average time old | Average time new | % change |  Alloc rate old  |  Alloc rate new  | % change | Norm. alloc rate old  | Norm. alloc rate new | % change |
+------------------------------------------------------------------------------------------------+------------------+------------------+----------+------------------+------------------+----------+-----------------------+----------------------+----------+
| Detect Regression (nestedStruct, write --api streaming --ion-length-preallocation 1)           | 0.446 ± 0.006    | 0.438 ± 0.009    | -1.79%   | 213.953 ± 2.889  | 217.151 ± 4.279  | 1.49%    | 104979.847 ± 55.600   | 104773.413 ± 88.387  | -0.19%   |
| Detect Regression (nestedStruct, write --api dom --ion-length-preallocation 1)                 | 0.450 ± 0.006    | 0.441 ± 0.008    | -2.00%   | 612.277 ± 8.753  | 623.518 ± 11.362 | 1.83%    | 303094.809 ± 1.309    | 302918.615 ± 1.318   | -0.05%   |
| Detect Regression (nestedList, write --api streaming --ion-length-preallocation 1)             | 0.391 ± 0.006    | 0.386 ± 0.008    | -1.27%   | 550.212 ± 8.791  | 557.424 ± 11.508 | 1.31%    | 237100.475 ± 0.840    | 236876.221 ± 0.735   | -0.09%   |
| Detect Regression (nestedList, write --api dom --ion-length-preallocation 1)                   | 0.404 ± 0.007    | 0.389 ± 0.008    | -3.71%   | 697.308 ± 11.610 | 723.466 ± 15.573 | 3.75%    | 310200.145 ± 1.343    | 310014.151 ± 84.864  | -0.05%   |
| Detect Regression (sexp, write --api streaming --ion-length-preallocation 1)                   | 0.179 ± 0.002    | 0.169 ± 0.001    | -5.58%   | 528.300 ± 5.714  | 557.942 ± 2.485  | 5.61%    | 104158.286 ± 0.428    | 103982.083 ± 0.260   | -0.16%   |
| Detect Regression (sexp, write --api dom --ion-length-preallocation 1)                         | 0.185 ± 0.004    | 0.182 ± 0.011    | -1.62%   | 885.702 ± 18.218 | 903.586 ± 48.087 | 2.01%    | 180338.441 ± 0.443    | 180162.119 ± 0.394   | -0.09%   |
| Detect Regression (realWorldDataSchema01, write --api dom --ion-length-preallocation 1)        | 2.355 ± 0.073    | 2.295 ± 0.078    | -2.54%   | 633.744 ± 19.451 | 650.286 ± 22.002 | 2.61%    | 1642083.686 ± 25.796  | 1641842.208 ± 22.527 | -0.01%   |
| Detect Regression (realWorldDataSchema02, write --api dom --ion-length-preallocation 1)        | 1.308 ± 0.038    | 1.256 ± 0.015    | -3.97%   | 445.675 ± 12.582 | 463.788 ± 5.328  | 4.06%    | 641506.609 ± 30.976   | 641256.388 ± 37.905  | -0.03%   |
| Detect Regression (realWorldDataSchema02, write --api streaming --ion-length-preallocation 1)  | 0.823 ± 0.009    | 0.804 ± 0.010    | -2.30%   | 323.524 ± 3.359  | 330.678 ± 4.041  | 2.21%    | 293090.388 ± 12.227   | 292865.441 ± 11.924  | -0.07%   |
| Detect Regression (realWorldDataSchema03, write --api streaming --ion-length-preallocation 1)  | 0.629 ± 0.010    | 0.607 ± 0.008    | -3.49%   | 449.738 ± 7.424  | 465.880 ± 6.359  | 3.58%    | 311380.638 ± 1.938    | 311155.548 ± 1.921   | -0.07%   |
| Detect Regression (realWorldDataSchema03, write --api dom --ion-length-preallocation 1)        | 0.677 ± 0.007    | 0.651 ± 0.008    | -3.84%   | 587.127 ± 6.301  | 607.409 ± 7.509  | 3.45%    | 437673.136 ± 1666.447 | 435585.753 ± 42.217  | -0.47%   |
+------------------------------------------------------------------------------------------------+------------------+------------------+----------+------------------+------------------+----------+-----------------------+----------------------+----------+

Table was too wide for gh's markdown so I used a scrollable code block.

Also notice that increases in total alloc rate are proportional to speedups in execution time.

@austnwil austnwil marked this pull request as ready for review October 3, 2025 19:01