@austnwil austnwil commented Oct 2, 2025

Issue #, if available:

#1095

Description of changes:
Release 1.11.8 introduced a regression in writing deeply nested containers. Patch point handling appeared to be the culprit, and profiling revealed that adding missing patch points to ancestors that don't have them, when a child ends up needing one, was a particular trouble spot - see #1095 for more info. The current implementation uses recycling data structures to manage the container stack and patch point queue; while a nice abstraction, using them creates some overhead.

At the cost of some code cleanliness, we can manage the container and patch point lists and object recycling logic directly in the writer to gain a performance boost for select datasets, especially deeply-nested containers. This PR adds the following optimizations:

  • Drop the RecyclingStack and RecyclingQueue data structures for direct ArrayLists of PatchPoint and ContainerInfo instances, eliminating some overhead with accessing and especially iterating them.
  • Call ArrayList<>.ensureCapacity() before extending the patch point queue instead of just pushing all the ancestors' missing patch points in a loop. This can avoid multiple resizes of the underlying array when many nested ancestors are missing patch points.
  • Defer construction of PatchPoint instances until they are actually needed. If ancestors need patch points, give them an index into the queue but leave the actual slot null until they need to set the position and length data.
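The combination of the last two optimizations can be illustrated with a minimal sketch (hypothetical names and structure, not the actual `IonRawBinaryWriter` code): ancestors receive an index into the patch list up front, the backing array is grown at most once via `ensureCapacity()`, and the `PatchPoint` object itself is only allocated when its position and length data are actually written.

```java
import java.util.ArrayList;

// Sketch of deferred PatchPoint construction over a plain ArrayList.
// Class and method names here are illustrative, not from the PR.
public class PatchSketch {
    static final class PatchPoint {
        final long position;
        final long length;
        PatchPoint(long position, long length) {
            this.position = position;
            this.length = length;
        }
    }

    // Slots stay null until the corresponding container actually
    // needs to record its position and length.
    private final ArrayList<PatchPoint> patchPoints = new ArrayList<>();

    // Reserve `count` slots for ancestors in one shot, growing the
    // backing array at most once instead of once per push.
    int reserveSlots(int count) {
        int firstIndex = patchPoints.size();
        patchPoints.ensureCapacity(firstIndex + count);
        for (int i = 0; i < count; i++) {
            patchPoints.add(null); // deferred: no PatchPoint allocated yet
        }
        return firstIndex;
    }

    // Fill a reserved slot only once the data is known.
    void setPatch(int index, long position, long length) {
        patchPoints.set(index, new PatchPoint(position, length));
    }

    long allocatedCount() {
        return patchPoints.stream().filter(p -> p != null).count();
    }

    public static void main(String[] args) {
        PatchSketch sketch = new PatchSketch();
        int first = sketch.reserveSlots(3); // three ancestors need slots
        sketch.setPatch(first + 2, 100, 8); // only one actually writes data
        System.out.println(sketch.allocatedCount()); // prints 1
    }
}
```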

Below are some benchmarks comparing the current performance (using revision 1991e5b8544fd20c46f4e4ff9c6dec4d6e34e19f) with the performance of this PR. All benchmarks below were run with ion-java-benchmark-cli on the read-tests-writevalues branch (a quick patch to make the read command test IonWriter.writeValues(IonReader) with stream copy optimization disabled, since there is no existing option in the benchmark CLI to do so). All benchmarks used 10 warmups, 10 iterations, and 3 forks. Example command used:

java -jar ./bm-baseline.jar read --mode AverageTime --time-unit microseconds --iterations 10 --warmups 10 --forks 3 --ion-reader non_incremental  --io-type buffer /testfiles/deeply_nested_structs_utf.10n

PDF of these tables since they are a bit compressed:
PatchPoint rewrite benchmarks.pdf

TLDR: 5% - 12% reduction in average time for writing deeply nested containers, ~5% reduction in normalized alloc rate. ~2% reduction in average time for other datasets.

Dataset descriptions:

| Dataset | Size | Description |
| --- | --- | --- |
| Deeply nested | 289K | A single string within 60000 nested structs. Only one top-level value |
| Real world 1 | 989K | Data matching a "real world" schema generated from https://github.com/amazon-ion/ion-data-generator |
| Real world 2 | 1.0M | Data matching a "real world" schema generated from https://github.com/amazon-ion/ion-data-generator |
| Real world 3 | 987K | Data matching a "real world" schema generated from https://github.com/amazon-ion/ion-data-generator |
| Pres2020 no annotations | 1.0M | A few thousand top-level structs, most with a couple levels of container nesting within. Most non-container fields are strings. Converted from a GEDCOM file documenting ancestry of American presidents up to 2020 |
| Pres2020 annotations | 1.1M | Same as previous, but with particular fields common across many structs represented as annotations instead. Most structs have one or two annotations |
| Basic strings | 670B | A few basic top-level ASCII and Unicode strings |

Environment: Amazon Corretto JDK 21, x64 architecture, Alpine linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory

| Dataset | Current (us/op) | This PR (us/op) | % Change | Variance (worst of two) | Norm. alloc rate, current (B/op) | Norm. alloc rate, this PR (B/op) | % Change | Variance (worst of two) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deeply nested | 10621.793 ± 112.881 | 10088.611 ± 80.708 | -5.01% | 1.06% | 11084767.241 ± 110.615 | 10533190.647 ± 84.077 | -4.97% | 0.00% |
| Real world 1 | 20586.392 ± 1180.709 | 19054.038 ± 1570.736 | -7.44% | 8.24% | 13899990.722 ± 69157.126 | 13396635.182 ± 519748.699 | -3.62% | 3.88% |
| Real world 2 | 24480.632 ± 203.719 | 23874.206 ± 101.660 | -2.47% | 0.83% | 15495084.190 ± 18013.479 | 15457438.352 ± 40.052 | -0.24% | 0.12% |
| Real world 3 | 24413.800 ± 924.370 | 23441.727 ± 708.197 | -3.98% | 3.79% | 14415924.878 ± 3380.797 | 14413980.866 ± 2629.584 | -0.01% | 0.02% |
| Pres2020 no annotations | 15833.413 ± 83.767 | 15146.047 ± 81.000 | -4.34% | 0.53% | 10533416.794 ± 31.708 | 10533257.758 ± 12.070 | 0.00% | 0.00% |
| Pres2020 annotations | 18245.335 ± 447.373 | 17771.301 ± 320.160 | -2.59% | 2.45% | 11285762.098 ± 199057.399 | 10960218.230 ± 199051.693 | -2.88% | 1.82% |
| Basic strings | 1.844 ± 0.011 | 1.807 ± 0.005 | -2.00% | 0.60% | 9144.609 ± 0.019 | 8933.948 ± 10.251 | -2.30% | 0.11% |

Environment: Amazon Corretto JDK 17, x64 architecture, Alpine linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory

| Dataset | Current (us/op) | This PR (us/op) | % Change | Variance (worst of two) | Norm. alloc rate, current (B/op) | Norm. alloc rate, this PR (B/op) | % Change | Variance (worst of two) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deeply nested | 11000.954 ± 37.886 | 10262.905 ± 17.737 | -6.70% | 0.34% | 11084640.504 ± 111.947 | 10533035.390 ± 43.513 | -4.97% | 0.00% |
| Real world 1 | 17876.805 ± 341.296 | 17426.210 ± 196.961 | -2.52% | 1.91% | 13460266.200 ± 546734.423 | 12891122.188 ± 546751.313 | -4.22% | 4.24% |
| Real world 2 | 27805.232 ± 224.197 | 26221.252 ± 329.584 | -5.69% | 1.26% | 15974108.776 ± 281970.374 | 14904709.751 ± 233734.061 | -6.69% | 1.77% |
| Real world 3 | 25149.352 ± 221.853 | 24811.630 ± 766.999 | -1.34% | 3.09% | 15746212.497 ± 780308.960 | 13813367.949 ± 287918.900 | -12.27% | 4.96% |
| Pres2020 no annotations | 16263.373 ± 259.098 | 16239.790 ± 96.380 | -0.14% | 1.59% | 10036784.757 ± 477421.269 | 10533338.009 ± 19.498 | 4.94% | 4.76% |
| Pres2020 annotations | 17961.988 ± 38.234 | 17823.818 ± 232.562 | -0.76% | 1.30% | 10589863.247 ± 78665.450 | 10589552.437 ± 78647.245 | 0.00% | 0.74% |
| Basic strings | 1.867 ± 0.011 | 1.831 ± 0.006 | -1.92% | 0.59% | 9160.703 ± 23.065 | 8936.694 ± 0.019 | -2.44% | 0.25% |

Environment: Amazon Corretto JDK 8, x64 architecture, Alpine linux in Docker via WSL on Windows 11 host, CPU 4 cores 8 threads, 32 GB main memory

| Dataset | Current (us/op) | This PR (us/op) | % Change | Variance (worst of two) | Norm. alloc rate, current (B/op) | Norm. alloc rate, this PR (B/op) | % Change | Variance (worst of two) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Deeply nested | 8729.400 ± 88.527 | 7662.881 ± 70.752 | -12.21% | 1.01% | 11083273.433 ± 2.139 | 10531697.260 ± 1.883 | -4.97% | 0.00% |
| Real world 1 | 16211.947 ± 97.143 | 16327.516 ± 272.161 | 0.71% | 1.67% | 12956962.981 ± 3.958 | 13525626.997 ± 546720.684 | 4.38% | 4.04% |
| Real world 2 | 22379.949 ± 456.448 | 22278.247 ± 100.963 | -0.45% | 2.04% | 15724645.990 ± 18021.327 | 15491693.924 ± 205616.897 | -1.48% | 1.33% |
| Real world 3 | 22022.018 ± 125.408 | 21365.285 ± 250.677 | -2.98% | 1.17% | 14812749.674 ± 538.362 | 14751036.573 ± 538.783 | -0.41% | 0.00% |
| Pres2020 no annotations | 14501.792 ± 80.212 | 14071.542 ± 116.019 | -2.96% | 0.82% | 11165154.384 ± 3.564 | 11164914.314 ± 3.459 | 0.00% | 0.00% |
| Pres2020 annotations | 16795.880 ± 101.376 | 16384.975 ± 157.863 | -2.44% | 0.96% | 11385480.438 ± 297.317 | 11221267.028 ± 78641.134 | -1.44% | 0.70% |
| Basic strings | 1.697 ± 0.007 | 1.687 ± 0.006 | -0.58% | 0.41% | 9181.334 ± 10.251 | 9000.000 ± 0.001 | -1.97% | 0.11% |

Room for improvement

There is still some potential room for further optimizations here. Some additional ideas:

  • It is not actually necessary to walk down the container stack in addPatchPoint to find the closest ancestor with a patch point; we can keep track of this index and update it whenever we add new patch points. This eliminates some array scanning but was detrimental to datasets that didn't need it in local testing.
  • It might be possible to craft a more efficient implementation with some direct array management over ArrayList<>. This performed even better for single deeply nested structs when I tried it but caused huge hits in performance to other datasets because of array resizing.
  • Tweaking the initial size of the patch point list may be fruitful.
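The first idea above could be sketched roughly as follows (hypothetical code, not what the PR implements, since it hurt datasets that didn't need it): instead of scanning down the container stack in `addPatchPoint` for the nearest ancestor that already has a patch point, the writer would maintain the deepest already-patched depth as a single field.

```java
// Illustrative sketch: track the deepest container depth that already has a
// patch point, so finding ancestors that still need one is O(1) bookkeeping
// rather than a scan over the container stack.
public class PatchIndexTracker {
    private int depth = 0;          // number of currently open containers
    private int deepestPatched = 0; // containers at depths 1..deepestPatched have patch points

    void stepIn() {
        depth++;
    }

    void stepOut() {
        depth--;
        // A patched container that just closed no longer counts.
        if (deepestPatched > depth) {
            deepestPatched = depth;
        }
    }

    // Ancestors (including the current container) still missing patch points.
    // Without the tracked index, computing this requires scanning the stack.
    int missingPatchPoints() {
        return depth - deepestPatched;
    }

    // Give every container from deepestPatched+1 .. depth a patch point.
    void addPatchPoints() {
        deepestPatched = depth;
    }
}
```

For example, after stepping into three containers, `missingPatchPoints()` would return 3; after `addPatchPoints()` and one more `stepIn()`, it would return 1.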

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

codecov bot commented Oct 2, 2025

Codecov Report

❌ Patch coverage is 92.30769% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 67.98%. Comparing base (3c1b6b1) to head (c831f05).
⚠️ Report is 127 commits behind head on master.

Files with missing lines Patch % Lines
...va/com/amazon/ion/impl/bin/IonRawBinaryWriter.java 92.30% 0 Missing and 4 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##             master    #1101      +/-   ##
============================================
+ Coverage     67.23%   67.98%   +0.74%     
- Complexity     5484     5621     +137     
============================================
  Files           159      160       +1     
  Lines         23025    23289     +264     
  Branches       4126     4176      +50     
============================================
+ Hits          15481    15832     +351     
+ Misses         6262     6165      -97     
- Partials       1282     1292      +10     

austnwil commented Oct 3, 2025

A note on the failing write tests:

Many write regression tests fail here because of apparent regressions in GC alloc rate. It is true that the overall GC alloc rate per second is higher in this PR, but that is simply a consequence of the speedups: all else equal, the same benchmarked operation, if run more times per second, will result in higher memory churn per second.
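That relationship can be made concrete with a quick calculation (illustrative numbers, not taken from the benchmark runs): total alloc rate in bytes/second is just bytes/operation times operations/second, so a constant normalized rate still yields a higher total rate when the operation gets faster.

```java
// Demonstrates that total alloc rate (B/s) rises with ops/s even when the
// normalized alloc rate (B/op) is unchanged. Numbers are made up.
public class AllocRateDemo {
    public static void main(String[] args) {
        double bytesPerOp = 100_000.0; // normalized alloc rate, unchanged by the change under test
        double oldUsPerOp = 10.0;      // old average time per operation, microseconds
        double newUsPerOp = 9.0;       // new (faster) time per operation

        double oldOpsPerSec = 1_000_000.0 / oldUsPerOp;
        double newOpsPerSec = 1_000_000.0 / newUsPerOp;

        // Bytes/sec grows purely because ops/sec grew.
        double oldBytesPerSec = bytesPerOp * oldOpsPerSec;
        double newBytesPerSec = bytesPerOp * newOpsPerSec;

        // The ratio of total alloc rates equals the ratio of speeds (10/9 here).
        System.out.printf("old: %.0f B/s, new: %.0f B/s%n", oldBytesPerSec, newBytesPerSec);
    }
}
```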

Taking a closer look at the failing write tests, we can see that normalized gc alloc rate (bytes/operation) is almost constant across all of them, if not slightly lower as a result of this change:

+------------------------------------------------------------------------------------------------+------------------+------------------+----------+------------------+------------------+----------+-----------------------+----------------------+----------+
|                                              Test                                              | Average time old | Average time new | % change |  Alloc rate old  |  Alloc rate new  | % change | Norm. alloc rate old  | Norm. alloc rate new | % change |
+------------------------------------------------------------------------------------------------+------------------+------------------+----------+------------------+------------------+----------+-----------------------+----------------------+----------+
| Detect Regression (nestedStruct, write --api streaming --ion-length-preallocation 1)           | 0.446 ± 0.006    | 0.438 ± 0.009    | -1.79%   | 213.953 ± 2.889  | 217.151 ± 4.279  | 1.49%    | 104979.847 ± 55.600   | 104773.413 ± 88.387  | -0.19%   |
| Detect Regression (nestedStruct, write --api dom --ion-length-preallocation 1)                 | 0.450 ± 0.006    | 0.441 ± 0.008    | -2.00%   | 612.277 ± 8.753  | 623.518 ± 11.362 | 1.83%    | 303094.809 ± 1.309    | 302918.615 ± 1.318   | -0.05%   |
| Detect Regression (nestedList, write --api streaming --ion-length-preallocation 1)             | 0.391 ± 0.006    | 0.386 ± 0.008    | -1.27%   | 550.212 ± 8.791  | 557.424 ± 11.508 | 1.31%    | 237100.475 ± 0.840    | 236876.221 ± 0.735   | -0.09%   |
| Detect Regression (nestedList, write --api dom --ion-length-preallocation 1)                   | 0.404 ± 0.007    | 0.389 ± 0.008    | -3.71%   | 697.308 ± 11.610 | 723.466 ± 15.573 | 3.75%    | 310200.145 ± 1.343    | 310014.151 ± 84.864  | -0.05%   |
| Detect Regression (sexp, write --api streaming --ion-length-preallocation 1)                   | 0.179 ± 0.002    | 0.169 ± 0.001    | -5.58%   | 528.300 ± 5.714  | 557.942 ± 2.485  | 5.61%    | 104158.286 ± 0.428    | 103982.083 ± 0.260   | -0.16%   |
| Detect Regression (sexp, write --api dom --ion-length-preallocation 1)                         | 0.185 ± 0.004    | 0.182 ± 0.011    | -1.62%   | 885.702 ± 18.218 | 903.586 ± 48.087 | 2.01%    | 180338.441 ± 0.443    | 180162.119 ± 0.394   | -0.09%   |
| Detect Regression (realWorldDataSchema01, write --api dom --ion-length-preallocation 1)        | 2.355 ± 0.073    | 2.295 ± 0.078    | -2.54%   | 633.744 ± 19.451 | 650.286 ± 22.002 | 2.61%    | 1642083.686 ± 25.796  | 1641842.208 ± 22.527 | -0.01%   |
| Detect Regression (realWorldDataSchema02, write --api dom --ion-length-preallocation 1)        | 1.308 ± 0.038    | 1.256 ± 0.015    | -3.97%   | 445.675 ± 12.582 | 463.788 ± 5.328  | 4.06%    | 641506.609 ± 30.976   | 641256.388 ± 37.905  | -0.03%   |
| Detect Regression (realWorldDataSchema02, write --api streaming --ion-length-preallocation 1)  | 0.823 ± 0.009    | 0.804 ± 0.010    | -2.30%   | 323.524 ± 3.359  | 330.678 ± 4.041  | 2.21%    | 293090.388 ± 12.227   | 292865.441 ± 11.924  | -0.07%   |
| Detect Regression (realWorldDataSchema03, write --api streaming --ion-length-preallocation 1)  | 0.629 ± 0.010    | 0.607 ± 0.008    | -3.49%   | 449.738 ± 7.424  | 465.880 ± 6.359  | 3.58%    | 311380.638 ± 1.938    | 311155.548 ± 1.921   | -0.07%   |
| Detect Regression (realWorldDataSchema03, write --api dom --ion-length-preallocation 1)        | 0.677 ± 0.007    | 0.651 ± 0.008    | -3.84%   | 587.127 ± 6.301  | 607.409 ± 7.509  | 3.45%    | 437673.136 ± 1666.447 | 435585.753 ± 42.217  | -0.47%   |
+------------------------------------------------------------------------------------------------+------------------+------------------+----------+------------------+------------------+----------+-----------------------+----------------------+----------+

Table was too wide for gh's markdown so I used a scrollable code block.

Also notice that increases in total alloc rate are proportional to speedups in execution time.

@austnwil austnwil marked this pull request as ready for review October 3, 2025 19:01