
[WIP] Broker cache refactoring and add CacheEvictionByExpectedReadCount eviction strategy #207


Closed
wants to merge 77 commits into from

Conversation

lhotari
Owner

@lhotari lhotari commented May 26, 2025

Motivation

Pulsar's broker cache contains multiple gaps as described in an email
https://lists.apache.org/thread/xm095hnjo0cffbdy8ckysmzzm90gsbnp
There's also a slightly related issue apache#23466

Status

These changes are WIP and many tests might still fail. This work was started in October 2024 and has been rebased several times.

Modifications

  • refactor the cache implementation
    • remove unnecessary generics usage from RangeCache
  • add a new cache eviction strategy, CacheEvictionByExpectedReadCount, based on the expected read count of entries (see the sketch after this list)
    • when an entry is read by one cursor, there are usually more cursors that will read the same entry from the cache very soon. This strategy ensures that the later cursors find the entry in the cache unless it has already been evicted due to size or time limits.
  • replace per-cache eviction with eviction handled by a single shared eviction queue
    • each ledger has its own cache instance, and previously eviction was handled separately per cache instance. This was a limiting factor in adding the new CacheEvictionByExpectedReadCount strategy
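
A rough sketch of the expected-read-count idea (class, field, and method names below are illustrative only, not the actual types in this PR): each cached entry tracks how many more cursor reads are still expected, and eviction treats the entry as removable once that count reaches zero, while size- and time-based eviction still apply regardless.

class ExpectedReadCountEntrySketch {
    // how many more cursor reads of this entry are still expected
    private final java.util.concurrent.atomic.AtomicInteger expectedReadCount;

    ExpectedReadCountEntrySketch(int expectedReads) {
        this.expectedReadCount = new java.util.concurrent.atomic.AtomicInteger(expectedReads);
    }

    // called when a cursor reads the entry from the cache
    void onCacheRead() {
        expectedReadCount.decrementAndGet();
    }

    // called when another cursor is discovered that will also read this entry
    void onAdditionalExpectedRead() {
        expectedReadCount.incrementAndGet();
    }

    // eviction may treat the entry as removable once no further reads are expected
    boolean noFurtherReadsExpected() {
        return expectedReadCount.get() <= 0;
    }
}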

UPDATE about progress

Changes will be split into at least 2 PRs.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

for (LedgerEntry e : ledgerEntries) {
EntryImpl entry = RangeEntryCacheManagerImpl.create(e, interceptor);
int expectedReadCountVal = expectedReadCount.getAsInt();

Can we place this line before the for loop?

Owner Author

yes
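
For clarity, a sketch of the suggested reordering (only the quoted lines are shown, rearranged; the surrounding method and declarations are omitted):

int expectedReadCountVal = expectedReadCount.getAsInt(); // resolved once, before the loop
for (LedgerEntry e : ledgerEntries) {
    EntryImpl entry = RangeEntryCacheManagerImpl.create(e, interceptor);
    // ... expectedReadCountVal is reused for each created entry ...
}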


public Pair<Integer, Long> evictLEntriesBeforeTimestamp(long timestampNanos) {
return evictEntries(
(e, c) -> e.timestampNanos < timestampNanos ? EvictionResult.REMOVE : EvictionResult.STASH_AND_STOP,
@berg223 berg223 May 28, 2025

Why do we set STASH_AND_STOP here instead of STASH? The difference is that with STASH_AND_STOP we stop processing the queue, so the method behaves as if it won't evict all entries before the timestamp, because there may still be entries before the timestamp left in removalQueue.

Owner Author

The most recent version of the code is clearer.

Since all entries in the queue are in insert order, there's no longer a need to keep processing the remaining entries after hitting an entry that hasn't expired. The difference between STASH and STASH_AND_STOP is that STASH adds the entry to the stash and continues, while STASH_AND_STOP also stops processing the remaining entries. STASH_AND_STOP is used when evicting by timestamp, as here; plain STASH is used when evicting by size.

While evicting by size, entries that aren't evictable or haven't expired are added to the stash.
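
A minimal sketch of the two eviction passes (the enum name and the timestamp lambda mirror the quoted code; the Entry/Counters shapes and the size predicate are simplified assumptions, not the PR's exact types):

import java.util.function.BiFunction;

class EvictionSketch {
    enum EvictionResult { REMOVE, STASH, STASH_AND_STOP }

    static class Entry { long timestampNanos; }
    static class Counters { long bytesStillToEvict; }

    // evict by timestamp: the queue is in insert order, so processing can stop
    // at the first entry that has not expired yet
    static BiFunction<Entry, Counters, EvictionResult> beforeTimestamp(long cutoffNanos) {
        return (e, c) -> e.timestampNanos < cutoffNanos
                ? EvictionResult.REMOVE
                : EvictionResult.STASH_AND_STOP;
    }

    // evict by size: keep scanning until enough bytes are freed; entries that
    // cannot be evicted yet are stashed and processing continues
    static BiFunction<Entry, Counters, EvictionResult> toFreeBytes() {
        return (e, c) -> c.bytesStillToEvict > 0
                ? EvictionResult.REMOVE
                : EvictionResult.STASH;
    }
}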


Yes! That's reasonable!

class RangeCacheRemovalQueue {
// The removal queue is unbounded, but we allocate memory in chunks to avoid frequent memory allocations.
private static final int REMOVAL_QUEUE_CHUNK_SIZE = 128 * 1024;
private final MpscUnboundedArrayQueue<RangeCacheEntryWrapper> removalQueue = new MpscUnboundedArrayQueue<>(
@berg223 berg223 May 28, 2025

Would it be better if removalQueue were a PriorityQueue ordered by expectedReadCount? Although a PriorityQueue is not fast when expectedReadCount changes very frequently.


I got it. The queue is used to keep the insert order.

Owner Author

Thanks for asking! These questions help explain the design and will be useful content for the PIP when I write it.

Answer:
Perhaps. The reason for the MPSC queue is that it has minimal overhead. The broker cache in Pulsar is a performance hotspot, and minimizing the overhead of adding and removing entries is one of the design goals.
Without proper benchmarks, it's obviously hard to compare the actual difference in performance.
It's just a gut feeling that minimizing work and simplifying things usually ends up producing the fastest algorithm.

I would assume that some sort of "generation" based removal could be a useful direction, taking some inspiration from garbage collection algorithms. The current removal queue + stash are already a step in that direction.

Another reason why this problem is more like GC is that expectedReadCount is dynamic. It is incremented in at least 2 cases:

  • when entries get added to the replay queue
  • when a cache entry already exists and an attempt is made to add a new entry for the same position

Since expectedReadCount is dynamic, typical priority queue algorithms wouldn't be able to handle that.

Yet another design goal is to minimize synchronization in the broker cache so that it can scale when the system has a lot of CPU cores. In the current design, eviction itself is intentionally single threaded. Later on, sharding could be added if this part became a bottleneck, but that is unlikely.
Minimizing synchronization helps ensure that insertions and lookups in the cache aren't constrained by shared locks and can scale when the broker runs on more CPU cores. This is also why StampedLock is used in RangeCacheEntryWrapper. Many operations in RangeCache are racy, and stamped locks are used to ensure sufficient consistency.
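
To illustrate the StampedLock point, a minimal sketch (field and method names are illustrative; RangeCacheEntryWrapper in the PR has its own shape): readers use an optimistic read and only fall back to a read lock when a concurrent writer invalidated the stamp, so lookups rarely block, while the single eviction thread takes the write lock only briefly.

import java.util.concurrent.locks.StampedLock;

class EntryWrapperSketch {
    private final StampedLock lock = new StampedLock();
    private Object value; // the cached entry, null once evicted

    Object get() {
        long stamp = lock.tryOptimisticRead();
        Object v = value;
        if (!lock.validate(stamp)) {
            // a writer was active; retry under a real read lock
            stamp = lock.readLock();
            try {
                v = value;
            } finally {
                lock.unlockRead(stamp);
            }
        }
        return v;
    }

    // the single eviction thread clears the slot under the write lock
    Object clearForEviction() {
        long stamp = lock.writeLock();
        try {
            Object old = value;
            value = null;
            return old;
        } finally {
            lock.unlockWrite(stamp);
        }
    }
}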


if (log.isDebugEnabled()) {
log.debug("[{}] Adding entry to cache: {} - size: {}", ml.getName(), entry.getPosition(),
entryLength);
}

Position position = entry.getPosition();
if (entries.exists(position)) {
return false;
CachedEntry previousEntry = entries.get(position);

Should we call previousEntry.release() after its lifecycle ends? Because the retain method was called in entries.get(position).

Owner Author

good catch, fixed.

CachedEntry previousEntry = entries.get(position);
if (previousEntry != null && entry.getReadCountHandler() != null) {
// If the entry is already in the cache, increase the expected read count on the existing entry
if (previousEntry.increaseReadCount(entry.getReadCountHandler().getExpectedReadCount())) {

Why do we call increaseReadCount instead of setReadCount here? Why do we need to handle this case?

Owner Author

Since the existing cached entry isn't replaced, it's possible that there are now new active consumers which might read this entry later. It's not possible to be accurate about the expected read count, so this is a best-effort way to keep entries in the cache longer, until expiration. The inaccuracy doesn't cause much harm, since by the time the cache fills up many entries might have already reached their expected read count, and that would free up more space.
Entries will eventually get removed after the timeout. Increasing the timeout for cache hits could be problematic, since cached entries actually retain more direct memory than just the entry itself. This is because the cached ByteBuf is sliced from a parent buffer which could be larger, and that parent buffer is retained until all slices have been released. The managedLedgerCacheCopyEntries setting makes a copy each time instead, but that comes with the tradeoff of allocating a new buffer and copying into it.
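
To make the direct-memory point concrete, a small sketch of the slice-vs-copy tradeoff (standard Netty ByteBuf semantics, not code from this PR):

import io.netty.buffer.ByteBuf;
import io.netty.buffer.ByteBufAllocator;

class CacheBufferSketch {
    // a retained slice shares the parent's memory: the parent buffer cannot be
    // freed until every slice held by the cache has been released
    static ByteBuf cacheSliced(ByteBuf parent, int index, int length) {
        return parent.retainedSlice(index, length);
    }

    // copying (what managedLedgerCacheCopyEntries opts into) costs an allocation
    // and a copy, but the cached entry no longer pins the larger parent buffer
    static ByteBuf cacheCopied(ByteBufAllocator allocator, ByteBuf parent, int index, int length) {
        ByteBuf copy = allocator.directBuffer(length);
        copy.writeBytes(parent, index, length);
        return copy;
    }
}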

entry.setRefCnt(1);
entry.decreaseReadCountOnRelease = true;
@berg223 berg223 May 28, 2025

This seems dangerous, because we need to prevent bad usage in the future. Maybe it would be better to invoke setDecreaseReadCountOnRelease and release after the cursor completes the read. By the way, is there any other way to decrease expectedReadCount? I only found it decreased in the EntryImpl.release method.

Owner Author

Sure, there's potential for misuse in internal APIs like this. It can be improved with comments.
More comments should be added explaining how the read count decrementing works. The EntryImpl#create(Entry other) method is used in 2 locations where a new wrapper is returned for an existing entry. In most cases, it's desirable to decrement the read count when the entry is released. However, there are still corner cases with PendingReadsManager: how to handle merging of reads, and how to merge expected read counts when 2 separate reads get merged into one. It's possible that more information than just the count is needed to get the expected read count right. Having actual tests that simulate pending reads and verify the read count could be useful.

@berg223 berg223 May 28, 2025

Thanks for your patient answer! Why not explicitly call a method to decrease the count after the cursor or ledger completes the read? Do you think it's bad to couple them?

Owner Author

Yes, it's better to decouple that. Calling release is already the way to signal that the use of the entry is done.
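
A small sketch of that release-as-signal pattern (field and method names are illustrative; the PR's EntryImpl and read count handling differ in detail):

class WrappedEntrySketch {
    private final CachedEntrySketch cached;
    private final boolean decreaseReadCountOnRelease;

    WrappedEntrySketch(CachedEntrySketch cached, boolean decreaseReadCountOnRelease) {
        this.cached = cached;
        this.decreaseReadCountOnRelease = decreaseReadCountOnRelease;
    }

    // release() is the existing signal that the caller is done with the entry,
    // so the expected read count is decremented here rather than via a separate
    // explicit call after the cursor finishes reading
    void release() {
        if (decreaseReadCountOnRelease) {
            cached.decrementExpectedReadCount();
        }
        // reference-count bookkeeping elided
    }
}

class CachedEntrySketch {
    private final java.util.concurrent.atomic.AtomicInteger expectedReadCount =
            new java.util.concurrent.atomic.AtomicInteger();

    void decrementExpectedReadCount() {
        expectedReadCount.decrementAndGet();
    }
}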

@lhotari
Owner Author

lhotari commented May 29, 2025

In order to make it easier to get these changes merged to Pulsar master for Pulsar 4.1, I'll split the changes:

  • Broker cache eviction refactoring and use of single queue for removal
  • expected read count based retention of entries in the cache

@lhotari
Owner Author

lhotari commented May 29, 2025

  • Broker cache eviction refactoring and use of single queue for removal

This is WIP in #208

@lhotari
Owner Author

lhotari commented May 30, 2025

  • Broker cache eviction refactoring and use of single queue for removal

This is WIP in #208

PR created upstream in apache#24363. @berg223 please review
Mailing list discussion thread: https://lists.apache.org/thread/ddzzc17b0c218ozq9tx0r3rx5sgljfb0

@lhotari
Owner Author

lhotari commented May 30, 2025

  • expected read count based retention of entries in the cache

WIP in #209

@lhotari lhotari closed this May 30, 2025