[WIP] Add cacheEvictionByExpectedReadCount solution #209


Open
wants to merge 3 commits into base: lh-broker-cache-eviction-optimization

Conversation

@lhotari lhotari commented May 30, 2025

Splits cacheEvictionByExpectedReadCount solution from #207, rebased on apache#24363


lhotari commented May 30, 2025

I added a test case org.apache.pulsar.broker.cache.BrokerEntryCacheTest.testCatchUpReadsWithFailureProxyDisconnectingAllConnections which demonstrates a catch-up read after all consumers disconnect multiple times (every 2 seconds). Cache hits are <10% for all other cache algorithms and configs. It's usually over 70% for cacheEvictionByExpectedReadCount. It's rather obvious that this is the case, since for the other cache algorithms the broker cache gets cleared when all consumers disconnect. cacheEvictionByExpectedReadCount relies on efficient eviction to let the entries remain in the cache, since a consumer might return to consume them later. A rolling restart of shared and key-shared subscription consumers would be a more realistic scenario to cover.
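The retention behavior described above can be illustrated with a tiny self-contained model (class and method names here are hypothetical, not Pulsar's actual implementation): each cached entry carries the number of reads still expected from cursors at or before its position, a cache hit decrements that count, and size-based eviction skips entries that still have expected reads, which is why the cache can survive consumer disconnects.

```java
// Illustrative sketch only; not Pulsar's actual classes or field names.
class CachedEntry {
    final long entryId;
    int expectedReadCount; // reads still expected from backlogged cursors

    CachedEntry(long entryId, int expectedReadCount) {
        this.entryId = entryId;
        this.expectedReadCount = expectedReadCount;
    }

    // Called on a cache hit: one expected read has been consumed.
    void onRead() {
        if (expectedReadCount > 0) {
            expectedReadCount--;
        }
    }

    // Entries with remaining expected reads are kept when evicting by size.
    boolean evictableBySize() {
        return expectedReadCount <= 0;
    }
}

public class ExpectedReadCountSketch {
    public static void main(String[] args) {
        // Entry cached while 2 cursors are at or before its position.
        CachedEntry e = new CachedEntry(1L, 2);
        e.onRead();                              // first cursor reads it
        System.out.println(e.evictableBySize()); // one expected read left
        e.onRead();                              // second cursor reads it
        System.out.println(e.evictableBySize()); // now evictable
    }
}
```

A consumer disconnect simply leaves `expectedReadCount` untouched, so the entry stays eviction-resistant until the returning consumer reads it or the time threshold expires.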

@@ -2264,17 +2265,17 @@ protected void asyncReadEntry(ReadHandle ledger, Position position, ReadEntryCal

protected void asyncReadEntry(ReadHandle ledger, long firstEntry, long lastEntry, OpReadEntry opReadEntry,
Object ctx) {
boolean shouldCacheEntry = opReadEntry.cursor.isCacheReadEntry();
IntSupplier expectedReadCount = () -> opReadEntry.cursor.getNumberOfCursorsAtSamePositionOrBefore();
If we decide to deprecate isCacheReadEntry(), why not delete the related methods like checkCursorsToCacheEntries and isCacheReadEntry?

@@ -4684,7 +4690,7 @@ public boolean checkInactiveLedgerAndRollOver() {


public void checkCursorsToCacheEntries() {
if (minBacklogCursorsForCaching < 1) {
if (minBacklogCursorsForCaching < 1 || config.isCacheEvictionByExpectedReadCount()) {
We can just clean up the method here if it's not necessary.


We had better consider keeping the behavior where the cursor position difference is less than maxBacklogBetweenCursorsForCaching, because there is a real scenario behind it in apache#12258, or at least add a test for that scenario.

Owner Author

This is intentionally not executed when config.isCacheEvictionByExpectedReadCount(). The logic here is related to the "backlogged cursors caching" solution added in apache#12258 as you have pointed out.
The cacheEvictionByExpectedReadCount solution is intended to be an alternative while keeping the behavior of the other caching strategies about the same as before this change.

@berg223 berg223 Jun 3, 2025

We will never use opReadEntry.cursor.isCacheReadEntry() after this PR, so it's effectively deprecated, right?


Sorry, I missed getNumberOfCursorsAtSamePositionOrBefore. It's not deprecated.


// Unpause all consumers
for (Consumer<Long> consumer : consumers) {
consumer.resume();
@berg223 berg223 Jun 3, 2025

Another scenario here: what if a consumer never resumes? In other words, are there any negative effects after a user decides to take offline the consumer that has the largest backlog?

Owner Author

The unnecessarily cached entries will expire based on the configured managedLedgerCacheEvictionTimeThresholdMillis. When a cached entry has remaining expected read counts, it will be deprioritized when evicting entries by size as the cache fills up. The current changes in this PR only go as far as skipping entries with remaining read counts when evicting by size. The idea is that when the cache is still over its size limit after a scan, a new scan is started that also evicts entries with one more remaining read count, repeating until the cache size is under the limit. This matters when the eviction time threshold is long and the cache fills up very quickly under high throughput on the broker. It could be possible to optimize the eviction further to be more efficient in these cases.
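The escalating eviction scan described here could look roughly like this self-contained sketch (names and structure are illustrative, not the actual PR code): each pass evicts entries whose remaining read count is at or below a threshold, and the threshold is raised by one for every rescan until the cache drops under its size limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only; not Pulsar's actual eviction implementation.
public class EscalatingEvictionSketch {
    record Entry(long id, int size, int remainingReads) {}

    // Returns the total cache size after eviction.
    static long evictUntilUnderLimit(Deque<Entry> cache, long sizeLimit) {
        long total = cache.stream().mapToLong(Entry::size).sum();
        int threshold = 0; // first pass only evicts fully-read entries
        while (total > sizeLimit && !cache.isEmpty()) {
            final int t = threshold;
            long freed = cache.stream()
                    .filter(e -> e.remainingReads() <= t)
                    .mapToLong(Entry::size)
                    .sum();
            cache.removeIf(e -> e.remainingReads() <= t);
            total -= freed;
            threshold++; // next scan also drops entries with one more read left
        }
        return total;
    }

    public static void main(String[] args) {
        Deque<Entry> cache = new ArrayDeque<>();
        cache.add(new Entry(1, 100, 0));
        cache.add(new Entry(2, 100, 1));
        cache.add(new Entry(3, 100, 2));
        // Limit 150: pass one frees entry 1 (threshold 0), still over the
        // limit, so pass two (threshold 1) frees entry 2 as well.
        System.out.println(evictUntilUnderLimit(cache, 150));
        System.out.println(cache.size());
    }
}
```

Under sustained overload this behaves as described in the comment above: the read-count protection erodes one level per pass, so the strategy degrades toward plain size/expiry eviction rather than letting the cache exceed its limit.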

@berg223 berg223 Jun 3, 2025

Yes, it will degenerate into an expiry strategy. Maybe we can optimize it in the future!

@berg223

berg223 commented Jun 3, 2025

> I added a test case org.apache.pulsar.broker.cache.BrokerEntryCacheTest.testCatchUpReadsWithFailureProxyDisconnectingAllConnections which demonstrates a catch-up read after all consumers disconnect multiple times (every 2 seconds). Cache hits are <10% for all other cache algorithms and configs. It's usually over 70% for cacheEvictionByExpectedReadCount. It's rather obvious that this is the case since the broker cache gets cleared for other cache algorithms when all consumers disconnect. cacheEvictionByExpectedReadCount relies on the efficient eviction to let the entries remain in the cache since a consumer might return to consume the entries later. A rolling restart scenario of shared and key-shared subscription consumers would be a more realistic scenario to cover.

Awesome example for learning how to test the cache in Pulsar! Just leaving a comment here for future reference.


@Test(groups = "broker-api")
@Slf4j
public class BrokerEntryCacheTest extends ProducerConsumerBase {
@berg223 berg223 Jun 3, 2025

The test seems valuable. Will you push this test to the PR? Will it block CI tests if we push it? How do we manage or organize benchmark test code in Pulsar?

Owner Author

This is still experimental. I hope to have useful tests that cover typical broker cache scenarios, but this test isn't there yet. A rolling restart of brokers and a rolling restart of applications (consumers) are more typical scenarios, and those should be covered with various subscription types, since they bring out the inefficiencies in the current broker cache implementations. Mixed consumer speeds for Key_Shared is another case that causes broker cache issues to show up.
