[WIP] Add cacheEvictionByExpectedReadCount solution #209


Open
wants to merge 3 commits into base: lh-broker-cache-eviction-optimization

Conversation

@lhotari lhotari commented May 30, 2025

Splits cacheEvictionByExpectedReadCount solution from #207, rebased on apache#24363


lhotari commented May 30, 2025

I added a test case org.apache.pulsar.broker.cache.BrokerEntryCacheTest.testCatchUpReadsWithFailureProxyDisconnectingAllConnections which demonstrates a catch-up read after all consumers disconnect multiple times (every 2 seconds). Cache hits are <10% for all other cache algorithms and configs. It's usually over 70% for cacheEvictionByExpectedReadCount. It's rather obvious that this is the case, since for the other cache algorithms the broker cache gets cleared when all consumers disconnect. cacheEvictionByExpectedReadCount relies on efficient eviction to let the entries remain in the cache, since a consumer might return to consume them later. A rolling restart of shared and key-shared subscription consumers would be a more realistic scenario to cover.
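The retention behavior described above can be illustrated with a tiny self-contained model (class and method names here are hypothetical, not Pulsar's actual implementation): each cached entry carries the number of reads still expected from cursors at or before its position, a cache hit decrements that count, and size-based eviction skips entries that still have expected reads, which is why the cache can survive consumer disconnects.

```java
// Illustrative sketch only; not Pulsar's actual classes or field names.
class CachedEntry {
    final long entryId;
    int expectedReadCount; // reads still expected from backlogged cursors

    CachedEntry(long entryId, int expectedReadCount) {
        this.entryId = entryId;
        this.expectedReadCount = expectedReadCount;
    }

    // Called on a cache hit: one expected read has been consumed.
    void onRead() {
        if (expectedReadCount > 0) {
            expectedReadCount--;
        }
    }

    // Entries with remaining expected reads are kept when evicting by size.
    boolean evictableBySize() {
        return expectedReadCount <= 0;
    }
}

public class ExpectedReadCountSketch {
    public static void main(String[] args) {
        // Entry cached while 2 cursors are at or before its position.
        CachedEntry e = new CachedEntry(1L, 2);
        e.onRead();                              // first cursor reads it
        System.out.println(e.evictableBySize()); // one expected read left
        e.onRead();                              // second cursor reads it
        System.out.println(e.evictableBySize()); // now evictable
    }
}
```

A consumer disconnect simply leaves `expectedReadCount` untouched, so the entry stays eviction-resistant until the returning consumer reads it or the time threshold expires.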

@@ -2264,17 +2265,17 @@ protected void asyncReadEntry(ReadHandle ledger, Position position, ReadEntryCal

protected void asyncReadEntry(ReadHandle ledger, long firstEntry, long lastEntry, OpReadEntry opReadEntry,
Object ctx) {
boolean shouldCacheEntry = opReadEntry.cursor.isCacheReadEntry();
IntSupplier expectedReadCount = () -> opReadEntry.cursor.getNumberOfCursorsAtSamePositionOrBefore();
If we decide to deprecate isCacheReadEntry(), why not delete the related methods like checkCursorsToCacheEntries and isCacheReadEntry?

@@ -4684,7 +4690,7 @@ public boolean checkInactiveLedgerAndRollOver() {


public void checkCursorsToCacheEntries() {
if (minBacklogCursorsForCaching < 1) {
if (minBacklogCursorsForCaching < 1 || config.isCacheEvictionByExpectedReadCount()) {
We can just clean up the method here if it's not necessary.


We had better consider keeping the behavior where the cursor position difference is less than maxBacklogBetweenCursorsForCaching, because there is a real scenario behind it in apache#12258, or at least add a test for that scenario.

Owner Author

This is intentionally not executed when config.isCacheEvictionByExpectedReadCount(). The logic here is related to the "backlogged cursors caching" solution added in apache#12258 as you have pointed out.
The cacheEvictionByExpectedReadCount solution is intended to be an alternative while keeping the behavior of the other caching strategies about the same as before this change.

@berg223 berg223 Jun 3, 2025

We will never use opReadEntry.cursor.isCacheReadEntry() after this PR, so it's effectively deprecated, right?


Sorry, I missed getNumberOfCursorsAtSamePositionOrBefore. It's not deprecated.


// Unpause all consumers
for (Consumer<Long> consumer : consumers) {
consumer.resume();
@berg223 berg223 Jun 3, 2025

Another scenario here: what if a consumer never resumes? In other words, are there any negative effects after a user decides to take offline the consumer that has the largest backlog?

Owner Author

The unnecessarily cached entries will expire based on the configured managedLedgerCacheEvictionTimeThresholdMillis. When a cached entry has remaining expected read counts, it will be deprioritized when evicting entries by size as the cache fills up. The current changes in this PR only go as far as skipping entries with remaining read counts when evicting by size. The idea is that when the cache is still over its size limit after a scan, a new scan is started that also evicts entries with one more remaining read count, repeating until the cache size is under the limit. This matters when the eviction time threshold is long and the cache fills up very quickly under high throughput on the broker. It could be possible to optimize the eviction further to be more efficient in these cases.
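The escalating eviction scan described here could look roughly like this self-contained sketch (names and structure are illustrative, not the actual PR code): each pass evicts entries whose remaining read count is at or below a threshold, and the threshold is raised by one for every rescan until the cache drops under its size limit.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustrative sketch only; not Pulsar's actual eviction implementation.
public class EscalatingEvictionSketch {
    record Entry(long id, int size, int remainingReads) {}

    // Returns the total cache size after eviction.
    static long evictUntilUnderLimit(Deque<Entry> cache, long sizeLimit) {
        long total = cache.stream().mapToLong(Entry::size).sum();
        int threshold = 0; // first pass only evicts fully-read entries
        while (total > sizeLimit && !cache.isEmpty()) {
            final int t = threshold;
            long freed = cache.stream()
                    .filter(e -> e.remainingReads() <= t)
                    .mapToLong(Entry::size)
                    .sum();
            cache.removeIf(e -> e.remainingReads() <= t);
            total -= freed;
            threshold++; // next scan also drops entries with one more read left
        }
        return total;
    }

    public static void main(String[] args) {
        Deque<Entry> cache = new ArrayDeque<>();
        cache.add(new Entry(1, 100, 0));
        cache.add(new Entry(2, 100, 1));
        cache.add(new Entry(3, 100, 2));
        // Limit 150: pass one frees entry 1 (threshold 0), still over the
        // limit, so pass two (threshold 1) frees entry 2 as well.
        System.out.println(evictUntilUnderLimit(cache, 150));
        System.out.println(cache.size());
    }
}
```

Under sustained overload this behaves as described in the comment above: the read-count protection erodes one level per pass, so the strategy degrades toward plain size/expiry eviction rather than letting the cache exceed its limit.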

@berg223 berg223 Jun 3, 2025

Yes, it will degenerate into an expiry strategy. Maybe we can optimize it in the future!

@berg223

berg223 commented Jun 3, 2025

> I added a test case org.apache.pulsar.broker.cache.BrokerEntryCacheTest.testCatchUpReadsWithFailureProxyDisconnectingAllConnections which demonstrates a catch-up read after all consumers disconnect multiple times (every 2 seconds). Cache hits are <10% for all other cache algorithms and configs. It's usually over 70% for cacheEvictionByExpectedReadCount. It's rather obvious that this is the case since the broker cache gets cleared for other cache algorithms when all consumers disconnect. cacheEvictionByExpectedReadCount relies on the efficient eviction to let the entries remain in the cache since a consumer might return to consume the entries later. A rolling restart scenario of shared and key-shared subscription consumers would be a more realistic scenario to cover.

Awesome example for learning how to test the cache in Pulsar! Just leaving a comment here for future reference.


@Test(groups = "broker-api")
@Slf4j
public class BrokerEntryCacheTest extends ProducerConsumerBase {
@berg223 berg223 Jun 3, 2025

The test seems valuable. Will you push this test to the PR? Will it block CI tests if we push it? How do we manage or organize benchmark test code in Pulsar?

Owner Author

This is still experimental. I hope to have useful tests that cover typical broker cache scenarios, but this test isn't there yet. A rolling restart of brokers and a rolling restart of applications (consumers) are more typical scenarios, and those should be covered with various subscription types, since they bring out the inefficiencies in the current broker cache implementations. Mixed consumer speeds for Key_Shared is another case that causes broker cache issues to show up.
