
[improve][broker] optimize the problem of subscription snapshot cache not hitting #24300


Open
wants to merge 5 commits into base: master

Conversation

liudezhi2098
Contributor

Motivation

  • When message acknowledgment is slower than the message consumption rate, subscription cursor synchronization fails to complete. This occurs because:
  1. Current Behavior
    • With large receiver queues (e.g., receiverQueueSize=1000), the cursor never synchronizes
      // receiverQueueSize=1000: many messages are prefetched, so acknowledgments lag far behind dispatch
      Consumer<String> consumer = client.newConsumer(Schema.STRING)
              .topic(topic)
              .subscriptionName("sub")
              .receiverQueueSize(1000)
              .subscribe();
      while (true) {
          Message<String> msg = consumer.receive();
          consumer.acknowledge(msg);
          Thread.sleep(100);
      }
    
    
    • With small queues (e.g., receiverQueueSize=1), synchronization works properly
      // receiverQueueSize=1: messages are fetched one at a time, so acknowledgments keep up with dispatch
      Consumer<String> consumer = client.newConsumer(Schema.STRING)
              .topic(topic)
              .subscriptionName("sub")
              .receiverQueueSize(1)
              .subscribe();
      while (true) {
          Message<String> msg = consumer.receive();
          consumer.acknowledge(msg);
          Thread.sleep(100);
      }
    
    
  2. Root Cause:
    • The SnapshotCache updates too aggressively
    • When advancedMarkDeletePosition executes, valid cache entries are frequently unavailable

Modifications

  1. Cache Update Strategy:
    • Modified the cache to maintain mapping relationships for remote clusters.
  2. Eviction Policy Enhancement:
    • When the cache reaches capacity (maxSnapshotToCache):
      • Allow subsequent snapshots to be added through periodic dynamic adjustment
      • The latest snapshot replaces an intermediate snapshot in the cache, and updates become less frequent as the difference between the latest snapshot time and the mark delete position time increases (see the sketch below).
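
A rough sketch of this eviction idea, with hypothetical names that only illustrate the intended behavior (they do not mirror the actual ReplicatedSubscriptionSnapshotCache changes):

    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Illustrative sketch only; names and structure are hypothetical.
    class TimestampEvictionSketch<S> {
        private final int maxSnapshotToCache;
        private final long snapshotFrequencyMillis;
        // snapshots keyed by the publish time of their marker message
        private final NavigableMap<Long, S> snapshots = new TreeMap<>();

        TimestampEvictionSketch(int maxSnapshotToCache, long snapshotFrequencyMillis) {
            this.maxSnapshotToCache = maxSnapshotToCache;
            this.snapshotFrequencyMillis = snapshotFrequencyMillis;
        }

        synchronized void addNewSnapshot(long publishTime, S snapshot) {
            if (snapshots.size() < maxSnapshotToCache) {
                snapshots.put(publishTime, snapshot);
                return;
            }
            // The further the newest snapshot is from the oldest cached one (which sits
            // near the mark delete position), the larger each time slot gets, so
            // replacements become less frequent.
            long timeSinceFirstSnapshot = publishTime - snapshots.firstKey();
            long timeWindowPerSlot = timeSinceFirstSnapshot / snapshotFrequencyMillis / maxSnapshotToCache;
            long timeSinceLastSnapshot = publishTime - snapshots.lastKey();
            if (timeSinceLastSnapshot < timeWindowPerSlot) {
                return; // too soon; keep the existing entries
            }
            // Replace an intermediate entry with the latest snapshot so the cache keeps
            // both older entries (for a lagging mark delete position) and fresh ones.
            Long middleKey = snapshots.keySet().stream()
                    .skip(snapshots.size() / 2).findFirst().orElseThrow();
            snapshots.remove(middleKey);
            snapshots.put(publishTime, snapshot);
        }
    }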

Verifying this change

  • Make sure that the change passes the CI checks.

(Please pick either of the following options)

This change is a trivial rework / code cleanup without any test coverage.

(or)

This change is already covered by existing tests, such as (please describe tests).

(or)

This change added tests and can be verified as follows:

(example:)

  • Added integration tests for end-to-end deployment with large payloads (10MB)
  • Extended integration test for recovery after broker failure

Does this pull request potentially affect one of the following parts:

If the box was checked, please highlight the changes

  • Dependencies (add or upgrade a dependency)
  • The public API
  • The schema
  • The default values of configurations
  • The threading model
  • The binary protocol
  • The REST endpoints
  • The admin CLI options
  • The metrics
  • Anything that affects deployment

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@liudezhi2098 liudezhi2098 self-assigned this May 14, 2025
@github-actions github-actions bot added the doc-not-needed label (Your PR changes do not impact docs) May 14, 2025
@liangyepianzhou liangyepianzhou added this to the 4.1.0 milestone May 14, 2025
@lhotari
Member

lhotari commented May 14, 2025

  • When message acknowledgment is slower than the message consumption rate, subscription cursor synchronization fails to complete.

@liudezhi2098 Regarding this scenario, is there a way to find out that this happens?
Does the metric pulsar_replicated_subscriptions_timedout_snapshots added in #22381 help with detecting problems?

Member

@lhotari lhotari left a comment

Great work @liudezhi2098! Added a comment about adding code comments. :)

Member

@lhotari lhotari left a comment

Since the publishTime information comes from the Pulsar client, the logic could be brittle. In Pulsar, there's also an optional brokerPublishTime which is not enabled by default (it requires specific configuration in all brokers).

Have you considered what could happen when publishTime values aren't in sync?

I've understood that the "PIP-33: Replicated subscriptions" algorithm relies on vector clocks so that clock sync doesn't become a problem. (earlier discussion)
The snapshots are how the vector clocks are synchronized; at least that's how I interpret it from one viewpoint.

The changes in this PR don't currently make sense to me, mainly due to the use of publishTime.

I'd assume that in your problem scenario, the correct approach would be to tune replicatedSubscriptionsSnapshotFrequencyMillis, replicatedSubscriptionsSnapshotTimeoutSeconds and replicatedSubscriptionsSnapshotMaxCachedPerSubscription values.

@FieldContext(
        category = CATEGORY_SERVER,
        doc = "Frequency of snapshots for replicated subscriptions tracking.")
private int replicatedSubscriptionsSnapshotFrequencyMillis = 1_000;
@FieldContext(
        category = CATEGORY_SERVER,
        doc = "Timeout for building a consistent snapshot for tracking replicated subscriptions state. ")
private int replicatedSubscriptionsSnapshotTimeoutSeconds = 30;
@FieldContext(
        category = CATEGORY_SERVER,
        doc = "Max number of snapshot to be cached per subscription.")
private int replicatedSubscriptionsSnapshotMaxCachedPerSubscription = 10;
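
For reference, these correspond to broker.conf settings; the values below are only an illustrative tuning sketch, not recommendations:

    # Illustrative tuning only; pick values based on the actual acknowledgment lag.
    replicatedSubscriptionsSnapshotFrequencyMillis=1000
    replicatedSubscriptionsSnapshotTimeoutSeconds=30
    replicatedSubscriptionsSnapshotMaxCachedPerSubscription=50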

Have you already done this?

Currently it is a problem that it's necessary to tune the values to fix issues and it's also hard to notice that the problem is occurring.

It looks like future improvements are needed too.

@lhotari
Member

lhotari commented May 14, 2025

Current Behavior

  • With large receiver queues (e.g., receiverQueueSize=1000), the cursor never synchronizes

@liangyepianzhou Do you have a separate repro app where this could be observed with real brokers (let's say 2 Pulsar brokers within docker-compose plus some test app)? Creating a separate Git repo for such a repro would be one approach to share it. Having a runnable repro makes things easier for reviewers too.

@lhotari
Member

lhotari commented May 14, 2025

  • The SnapshotCache updates too aggressively
  • When advancedMarkDeletePosition executes, valid cache entries are frequently unavailable

@liudezhi2098 One thought here is that perhaps there could be interaction between ReplicatedSubscriptionsController and all ReplicatedSubscriptionSnapshotCache instances? Could there be a solution in place so that, when the cache "updates too aggressively", a snapshot would still be completed every replicatedSubscriptionsSnapshotFrequencyMillis?
Since the ReplicatedSubscriptionSnapshotCache is an internal interface, we don't need to keep it as a "cache". It's possible that it doesn't make sense in the revisited solution.
Do you have a chance to try something in this area instead, since I don't think that using publishTime in the solution makes sense?

@liudezhi2098
Contributor Author

liudezhi2098 commented May 14, 2025

Since the publishTime information comes from the Pulsar client, the logic could be brittle. In Pulsar, there's also an optional brokerPublishTime which is not enabled by default (it requires specific configuration in all brokers).

Have you considered what could happen when publishTime values aren't in sync?

I've understood that the "PIP-33: Replicated subscriptions" algorithm relies on vector clocks so that clock sync doesn't become a problem. (earlier discussion) The snapshots are how the vector clocks are synchronized; at least that's how I interpret it from one viewpoint.

The changes in this PR don't currently make sense to me, mainly due to the use of publishTime.

I'd assume that in your problem scenario, the correct approach would be to tune replicatedSubscriptionsSnapshotFrequencyMillis, replicatedSubscriptionsSnapshotTimeoutSeconds and replicatedSubscriptionsSnapshotMaxCachedPerSubscription values.

@FieldContext(
        category = CATEGORY_SERVER,
        doc = "Frequency of snapshots for replicated subscriptions tracking.")
private int replicatedSubscriptionsSnapshotFrequencyMillis = 1_000;
@FieldContext(
        category = CATEGORY_SERVER,
        doc = "Timeout for building a consistent snapshot for tracking replicated subscriptions state. ")
private int replicatedSubscriptionsSnapshotTimeoutSeconds = 30;
@FieldContext(
        category = CATEGORY_SERVER,
        doc = "Max number of snapshot to be cached per subscription.")
private int replicatedSubscriptionsSnapshotMaxCachedPerSubscription = 10;

Have you already done this?

Currently it is a problem that it's necessary to tune the values to fix issues and it's also hard to notice that the problem is occurring.

It looks like future improvements are needed too.

@lhotari The generation of snapshots is completed through the exchange of snapshotRequest and snapshotResponse between two clusters. Ultimately, the ReplicatedSubscriptionsController writes Marker messages, and using publishTime is reliable because this behavior occurs within the same broker.

Of course, the topic may be transferred to another broker, but this is a low-frequency scenario, and its publishTime will not exhibit continuous jumps.

However, we can adopt a simpler approach that doesn't require using publishTime. Instead, we can record the current system time each time the snapshotCache is updated and use this timestamp for dynamic adjustments, but there is a flaw: it cannot truly reflect the time difference between two messages. In some scenarios, it will cause the cache update frequency to decrease.

Member

@lhotari lhotari left a comment

The generation of snapshots is completed through the exchange of snapshotRequest and snapshotResponse between two clusters. Ultimately, the ReplicatedSubscriptionsController writes Marker messages, and using publishTime is reliable because this behavior occurs within the same broker.

Of course, the topic may be transferred to another broker, but this is a low-frequency scenario, and its publishTime will not exhibit continuous jumps.

However, we can adopt a simpler approach that doesn't require using publishTime. Instead, we can record the current system time each time the snapshotCache is updated and use this timestamp for dynamic adjustments.

Thanks for explaining that. I missed the point that the marker messages originate from the same broker. publishTime would be fine due to that detail.

When looking at the current master branch code in ReplicatedSubscriptionSnapshotCache.addNewSnapshot, I'd assume that a potential solution to the problem could be that the current mark delete position is taken into account before purging entries.

It looks like the problem arises when there isn't at least one entry in the cache that is older than the current mark delete position.

I'd suggest revisiting the purging logic in this way:

  • modify addNewSnapshot and add a 2nd parameter which is the current mark delete position
  • always keep the newest entry that is before the current mark delete position when purging entries; all older entries can be purged
  • if the cache remains full after doing this, remove a single entry in the cache so that the new entry can be added.
    • keep the position of the last removed entry so that it's possible to continue the purging algorithm in subsequent calls
    • if there's no previous last removed entry, purge the 2nd entry (assuming that the first entry is the newest entry before the current mark delete position)
    • if there's a previous last removed entry, continue purging from the next entry after the last removed position by first skipping one entry and then removing the 2nd entry
    • if there are no more entries to remove, start removing from the beginning.

This purging logic should always result in making it possible to add a new entry. Since every 2nd entry is removed, it will result in "sampling" so that when the mark delete position finally advances, it advances to the most recent position.
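
To make that ordering concrete, here is a minimal sketch of the suggested purging, using a TreeMap keyed by position; the names are hypothetical and this is not the actual ReplicatedSubscriptionSnapshotCache API:

    import java.util.NavigableMap;
    import java.util.TreeMap;

    // Sketch of the suggested purging order; names and types are hypothetical.
    class PurgingSketch<P extends Comparable<P>, S> {
        private final int maxSnapshotToCache;
        private final NavigableMap<P, S> snapshots = new TreeMap<>();
        private P lastRemovedKey; // where the "every 2nd entry" sampling left off

        PurgingSketch(int maxSnapshotToCache) {
            this.maxSnapshotToCache = maxSnapshotToCache;
        }

        void addNewSnapshot(P position, S snapshot, P currentMarkDeletePosition) {
            // Keep only the newest entry at or before the mark delete position;
            // everything older than that entry can be purged.
            P newestBeforeMarkDelete = snapshots.floorKey(currentMarkDeletePosition);
            if (newestBeforeMarkDelete != null) {
                snapshots.headMap(newestBeforeMarkDelete, false).clear();
            }
            // If the cache is still full, remove a single entry by sampling every
            // 2nd entry, continuing from where the previous call stopped.
            if (snapshots.size() >= maxSnapshotToCache) {
                P start = lastRemovedKey != null ? snapshots.higherKey(lastRemovedKey) : snapshots.firstKey();
                if (start == null) {
                    start = snapshots.firstKey(); // no more entries ahead; start from the beginning
                }
                P toRemove = snapshots.higherKey(start); // skip one entry, remove the next
                if (toRemove == null) {
                    toRemove = snapshots.firstKey();
                }
                snapshots.remove(toRemove);
                lastRemovedKey = toRemove;
            }
            snapshots.put(position, snapshot);
        }
    }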

There's a possibility to increase replicatedSubscriptionsSnapshotMaxCachedPerSubscription parameter to improve the resolution, if that's desirable.

Perhaps the intention of your timestamp based approach is already to achieve something similar?

@liudezhi2098
Contributor Author

liudezhi2098 commented May 14, 2025

When looking at the current master branch code in ReplicatedSubscriptionSnapshotCache.addNewSnapshot, I'd assume that a potential solution to the problem could be that the current mark delete position is taken into account before purging entries.

It looks like the problem arises when there isn't at least one entry in the cache that is older than the current mark delete position.

I'd suggest revisiting the purging logic in this way:

  • modify addNewSnapshot and add a 2nd parameter which is the current mark delete position

  • always keep the newest entry that is before the current mark delete position when purging entries; all older entries can be purged

  • if the cache remains full after doing this, remove a single entry in the cache so that the new entry can be added.

    • keep the position of the last removed entry so that it's possible to continue the purging algorithm in subsequent calls
    • if there's no previous last removed entry, purge the 2nd entry (assuming that the first entry is the newest entry before the current mark delete position)
    • if there's a previous last removed entry, continue purging from the next entry after the last removed position by first skipping one entry and then removing the 2nd entry
    • if there are no more entries to remove, start removing from the beginning.

This purging logic should always result in making it possible to add a new entry. Since every 2nd entry is removed, it will result in "sampling" so that when the mark delete position finally advances, it advances to the most recent position.

There's a possibility to increase replicatedSubscriptionsSnapshotMaxCachedPerSubscription parameter to improve the resolution, if that's desirable.

Perhaps the intention of your timestamp based approach is already to achieve something similar?

@lhotari The intention of the timestamp-based approach is to achieve this purpose; the key is how to update the cache when it is full, and there is no perfect algorithm to solve this problem.

I recommend using median-based eviction for simplicity and trying to make the cache an arithmetic progression in time, because for shared-mode subscriptions there will be individual unconfirmed messages, presenting a very jumpy situation.

@lhotari
Member

lhotari commented May 14, 2025

The intention of the timestamp-based approach is to achieve this purpose; the key is how to update the cache when it is full, and there is no perfect algorithm to solve this problem.

I recommend using median-based eviction for simplicity and trying to make the cache an arithmetic progression in time, because for shared-mode subscriptions there will be individual unconfirmed messages, presenting a very jumpy situation.

You are right about this. The added comments in the code make it easier to understand the intention of the logic. My previous comment about taking the mark deletion position into account in adding snapshots didn't make much sense after rethinking.
I'll review again.

@liudezhi2098 liudezhi2098 requested a review from lhotari May 15, 2025 03:36
@liudezhi2098 liudezhi2098 requested a review from lhotari May 15, 2025 13:49
Member

@lhotari lhotari left a comment

Great work, mainly comments about comments. The 2nd rule for skipping the addition of entries, timeSinceLastSnapshot < timeWindowPerSlot, seems risky to add since it could have surprising consequences. I think it would be better to remove it.

Comment on lines +93 to +95
// The time window length of each time slot, used for dynamic adjustment in the snapshot cache.
// The larger the time slot, the slower the update.
final long timeWindowPerSlot = timeSinceFirstSnapshot / snapshotFrequencyMillis / maxSnapshotToCache;
Member

It's a bit hard to grasp what the "time slot" concept is here.
Let's say timeSinceFirstSnapshot is 25000 ms, snapshotFrequencyMillis is 1000 ms, and maxSnapshotToCache is 10; the result would be 2. What's the point of this?
With low values, this would be close to 0, I guess. This is also why I think this is just unnecessary complexity.

Contributor Author

@liudezhi2098 liudezhi2098 May 16, 2025

What if timeSinceFirstSnapshot is 25 minutes? The goal is that if timeSinceFirstSnapshot becomes longer, the update frequency should be lower.
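
For concreteness, the same formula with an assumed 25 minute gap:

    long timeSinceFirstSnapshot = 25L * 60 * 1000; // 25 minutes = 1_500_000 ms (assumed)
    long snapshotFrequencyMillis = 1000;
    int maxSnapshotToCache = 10;
    // 1_500_000 / 1000 / 10 = 150, so the window grows as the gap grows
    long timeWindowPerSlot = timeSinceFirstSnapshot / snapshotFrequencyMillis / maxSnapshotToCache;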

Member

What if timeSinceFirstSnapshot is 25 minutes? The goal is that if timeSinceFirstSnapshot becomes longer, the update frequency should be lower.

This doesn't seem to be a realistic case. I disagree that the update frequency should become lower. That's exactly my point: if there's a long delay, it will be enforced going forward. I think it's easier to make progress in this PR by removing this rule and adding it later if there's a specific reason to do so.

@lhotari
Member

lhotari commented May 16, 2025

I recommend using median-based eviction for simplicity and trying to make the cache an arithmetic progression in time, because for shared-mode subscriptions there will be individual unconfirmed messages, presenting a very jumpy situation.

I don't see how median-based eviction could make sense. After the cache is filled up, when the median entry is removed and a new entry is added, and this repeats, the result will be that only entries after the median entry will be evicted (assuming no other events happen in between). Eventually there will be a large gap between the two entries in the middle.

Since time is already considered in the algorithm, it seems that an alternative approach would be to evict the entry with the shortest time distance to its adjacent entries. Does that make sense?
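
A small helper sketching that selection rule, assuming snapshots are keyed by publish time in a NavigableMap; the names are hypothetical and this is not actual Pulsar code:

    import java.util.NavigableMap;

    final class EvictionChooser {
        // Pick the entry whose time gap to its neighbours is smallest; dropping it
        // loses the least "resolution". The oldest and newest entries act as anchors.
        static Long pickEntryToEvict(NavigableMap<Long, ?> snapshotsByTime) {
            Long bestKey = null;
            long smallestGap = Long.MAX_VALUE;
            for (Long key : snapshotsByTime.keySet()) {
                Long prev = snapshotsByTime.lowerKey(key);
                Long next = snapshotsByTime.higherKey(key);
                if (prev == null || next == null) {
                    continue; // keep the first and last entries
                }
                long gap = Math.min(key - prev, next - key);
                if (gap < smallestGap) {
                    smallestGap = gap;
                    bestKey = key;
                }
            }
            return bestKey; // null when there are fewer than three entries
        }
    }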

@lhotari
Member

lhotari commented May 22, 2025

I recommend using median-based eviction for simplicity and trying to make the cache an arithmetic progression in time, because for shared-mode subscriptions there will be individual unconfirmed messages, presenting a very jumpy situation.

I don't see how median-based eviction could make sense. After the cache is filled up, when the median entry is removed and a new entry is added, and this repeats, the result will be that only entries after the median entry will be evicted (assuming no other events happen in between). Eventually there will be a large gap between the two entries in the middle.

Since time is already considered in the algorithm, it seems that an alternative approach would be to evict the entry with the shortest time distance to its adjacent entries. Does that make sense?

@liudezhi2098 Just wondering if you are fine with the provided feedback on this PR? It would be great to address this issue in replicated subscriptions and get this PR to completion.

@lhotari
Member

lhotari commented May 30, 2025

@liudezhi2098 Are you planning to continue working on this? I think that this is a really great improvement to address a long-standing issue with replicated subscriptions.

@lhotari lhotari requested review from merlimat, nodece and dao-jun May 30, 2025 05:37
@lhotari lhotari added the triage/lhotari/important label (lhotari's triaging label for important issues or PRs) May 30, 2025
@lhotari
Member

lhotari commented Jun 2, 2025

@liudezhi2098 There's also a long-standing issue #10054 which is addressed by #16651. I have updated the PR 16651 description, rebased it, and revisited it slightly. Please review.

Labels
doc-not-needed (Your PR changes do not impact docs), triage/lhotari/important (lhotari's triaging label for important issues or PRs)
3 participants