[WIP] Add cacheEvictionByExpectedReadCount solution #209
base: lh-broker-cache-eviction-optimization
Conversation
Force-pushed from 7df3818 to c69b40d.
I added a test case org.apache.pulsar.broker.cache.BrokerEntryCacheTest.testCatchUpReadsWithFailureProxyDisconnectingAllConnections which demonstrates a catch-up read after all consumers disconnect multiple times (every 2 seconds). Cache hits are below 10% for all other cache algorithms and configs, while they are usually over 70% for cacheEvictionByExpectedReadCount. This is to be expected: the broker cache gets cleared for the other cache algorithms when all consumers disconnect, while cacheEvictionByExpectedReadCount relies on efficient eviction to let the entries remain in the cache, since a consumer might return to consume them later. A rolling restart of shared and key-shared subscription consumers would be a more realistic scenario to cover.
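For context, a minimal standalone sketch of the disconnect-and-catch-up pattern the test exercises, written against the public Pulsar client API. The topic name, message counts, and timings are illustrative assumptions; this is not the actual test code:

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.Schema;
import org.apache.pulsar.client.api.SubscriptionType;

public class CatchUpReadSketch {
    public static void main(String[] args) throws Exception {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650") // illustrative broker URL
                .build();
        String topic = "persistent://public/default/cache-test"; // illustrative topic

        // Create the subscription up front so the backlog is retained
        // while no consumer is connected.
        client.newConsumer(Schema.INT64).topic(topic)
                .subscriptionName("sub").subscriptionType(SubscriptionType.Shared)
                .subscribe().close();

        try (Producer<Long> producer = client.newProducer(Schema.INT64).topic(topic).create()) {
            for (int round = 0; round < 5; round++) {
                // Produce while all consumers are disconnected: a backlog builds up.
                for (long i = 0; i < 1000; i++) {
                    producer.send(i);
                }
                // Reconnect and catch up. With cacheEvictionByExpectedReadCount the
                // broker keeps entries cached across the disconnect, so this read
                // should mostly hit the cache instead of going to BookKeeper.
                Consumer<Long> consumer = client.newConsumer(Schema.INT64).topic(topic)
                        .subscriptionName("sub").subscriptionType(SubscriptionType.Shared)
                        .subscribe();
                Message<Long> msg;
                while ((msg = consumer.receive(1, TimeUnit.SECONDS)) != null) {
                    consumer.acknowledge(msg);
                }
                consumer.close(); // all consumers disconnect again
                TimeUnit.SECONDS.sleep(2); // the test disconnects roughly every 2 seconds
            }
        }
        client.close();
    }
}
```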
```diff
@@ -2264,17 +2265,17 @@ protected void asyncReadEntry(ReadHandle ledger, Position position, ReadEntryCal

     protected void asyncReadEntry(ReadHandle ledger, long firstEntry, long lastEntry, OpReadEntry opReadEntry,
                                   Object ctx) {
-        boolean shouldCacheEntry = opReadEntry.cursor.isCacheReadEntry();
+        IntSupplier expectedReadCount = () -> opReadEntry.cursor.getNumberOfCursorsAtSamePositionOrBefore();
```
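To make the intent of this change concrete, here is a simplified sketch (illustrative names, not the PR's actual classes) of how a cached entry can carry an expected read count that is consumed by reads and later consulted by eviction, instead of a boolean per-cursor "should cache" decision:

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.IntSupplier;

// Simplified sketch of an entry whose cache priority is its remaining
// expected reads; names are illustrative, not the PR's actual classes.
class CachedEntrySketch {
    private final AtomicInteger expectedReadCount;

    CachedEntrySketch(IntSupplier expectedReadCount) {
        // Evaluated at insert time, e.g. from
        // cursor.getNumberOfCursorsAtSamePositionOrBefore() in the PR.
        this.expectedReadCount = new AtomicInteger(expectedReadCount.getAsInt());
    }

    void onRead() {
        // Each successful read by a cursor consumes one expected read.
        expectedReadCount.updateAndGet(c -> Math.max(0, c - 1));
    }

    boolean hasRemainingReads() {
        // Size-based eviction can deprioritize entries that still have
        // expected readers instead of dropping them immediately.
        return expectedReadCount.get() > 0;
    }
}
```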
If we decide to deprecate isCacheReadEntry(), why not delete the related methods like checkCursorsToCacheEntries and isCacheReadEntry?
```diff
@@ -4684,7 +4690,7 @@ public boolean checkInactiveLedgerAndRollOver() {

     public void checkCursorsToCacheEntries() {
-        if (minBacklogCursorsForCaching < 1) {
+        if (minBacklogCursorsForCaching < 1 || config.isCacheEvictionByExpectedReadCount()) {
```
We can just clean up the method here if it's not necessary.
We had better consider keeping the behavior where the cursor position difference is less than maxBacklogBetweenCursorsForCaching, because there is a real scenario behind apache#12258, or at least add a test for that scenario.
This is intentionally not executed when config.isCacheEvictionByExpectedReadCount() is enabled. The logic here is related to the "backlogged cursors caching" solution added in apache#12258, as you have pointed out. The cacheEvictionByExpectedReadCount solution is intended to be an alternative, while keeping the behavior of the other caching strategies about the same as before this change.
We will never use opReadEntry.cursor.isCacheReadEntry() after this PR, so actually it's deprecated, right?
Sorry for missing getNumberOfCursorsAtSamePositionOrBefore. It's not deprecated.
```java
// Unpause all consumers
for (Consumer<Long> consumer : consumers) {
    consumer.resume();
}
```
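As a point of reference for the discussion below, the pause/produce/resume cycle around this snippet can be sketched standalone like this; names and counts are illustrative, while Consumer.pause() and Consumer.resume() are the standard Pulsar client API calls:

```java
import java.util.List;
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.Message;
import org.apache.pulsar.client.api.Producer;

// Sketch: pause all consumers so a backlog builds up, then resume them to
// trigger catch-up reads that should be served from the broker cache.
class PauseResumeSketch {
    static void runRound(Producer<Long> producer, List<Consumer<Long>> consumers)
            throws Exception {
        // Pause all consumers: messages keep arriving at the broker but are
        // not dispatched, so the subscriptions fall behind.
        for (Consumer<Long> consumer : consumers) {
            consumer.pause();
        }
        for (long i = 0; i < 1000; i++) {
            producer.send(i);
        }
        // Unpause all consumers: the resulting catch-up read is where the
        // expected-read-count eviction pays off.
        for (Consumer<Long> consumer : consumers) {
            consumer.resume();
        }
        for (Consumer<Long> consumer : consumers) {
            Message<Long> msg;
            while ((msg = consumer.receive(1, TimeUnit.SECONDS)) != null) {
                consumer.acknowledge(msg);
            }
        }
    }
}
```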
Another scenario here: what if a consumer never resumes? In other words, are there any negative effects after a user decides to take offline the consumer that has the largest backlog?
The unnecessarily cached entries will expire per the configured managedLedgerCacheEvictionTimeThresholdMillis. When a cached entry has remaining read counts, it will be deprioritized when evicting entries by size as the cache fills up. The current changes in this PR only go as far as skipping entries with remaining read counts when evicting entries by size. The idea is to then start a new scan of the entries, evicting entries with one additional remaining read count, until the cache size is under its limits. This matters when the eviction time is long and the cache fills up very quickly under high throughput on the server. It could be possible to optimize the eviction further to be more efficient in these cases.
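A rough sketch of that multi-pass eviction idea (the PR itself currently only skips such entries; the structure and names here are illustrative assumptions):

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the multi-pass eviction idea: each pass allows evicting entries
// with one more remaining expected read, until the cache fits its limit.
class EvictionSketch {
    // insertion-ordered map from entry id to remaining expected read count;
    // an illustrative stand-in for the real cache structure
    final Map<Long, Integer> remainingReads = new LinkedHashMap<>();
    long cacheSizeBytes;
    final long entrySizeBytes = 1024; // illustrative fixed entry size

    void evictBySize(long maxSizeBytes) {
        int allowedRemainingReads = 0;
        while (cacheSizeBytes > maxSizeBytes && !remainingReads.isEmpty()) {
            Iterator<Map.Entry<Long, Integer>> it = remainingReads.entrySet().iterator();
            while (it.hasNext() && cacheSizeBytes > maxSizeBytes) {
                // First pass evicts only fully-read entries; each later pass
                // also evicts entries with one more remaining expected read.
                if (it.next().getValue() <= allowedRemainingReads) {
                    it.remove();
                    cacheSizeBytes -= entrySizeBytes;
                }
            }
            allowedRemainingReads++; // relax the threshold and scan again
        }
    }
}
```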
Yes, it will degenerate into an expiry-only strategy. Maybe we can optimize it in the future!

An awesome example for learning how to test the cache in Pulsar! Just leaving a comment here for future reference.
```java
@Test(groups = "broker-api")
@Slf4j
public class BrokerEntryCacheTest extends ProducerConsumerBase {
```
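For reference, tests extending ProducerConsumerBase usually wire setup and cleanup like the following sketch; this is the common pattern in the Pulsar test suite (assuming the base class org.apache.pulsar.client.api.ProducerConsumerBase from the broker test sources), not necessarily this PR's exact test:

```java
import org.apache.pulsar.client.api.ProducerConsumerBase; // broker test-scope base class
import org.testng.annotations.AfterClass;
import org.testng.annotations.BeforeClass;
import org.testng.annotations.Test;

@Test(groups = "broker-api")
public class BrokerEntryCacheTestSketch extends ProducerConsumerBase {

    @BeforeClass(alwaysRun = true)
    @Override
    protected void setup() throws Exception {
        super.internalSetup();     // start the embedded broker used by the test
        super.producerBaseSetup(); // create the default tenant/namespace for producers
    }

    @AfterClass(alwaysRun = true)
    @Override
    protected void cleanup() throws Exception {
        super.internalCleanup();   // shut the embedded broker down
    }
}
```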
The test seems valuable. Will you push this test to the PR? Would it block the CI tests if we push it? How do we manage or organize benchmark test code in Pulsar?
This is still experimental. I hope to have useful tests covering typical broker cache scenarios, but this test isn't there yet. A rolling restart of brokers and a rolling restart of applications (consumers) are more typical scenarios, and those should be covered with various subscription types, since they bring out the inefficiencies in the current broker cache implementations. Mixed consumer speeds for Key_Shared is another case that makes broker cache issues show up.
Splits cacheEvictionByExpectedReadCount solution from #207, rebased on apache#24363