[Bug] Memory Leak In Netty Recycler of Bookie Client #24355

Closed
@TakaHiR07

Description

Search before reporting

  • I searched in the issues and found nothing similar.

Read release policy

  • I understand that unsupported versions don't get bug fixes. I will attempt to reproduce the issue on a supported version of Pulsar client and Pulsar broker.

User environment

Both pulsar-2.9.x and pulsar-3.0.x have this memory leak.

Issue Description

After the broker has been running for a long time, its heap memory usage and ZGC time keep increasing.
A heap dump showed that the cause is the Netty recycler used to cache bookie client objects: the memory held by the recycler keeps growing.

The heap dump shows a very large number of LocalPool instances in a single FastThreadLocalThread, and the consumerBuffer of a single LocalPool holds a very large number of references.

Our setting is io.netty.recycler.maxCapacityPerThread=1024. The number of PerChannelBookieClient instances is 16 * 500 = 8000, where 16 is the broker default config value and 500 is the number of bookies. Changing the setting to io.netty.recycler.maxCapacityPerThread=0 fixes the memory leak, but write and read performance decreases. -Dpulsar.allocator.leak_detection=Advanced and -Dio.netty.leakDetectionLevel=PARANOID are set, and nothing is logged.

Error messages


Reproducing the issue

Keep the broker running and start a perf produce process at a high QPS. Normal throughput is enough to reproduce the issue.

Issue Analysis

The root cause is that each PerChannelBookieClient has its own recyclers, so a single broker creates a very large number of recyclers.

For our cluster, the number of recyclers is 16 * 500 * 2 = 16000: 16 is the bookkeeperNumberOfChannelsPerBookie setting in broker.conf, 500 is the number of bookies in the cluster, and 2 corresponds to the two recyclers in the bookie client, one for AddCompletion and one for EntryCompletionKey.

All PerChannelBookieClient instances share the same thread pool, BookieClientWorker. Its thread count equals the number of CPU cores, which is 32.

Therefore, the maximum number of objects cached in one broker's recyclers is 16000 * 32 * 1024 = 524288000 (1024 is io.netty.recycler.maxCapacityPerThread). If one object is 300 bytes, fully populated recyclers take about 150 GB. That is why the memory leak occurs.
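The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the class and method names are made up for illustration, and the inputs are the figures quoted in this issue (16 channels per bookie, 500 bookies, 2 recyclers per client, 32 worker threads, maxCapacityPerThread=1024, an assumed 300 bytes per object).

```java
// Back-of-the-envelope bound using only the figures quoted in this issue.
// RecyclerBound and its methods are hypothetical names, not part of any library.
final class RecyclerBound {

    /** Upper bound on objects cached across all recyclers and all worker threads. */
    static long maxCachedObjects(long channelsPerBookie, long bookies,
                                 long recyclersPerClient, long workerThreads,
                                 long maxCapacityPerThread) {
        long recyclers = channelsPerBookie * bookies * recyclersPerClient; // 16000 here
        return recyclers * workerThreads * maxCapacityPerThread;
    }

    /** Bytes retained if every cached object has the given (assumed) size. */
    static long maxCachedBytes(long objects, long bytesPerObject) {
        return objects * bytesPerObject;
    }

    public static void main(String[] args) {
        long objects = maxCachedObjects(16, 500, 2, 32, 1024);
        long bytes = maxCachedBytes(objects, 300);
        System.out.println("objects = " + objects); // objects = 524288000
        System.out.println("bytes   = " + bytes);   // bytes   = 157286400000
    }
}
```

157286400000 bytes is roughly 157 GB (about 146 GiB), which matches the "about 150 GB" figure. This is a worst-case bound that assumes every per-thread pool of every recycler fills to capacity.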

More precisely, the root cause is in the BookKeeper client: the recyclers in PerChannelBookieClient are not static, so far too many recyclers are created. The more bookies in the cluster, or the more ledgers created in the cluster, the more easily broker memory grows.
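To illustrate why per-instance recyclers retain so much more than a shared static one, here is a minimal single-threaded simulation. It does not use Netty: TinyRecycler is a hypothetical, heavily simplified stand-in for a recycler with a bounded per-thread pool, and the "clients" are just loop iterations. It is a sketch of the effect described above, not the actual BookKeeper code or its fix.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

final class SharedRecyclerSketch {

    /** Hypothetical stand-in for a recycler: one bounded object pool per thread. */
    static final class TinyRecycler {
        private final int maxCapacityPerThread;
        private final AtomicLong retained; // counts objects currently held in pools
        private final ThreadLocal<ArrayDeque<Object>> pool =
                ThreadLocal.withInitial(ArrayDeque::new);

        TinyRecycler(int maxCapacityPerThread, AtomicLong retained) {
            this.maxCapacityPerThread = maxCapacityPerThread;
            this.retained = retained;
        }

        /** Reuse a pooled object if one exists, otherwise allocate a new one. */
        Object get() {
            Object pooled = pool.get().poll();
            if (pooled == null) {
                return new Object();
            }
            retained.decrementAndGet();
            return pooled;
        }

        /** Return an object to this thread's pool unless the pool is full. */
        void recycle(Object o) {
            ArrayDeque<Object> q = pool.get();
            if (q.size() < maxCapacityPerThread) {
                q.push(o);
                retained.incrementAndGet();
            }
        }
    }

    /** Each client does one burst of `capacity` in-flight objects, then recycles them. */
    static long retainedAfterBurst(int clients, int capacity, boolean shared) {
        AtomicLong retained = new AtomicLong();
        TinyRecycler sharedRecycler = shared ? new TinyRecycler(capacity, retained) : null;
        List<TinyRecycler> alive = new ArrayList<>(); // clients are long-lived
        for (int c = 0; c < clients; c++) {
            TinyRecycler r = shared ? sharedRecycler : new TinyRecycler(capacity, retained);
            alive.add(r);
            List<Object> inFlight = new ArrayList<>();
            for (int i = 0; i < capacity; i++) {
                inFlight.add(r.get());
            }
            for (Object o : inFlight) {
                r.recycle(o);
            }
        }
        return retained.get();
    }

    public static void main(String[] args) {
        System.out.println("per-instance: " + retainedAfterBurst(100, 16, false)); // per-instance: 1600
        System.out.println("shared: " + retainedAfterBurst(100, 16, true));        // shared: 16
    }
}
```

With one recycler per client, the retained count scales with the number of clients (100 * 16 = 1600 in this run). With a single shared recycler it stays bounded by the per-thread capacity (16), because later clients reuse the objects that earlier clients returned.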

Are you willing to submit a PR?

  • I'm willing to submit a PR!
