ilixiaocui

What changes were proposed in this pull request?

We are currently running Ratis 2.4.0 in production at significant scale, where we have observed two recurring issues related to snapshot installation. Both are consistent with existing community reports (RATIS-2140, RATIS-2208).

Would it be possible, at your convenience, to backport the associated fixes to the 2.x maintenance branch? That would greatly help our team plan a stable production upgrade path while continuing to use this version.

We sincerely appreciate your guidance on this matter and remain grateful for the community's ongoing stewardship of Ratis.

What is the link to the Apache JIRA

RATIS-2140
RATIS-2208

How was this patch tested?

@ilixiaocui
Author

@szetszwo Looking forward to your assistance.

@szetszwo
Contributor

@ilixiaocui, sure, we could backport bug fixes to branch-2.

Would you consider upgrading to the recent release 3.2.0?

@ilixiaocui
Author

Much appreciated!
Since we have dozens of production clusters that need to remain compatible during ongoing upgrades, moving to version 3.2.0 just isn't in the cards for the foreseeable future. Should we plan new clusters down the road, we'll consider upgrading them holistically when we do.

@szetszwo
Contributor

@ilixiaocui, could you select the list of commits you would like to backport? I could merge them to branch-2.

@ilixiaocui
Author

ilixiaocui commented Jul 23, 2025

The bugs triggered in the production environment are related to the following two issues. The corresponding commit IDs are based on the ratis-3.2.0 release:

RATIS-2140 related
2e7cb45

RATIS-2208 related
2c4e354
cf893f6
337df17
17ca6f4
5d3476f

Thank you again for your assistance! @szetszwo


In addition, the ratis-3.0.0 release notes summarize many bug fixes from the 2.x series. Would you consider backporting these fixes to the 2.x branch as well? The corresponding commit IDs are based on the ratis-3.2.0 release:

RATIS-2116: 6390a28
RATIS-1909: b7ffa1b
RATIS-1895: d461a01
RATIS-1902: 4c8ef9d
RATIS-1912: c35f769
RATIS-1858: 5c47d3b
RATIS-1804: 9535259
RATIS-1883: b8ce6d1
RATIS-1920: 5a8519e
RATIS-1928: 1b05bfc
RATIS-1705: 95b51e5
RATIS-1887: 05f3922
RATIS-1890: be28b39
RATIS-1893: 0e136f3
RATIS-1884: a483bd4
RATIS-872: 22cbefa
RATIS-1916: 7015ba2

@szetszwo
Contributor

szetszwo commented Jul 24, 2025

@ilixiaocui, I tried merging the list, but some of the commits (the ones commented out below) have serious conflicts. Let me see how to resolve them.

git cherry-pick 5c47d3b4cafffa8e2bc21276f302d70efbbed5a9 #RATIS-1858. Follower keeps logging first election timeout. (#894)

git cherry-pick 95b51e512ffa3d0798607b82f8b474649413f2bd #RATIS-1705. Fix metrics leak (#744)

git cherry-pick a6719dc63eb90cc6bdc622a0824101945e746475 #RATIS-1873
git cherry-pick a483bd4bf015b5b368215e0d622ff43ed317b0c7 #RATIS-1884. Fix retry cache warning condition (#915)

git cherry-pick b8ce6d1f6ea37ed3ff9f6e888d2357fe48490567 #RATIS-1883. Next Index should be always larger than Match Index in GrpcLogAppender (#914)
git cherry-pick 05f39221102abc00b2934e279da872d06f6a1811 #RATIS-1887. Gap between segement log (#919)
git cherry-pick be28b3907f4fee8957fb2824770e4925364d0a8f #RATIS-1890. SegmentedRaftLogCache#shouldEvict should only iterate over closed segments once (#921)
git cherry-pick 0e136f39123dc65a07a41c7146ea0e91f0fe1fa7 #RATIS-1893. In SegmentedRaftLogCache, start a daemon thread to checkAndEvictCache. (#924)
git cherry-pick d461a01a53e7e130f0ec4143e75b316012137b62 #RATIS-1895. IllegalStateException: Failed to updateIncreasingly for nextIndex. (#926)

git cherry-pick 8a74dc256c875b46025e24d1d9c9de8e8379a53c #RATIS-1886

git cherry-pick 4c8ef9db16e32d13a1eb07fce12a7563b830a2da #RATIS-1902. The snapshot index is set incorrectly in InstallSnapshotReplyProto. (#933)
git cherry-pick b7ffa1ba1e3e7cecd9ea687f72425c2ffd5b1c34 #RATIS-1909. Fix Decreasing Next Index When GrpcLogAppender Reset Client. (#939)
git cherry-pick 5a8519ee6cc40abb999d07154c4c2d12320c2da1 #RATIS-1920. NPE in AppendLogResponseHandler. (#952)
git cherry-pick 7015ba2f274394697dffec417b43374656077d88 #RATIS-1916. OrderAsync does not call handReply. (#948)

# git cherry-pick 22cbefa2c11c3471d2f763ccb4251806ed3529f5 #RATIS-872. Invalidate replied calls in retry cache. (#942)

git cherry-pick c35f769f513609d808ab1cc91c5323d9ff30f636 #RATIS-1912. Fix infinity election when perform membership change. (#954)
git cherry-pick 95352591005a1bf867f9aac9f9c0b337741181e3 #RATIS-1804. Change the default number of outstanding append entires. (#838)
git cherry-pick 1b05bfcc76e4f3007d389dc52ee0305b9fff8e41 #RATIS-1928. Join the LogAppenders when closing the server. (#959)

# git cherry-pick 6390a28bdf1d2c454d49a11dca117e5bbc482f54 #RATIS-2116. Fix the issue where RaftServerImpl.appendEntries may be blocked indefinitely (#1116)

git cherry-pick 2e7cb458ca6a10b4c38cafca7e8eee8a8e7fcef1 #RATIS-2140. Thread wait when installing snapshot. (#1137)
# git cherry-pick 2c4e354f133a44b971837ea33b5f89d62302cb63 #RATIS-2232. Improve log for debugging on RaftLog / TransactionManager (#1203)
git cherry-pick 337df17c7ea27fbaac9f5f82f8557dc815830d7c #RATIS-2234. Remove lock race between heartbeat and append log channels (#1205)

git cherry-pick cf893f64906df82908fcc43aed2d575e52f7a174 #RATIS-2233. make NOPROGRESS timeout configurable (#1204)
# git cherry-pick 17ca6f41d0a577de2ecb452368c1a38b0c63d8b7 #RATIS-2235. Allow only one thread to perform appendLog  (#1206)
# git cherry-pick 5d3476f27650c13e94d6bbe5ccbfbc7ca4712eea #RATIS-2242. change consistency criteria of heartbeat during appendLog (#1215)
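
A triage like the list above could be scripted along the following lines. This is only a sketch, not how the list was actually produced: `try_picks` is a hypothetical helper name, and it assumes it is run on a scratch branch created from branch-2, with one commit ID per line on standard input.

```shell
# Sketch only: try each commit on the current branch and report which ones
# fail to apply. try_picks is a hypothetical name, not part of git or Ratis.
try_picks() {
  local sha rest
  while read -r sha rest; do
    # Skip blank lines and lines that are commented out with '#'.
    case "$sha" in ''|'#'*) continue ;; esac
    # -x records the original commit ID in the new commit message.
    if git cherry-pick -x "$sha" </dev/null >/dev/null 2>&1; then
      echo "ok $sha"
    else
      # Restore a clean tree before trying the next commit.
      git cherry-pick --abort </dev/null >/dev/null 2>&1
      echo "failed $sha"
    fi
  done
}
```

For example, `printf '%s\n' 2e7cb45 337df17 cf893f6 | try_picks`. Successful picks stay applied, so the order of the input matters, and a "failed" line may mean a conflict or any other cherry-pick error.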

@ilixiaocui
Author

Appreciate it again.

@szetszwo
Contributor

szetszwo commented Aug 3, 2025

@ilixiaocui, sorry that I was not able to check the conflicts. I should be able to check them sometime next week. In the meantime, please see if you could find the dependent commits needed to resolve the conflicts.

If you have a tight deadline, please feel free to share it. I would try my best to accommodate it.

@ilixiaocui
Author

Thanks again for your reply!

Could you please help cherry-pick these two sets of commits that are already causing issues?

RATIS-2140 related
2e7cb45

RATIS-2208 related
2c4e354
cf893f6
337df17
17ca6f4
5d3476f

The other issues haven’t been directly encountered in our production environment. There’s no urgency on timing—one or two weeks is completely fine.

@szetszwo
Contributor

szetszwo commented Aug 4, 2025

RATIS-2208 related
2c4e354
cf893f6
337df17
17ca6f4
5d3476f

@ilixiaocui, the first and the last two commits have serious conflicts. We need to find out which commits they depend on.

# git cherry-pick 2c4e354f133a44b971837ea33b5f89d62302cb63 #RATIS-2232. Improve log for debugging on RaftLog / TransactionManager (#1203)
git cherry-pick 337df17c7ea27fbaac9f5f82f8557dc815830d7c #RATIS-2234. Remove lock race between heartbeat and append log channels (#1205)

git cherry-pick cf893f64906df82908fcc43aed2d575e52f7a174 #RATIS-2233. make NOPROGRESS timeout configurable (#1204)
# git cherry-pick 17ca6f41d0a577de2ecb452368c1a38b0c63d8b7 #RATIS-2235. Allow only one thread to perform appendLog  (#1206)
# git cherry-pick 5d3476f27650c13e94d6bbe5ccbfbc7ca4712eea #RATIS-2242. change consistency criteria of heartbeat during appendLog (#1215)
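
One possible way to look for the dependent commits is to list the commits that touch the same files as a conflicting commit and have no patch-equivalent on the target branch. The sketch below is an assumption about how this could be done, not the method used in this thread; `candidate_deps` is a hypothetical helper name.

```shell
# Sketch only: print candidate dependencies of a commit that failed to apply.
# candidate_deps is a hypothetical name, not part of git or Ratis.
candidate_deps() {
  target="$1"   # branch being backported to, e.g. branch-2
  sha="$2"      # the commit that failed to cherry-pick
  # Files changed by the conflicting commit, one path per line.
  files=$(git diff-tree --no-commit-id --name-only -r "$sha") || return 1
  [ -n "$files" ] || return 0
  # Commits reachable from $sha but not from $target, minus those with a
  # patch-equivalent on $target, limited to the files listed above.
  # Word splitting on $files is intentional; paths with spaces need more care.
  git log --oneline --no-merges --cherry-pick --right-only \
    "$target...$sha" -- $files
}
```

For example, `candidate_deps branch-2 17ca6f4`. The output includes the commit itself and is only a list of candidates: a backport that was edited to resolve conflicts is no longer patch-equivalent, so it would still be listed.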
