Description
The bug is in the following lines of code:
github-api/src/main/java/org/kohsuke/github/GitHubClient.java
Lines 47 to 52 in 1cb9e66
The client retries only twice, with 200 ms between retries.
Why this is a bug
The GitHub Branch Source (GHBS) Jenkins plugin occasionally drops builds that should have been triggered by received webhooks. The GHBS plugin relies on the GitHub plugin, which relies on the github-api plugin, which provides this library as a client. Here is an exception from the multibranch pipeline events log:
[Mon Oct 16 14:31:28 GMT 2023] Received Push event to branch master in repository REDACTED UPDATED event from REDACTED ⇒ https://jenkins-webhooks.REDACTED.com/github-webhook/ with timestamp Mon Oct 16 14:31:22 GMT 2023
14:31:26 Connecting to https://api.github.com using GitHub app
ERROR: Ran out of retries for URL: https://api.github.com/repos/REDACTED
org.kohsuke.github.GHIOException: Ran out of retries for URL: https://api.github.com/repos/REDACTED
at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:456)
at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:403)
at org.kohsuke.github.Requester.fetch(Requester.java:85)
at org.kohsuke.github.GHRepository.read(GHRepository.java:145)
at org.kohsuke.github.GitHub.getRepository(GitHub.java:684)
at org.jenkinsci.plugins.github_branch_source.GitHubSCMSource.retrieve(GitHubSCMSource.java:1005)
at jenkins.scm.api.SCMSource._retrieve(SCMSource.java:372)
at jenkins.scm.api.SCMSource.fetch(SCMSource.java:326)
at jenkins.branch.MultiBranchProject$SCMEventListenerImpl.processHeadUpdate(MultiBranchProject.java:1614)
at jenkins.branch.MultiBranchProject$SCMEventListenerImpl.onSCMHeadEvent(MultiBranchProject.java:1218)
at jenkins.scm.api.SCMHeadEvent$DispatcherImpl.fire(SCMHeadEvent.java:246)
at jenkins.scm.api.SCMHeadEvent$DispatcherImpl.fire(SCMHeadEvent.java:229)
at jenkins.scm.api.SCMEvent$Dispatcher.run(SCMEvent.java:545)
at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
How many retries should it be?
In practice, after rolling out this patch in another plugin (one that does not use this library but does interact with GitHub), I've seen GitHub API calls retried 28 times over the course of 1 minute. The retry limit there was 30, with a random sleep of 1000 to 3000 ms between retries. I've since increased the retry cap to 60 in my production system for that plugin.
Ideal solution
You can keep the same default retry count, but I would like it to be tunable by the user via system properties on the fly. That means static fields should be used only as default values, and the properties should be read each time they are needed.
Integer minInterval = Integer.getInteger(GitHubClient.class.getName() + ".minRetryInterval", retryTimeoutMillis);
Integer maxInterval = Integer.getInteger(GitHubClient.class.getName() + ".maxRetryInterval", retryTimeoutMillis) + 1;
Integer retryLimit = Integer.getInteger(GitHubClient.class.getName() + ".retryLimit", CONNECTION_ERROR_RETRIES);
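To illustrate what "on the fly" means for the property lookups above, here is a minimal, self-contained sketch. The class name `RetrySettings`, the `PREFIX` constant, and the default values are assumptions made for illustration (the defaults mirror the "twice, 200 ms" behavior described in this issue); none of this is the library's actual code. Because `Integer.getInteger` is called on every lookup rather than cached in a static field, a property changed at runtime takes effect on the next call.

```java
class RetrySettings {
    // Assumed defaults for illustration, mirroring "retry twice, 200 ms apart".
    static final int DEFAULT_RETRY_LIMIT = 2;
    static final int DEFAULT_INTERVAL_MILLIS = 200;

    // Assumed property prefix; the issue proposes GitHubClient.class.getName().
    static final String PREFIX = "org.kohsuke.github.GitHubClient";

    // Each accessor re-reads the system property, so nothing is cached.
    static int retryLimit() {
        return Integer.getInteger(PREFIX + ".retryLimit", DEFAULT_RETRY_LIMIT);
    }

    static int minRetryInterval() {
        return Integer.getInteger(PREFIX + ".minRetryInterval", DEFAULT_INTERVAL_MILLIS);
    }

    static int maxRetryInterval() {
        return Integer.getInteger(PREFIX + ".maxRetryInterval", DEFAULT_INTERVAL_MILLIS);
    }

    public static void main(String[] args) {
        // Falls back to the default when the property is not set.
        System.out.println(retryLimit());
        // Simulates an operator changing the value at runtime.
        System.setProperty(PREFIX + ".retryLimit", "60");
        // The new value is visible without a restart.
        System.out.println(retryLimit());
    }
}
```

Started without any `-D` flags, this should print the default first and 60 afterwards. In Jenkins, the same change could presumably be made from the script console with `System.setProperty`.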
The sleep between retries should also be random instead of a fixed value.
// import java.util.concurrent.ThreadLocalRandom
private static void logRetryConnectionError(IOException e, URL url, int retries) throws IOException {
    // Read the bounds on every call so they can be tuned at runtime via system properties.
    int minInterval = Integer.getInteger(GitHubClient.class.getName() + ".minRetryInterval", retryTimeoutMillis);
    // nextLong's upper bound is exclusive, so add 1 to make maxRetryInterval inclusive.
    int maxInterval = Integer.getInteger(GitHubClient.class.getName() + ".maxRetryInterval", retryTimeoutMillis) + 1;
    // Note: nextLong throws IllegalArgumentException if minRetryInterval > maxRetryInterval.
    long sleepyTime = ThreadLocalRandom.current().nextLong(minInterval, maxInterval);
    // There are a range of connection errors where we want to wait a moment and just automatically retry
    LOGGER.log(INFO,
            e.getMessage() + " while connecting to " + url + ". Sleeping " + sleepyTime
                    + " milliseconds before retrying... ; will try " + retries + " more time(s)");
    try {
        Thread.sleep(sleepyTime);
    } catch (InterruptedException ie) {
        throw (IOException) new InterruptedIOException().initCause(e);
    }
}
These settings should have sane defaults but allow clients to tune them on the fly. A random delay between retries is a cloud best practice for interacting with distributed systems. See also https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
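As a rough illustration of the retry behavior requested here, the following standalone sketch retries a failing call with a uniformly random delay between attempts. It is not the library's implementation: `JitteredRetry`, `IoCall`, `randomDelay`, and `callWithRetries` are hypothetical names, and the sketch uses a plain uniform delay (as proposed in this issue) rather than the exponential backoff variants discussed in the linked article.

```java
import java.io.IOException;
import java.io.InterruptedIOException;
import java.util.concurrent.ThreadLocalRandom;

class JitteredRetry {
    /** Hypothetical stand-in for a request that may fail with a connection error. */
    interface IoCall<T> {
        T call() throws IOException;
    }

    /** Picks a uniformly random delay in [minMillis, maxMillis], both inclusive. */
    static long randomDelay(long minMillis, long maxMillis) {
        if (maxMillis < minMillis) {
            throw new IllegalArgumentException("maxMillis must be >= minMillis");
        }
        // The upper bound of nextLong is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minMillis, maxMillis + 1);
    }

    /** Runs the request, retrying up to retryLimit times on IOException. */
    static <T> T callWithRetries(IoCall<T> request, int retryLimit, long minMillis, long maxMillis)
            throws IOException {
        int retriesLeft = retryLimit;
        while (true) {
            try {
                return request.call();
            } catch (IOException e) {
                if (retriesLeft <= 0) {
                    throw new IOException("Ran out of retries", e);
                }
                retriesLeft--;
                try {
                    Thread.sleep(randomDelay(minMillis, maxMillis));
                } catch (InterruptedException ie) {
                    // Restore the interrupt flag before giving up.
                    Thread.currentThread().interrupt();
                    throw (IOException) new InterruptedIOException().initCause(ie);
                }
            }
        }
    }

    public static void main(String[] args) throws IOException {
        int[] attempts = {0};
        // Simulated request: fails twice, then succeeds on the third attempt.
        String result = callWithRetries(() -> {
            if (++attempts[0] < 3) {
                throw new IOException("simulated connection error");
            }
            return "ok after " + attempts[0] + " attempts";
        }, 5, 10, 30);
        System.out.println(result); // prints "ok after 3 attempts"
    }
}
```

Randomizing the delay spreads out retries from many clients that failed at the same moment, so they are less likely to hit the API again in lockstep.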