GitHub client is fragile with recent GitHub API flakiness #1728

Closed
@samrocketman

Description

The bug is in the following lines of code

/** The Constant CONNECTION_ERROR_RETRIES. */
static final int CONNECTION_ERROR_RETRIES = 2;
/**
* If timeout issues let's retry after milliseconds.
*/
static final int retryTimeoutMillis = 100;

It retries only twice, sleeping retryTimeoutMillis (100 ms) before each retry, so it gives up after roughly 200 ms of waiting in total.

Why this is a bug

The GitHub Branch Source (GHBS) Jenkins plugin occasionally drops builds that should have been triggered by received webhooks. The GHBS plugin relies on the GitHub plugin, which relies on the github-api plugin, which provides this library as its client. Here's an exception from multibranch pipeline events.

[Mon Oct 16 14:31:28 GMT 2023] Received Push event to branch master in repository REDACTED UPDATED event from REDACTED ⇒ https://jenkins-webhooks.REDACTED.com/github-webhook/ with timestamp Mon Oct 16 14:31:22 GMT 2023
14:31:26 Connecting to https://api.github.com using GitHub app
ERROR: Ran out of retries for URL: https://api.github.com/repos/REDACTED
org.kohsuke.github.GHIOException: Ran out of retries for URL: https://api.github.com/repos/REDACTED
	at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:456)
	at org.kohsuke.github.GitHubClient.sendRequest(GitHubClient.java:403)
	at org.kohsuke.github.Requester.fetch(Requester.java:85)
	at org.kohsuke.github.GHRepository.read(GHRepository.java:145)
	at org.kohsuke.github.GitHub.getRepository(GitHub.java:684)
	at org.jenkinsci.plugins.github_branch_source.GitHubSCMSource.retrieve(GitHubSCMSource.java:1005)
	at jenkins.scm.api.SCMSource._retrieve(SCMSource.java:372)
	at jenkins.scm.api.SCMSource.fetch(SCMSource.java:326)
	at jenkins.branch.MultiBranchProject$SCMEventListenerImpl.processHeadUpdate(MultiBranchProject.java:1614)
	at jenkins.branch.MultiBranchProject$SCMEventListenerImpl.onSCMHeadEvent(MultiBranchProject.java:1218)
	at jenkins.scm.api.SCMHeadEvent$DispatcherImpl.fire(SCMHeadEvent.java:246)
	at jenkins.scm.api.SCMHeadEvent$DispatcherImpl.fire(SCMHeadEvent.java:229)
	at jenkins.scm.api.SCMEvent$Dispatcher.run(SCMEvent.java:545)
	at jenkins.security.ImpersonatingScheduledExecutorService$1.run(ImpersonatingScheduledExecutorService.java:67)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

How many retries should it be?

In practice, I rolled out the patch linked below to another plugin that does not use this library but does interact with GitHub, and I've seen it retry the GitHub API 28 times over the course of 1 minute. Its retry limit was 30, and it slept a random interval between 1000 and 3000 ms before each retry. I've since increased the retry cap to 60 in my production system for this plugin.

jenkinsci/scm-filter-jervis-plugin@c21d0c1#diff-46da075f8e1e17ccf594a86bc96e8c8eaf295617b81d5ae5206634e37021bb49R119-R122
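
To put the numbers above side by side, here is a rough sketch of the total sleep budget each configuration allows. The class and method names are hypothetical, and the arithmetic ignores the time spent in the requests themselves, so treat the results as approximate.

```java
/** Rough wait-time budget for a retry loop that sleeps before each retry (illustrative only). */
final class RetryBudget {
    /** Shortest possible total sleep: every retry draws the minimum interval. */
    static long minTotalMillis(int retries, long minSleepMillis) {
        return retries * minSleepMillis;
    }

    /** Longest possible total sleep: every retry draws the maximum interval. */
    static long maxTotalMillis(int retries, long maxSleepMillis) {
        return retries * maxSleepMillis;
    }

    /** Expected total sleep when each interval is drawn uniformly between min and max. */
    static long expectedTotalMillis(int retries, long minSleepMillis, long maxSleepMillis) {
        return retries * (minSleepMillis + maxSleepMillis) / 2;
    }

    public static void main(String[] args) {
        // Library defaults quoted in this issue: 2 retries, fixed 100 ms sleep.
        System.out.println("library default: " + maxTotalMillis(2, 100) + " ms");
        // Settings from the other plugin: 30 retries, 1000 to 3000 ms random sleep.
        System.out.println("30 retries: " + minTotalMillis(30, 1000) + " to "
                + maxTotalMillis(30, 3000) + " ms, expected "
                + expectedTotalMillis(30, 1000, 3000) + " ms");
    }
}
```

With these inputs the library default allows about 200 ms of waiting, while the 30-retry configuration allows between 30 and 90 seconds. At an average of 2 seconds per sleep, 28 retries come to about 56 seconds, which lines up with the observed minute.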

Ideal solution

You can keep the same default retry values, but I would like them to be tunable by the user via system properties on the fly. That means the static constants should only serve as default values and the properties should be read at the time of each retry.

Integer minInterval = Integer.getInteger(GitHubClient.class.getName() + ".minRetryInterval", retryTimeoutMillis);
Integer maxInterval = Integer.getInteger(GitHubClient.class.getName() + ".maxRetryInterval", retryTimeoutMillis) + 1;
Integer retryLimit = Integer.getInteger(GitHubClient.class.getName() + ".retryLimit", CONNECTION_ERROR_RETRIES);
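
A minimal sketch of why reading the property on every call matters: a value set while the JVM is running takes effect on the next read, with no restart. The class name and the property names here are the ones proposed in this issue, not options the library currently supports.

```java
/** Illustrative holder for the proposed settings; not part of the library. */
final class TunableRetrySettings {
    // Assumed prefix: what GitHubClient.class.getName() would return for this library.
    static final String PREFIX = "org.kohsuke.github.GitHubClient";
    // Defaults mirror the static constants quoted at the top of this issue.
    static final int DEFAULT_RETRIES = 2;
    static final int DEFAULT_INTERVAL_MILLIS = 100;

    /** Read on every call, so a change to the system property is picked up immediately. */
    static int retryLimit() {
        return Integer.getInteger(PREFIX + ".retryLimit", DEFAULT_RETRIES);
    }

    static int minRetryInterval() {
        return Integer.getInteger(PREFIX + ".minRetryInterval", DEFAULT_INTERVAL_MILLIS);
    }

    static int maxRetryInterval() {
        return Integer.getInteger(PREFIX + ".maxRetryInterval", DEFAULT_INTERVAL_MILLIS);
    }

    public static void main(String[] args) {
        // Falls back to the default when the property is not set.
        System.out.println("before: " + retryLimit());
        // Simulate an operator changing the value at runtime.
        System.setProperty(PREFIX + ".retryLimit", "60");
        System.out.println("after: " + retryLimit());
    }
}
```

Integer.getInteger also falls back to the default when the property value cannot be parsed as an integer, so a typo in the setting does not break the client.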

And for sleeping between retries I would like it to be random instead of a fixed value.

// import java.util.concurrent.ThreadLocalRandom
    private static void logRetryConnectionError(IOException e, URL url, int retries) throws IOException {
        // Read the bounds on every call so they can be tuned at runtime; the static field is only the default.
        int minInterval = Integer.getInteger(GitHubClient.class.getName() + ".minRetryInterval", retryTimeoutMillis);
        // Never let the upper bound fall below the lower bound; + 1 because nextInt excludes the bound.
        int maxInterval = Math.max(minInterval,
                Integer.getInteger(GitHubClient.class.getName() + ".maxRetryInterval", retryTimeoutMillis)) + 1;
        int sleepyTime = ThreadLocalRandom.current().nextInt(minInterval, maxInterval);
        // There are a range of connection errors where we want to wait a moment and just automatically retry
        LOGGER.log(INFO,
                e.getMessage() + " while connecting to " + url + ". Sleeping " + sleepyTime
                        + " milliseconds before retrying... ; will try " + retries + " more time(s)");
        try {
            Thread.sleep(sleepyTime);
        } catch (InterruptedException ie) {
            throw (IOException) new InterruptedIOException().initCause(e);
        }
    }

These settings should have sane defaults but allow clients to tune them on the fly. A random delay between retries (jitter) is a cloud best practice for interacting with distributed systems. See also https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
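
For comparison, here is a sketch of a retry loop that combines jitter with exponential backoff, a variant often called "full jitter". This goes beyond the uniform random sleep proposed above, and the class, method, and parameter names are all hypothetical; it is not the library's implementation.

```java
import java.io.IOException;
import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

/** Illustrative retry helper: exponential backoff with full jitter (not library code). */
final class JitteredRetry {
    /** Uniform random delay in [0, min(capMillis, baseMillis * 2^attempt)]. */
    static long fullJitterDelayMillis(int attempt, long baseMillis, long capMillis) {
        // Clamp the shift so the doubling cannot overflow for large attempt counts.
        long ceiling = Math.min(capMillis, baseMillis << Math.min(attempt, 20));
        return ThreadLocalRandom.current().nextLong(ceiling + 1);
    }

    /** Calls the request, retrying on IOException up to retryLimit times. */
    static <T> T call(Callable<T> request, int retryLimit, long baseMillis, long capMillis)
            throws Exception {
        for (int attempt = 0;; attempt++) {
            try {
                return request.call();
            } catch (IOException e) {
                if (attempt >= retryLimit) {
                    throw e; // out of retries: surface the last connection error
                }
                Thread.sleep(fullJitterDelayMillis(attempt, baseMillis, capMillis));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        int[] calls = {0};
        // Simulated request that fails twice and then succeeds.
        String result = call(() -> {
            if (++calls[0] < 3) {
                throw new IOException("simulated connection error");
            }
            return "ok after " + calls[0] + " attempts";
        }, 5, 10, 100);
        System.out.println(result); // prints "ok after 3 attempts"
    }
}
```

The randomness spreads retries from many clients across the window instead of having them all hit the API at the same instant, and the cap keeps a long outage from producing unbounded sleeps.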
