-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Description
We're using the Microsoft.Extensions.Caching.Redis package, and with an aggressive syncTimeout configuration (f.ex. 10ms), we're having some issues where Redis sometimes crashes the host on startup, even with abortConnect=false and the standard 5s connectTimeout.
Logs look like this:
"level": "error",
"trace_id": "",
"message": "Hosting failed to start",
"context": {
"EventId": "{\"Id\":11,\"Name\":\"HostedServiceStartupFaulted\"}",
"SourceContext": "Microsoft.Extensions.Hosting.Internal.Host",
"Environment": "Development",
"exception": {
"message": "The message timed out in the backlog attempting to send because no connection became available (10ms) - Last Connection Exception: UnableToConnect on <REDACTED>-redis-headless.<REDACTED>.svc.cluster.local:6379/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.7.27.49176, command=ROLE, timeout: 10, inst: 0, qu: 1, qs: 0, aw: False, bw: Inactive, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 0, serverEndpoint: <REDACTED>-redis-headless.<REDACTED>.svc.cluster.local:6379, conn-sec: n/a, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: <REDACTED>(SE.Redis-v2.7.27.49176), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), POOL: (Threads=8,QueuedItems=0,CompletedItems=138,Timers=11), v: 2.7.27.49176 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts),
"stacktrace": " at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2105\n at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/RedisBase.cs:line 62\n at StackExchange.Redis.RedisServer.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/RedisServer.cs:line 738\n at StackExchange.Redis.RedisServer.Role(CommandFlags flags) in /_/src/StackExchange.Redis/RedisServer.cs:line 424\n at StackExchange.Redis.ConnectionMultiplexer.GetSentinelMasterConnection(ConfigurationOptions config, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs:line 237\n at StackExchange.Redis.ConnectionMultiplexer.SentinelPrimaryConnect(ConfigurationOptions configuration, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs:line 134\n at StackExchange.Redis.ConnectionMultiplexer.Connect(ConfigurationOptions configuration, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 685\n at
....
From what I can tell, the culpit is the ROLE command issued in GetSentinelMasterConnection which sometimes times out and throws a RedisTimeoutException which is not handled, and thus escapes the retry logic that should run until connectTimeout.
Proposed fix
I have a PR in my fork which reproduces and fixes the issue.
I found it hard to write a test that reliably reproduced the issue, so I simulated it by hardcoding a random delay for ROLE commands in Message.Complete().
The fix seems pretty simple - just catch the RedisTimeoutException in GetSentinelMasterConnection and allow the while loop to continue.
Let me know if you want me to create a PR here for the fix (without a test), or if you have any ideas on how to write a test for the scenario I can have a look at including that :)