Skip to content

ROLE timeout in GetSentinelMasterConnection #2938

@frekals

Description

@frekals

We're using the Microsoft.Extensions.Caching.Redis package, and with an aggressive syncTimeout configuration (f.ex. 10ms), we're having some issues where Redis sometimes crashes the host on startup, even with abortConnect=false and the standard 5s connectTimeout.

Logs look like this:

"level": "error",
  "trace_id": "",
  "message": "Hosting failed to start",
  "context": {
    "EventId": "{\"Id\":11,\"Name\":\"HostedServiceStartupFaulted\"}",
    "SourceContext": "Microsoft.Extensions.Hosting.Internal.Host",
    "Environment": "Development",
    "exception": {
      "message": "The message timed out in the backlog attempting to send because no connection became available (10ms) - Last Connection Exception: UnableToConnect on <REDACTED>-redis-headless.<REDACTED>.svc.cluster.local:6379/Interactive, Initializing/NotStarted, last: NONE, origin: BeginConnectAsync, outstanding: 0, last-read: 0s ago, last-write: 0s ago, keep-alive: 60s, state: Connecting, mgr: 10 of 10 available, last-heartbeat: never, global: 0s ago, v: 2.7.27.49176, command=ROLE, timeout: 10, inst: 0, qu: 1, qs: 0, aw: False, bw: Inactive, last-in: 0, cur-in: 0, sync-ops: 1, async-ops: 0, serverEndpoint: <REDACTED>-redis-headless.<REDACTED>.svc.cluster.local:6379, conn-sec: n/a, aoc: 0, mc: 1/1/0, mgr: 10 of 10 available, clientName: <REDACTED>(SE.Redis-v2.7.27.49176), IOCP: (Busy=0,Free=1000,Min=1,Max=1000), WORKER: (Busy=1,Free=32766,Min=8,Max=32767), POOL: (Threads=8,QueuedItems=0,CompletedItems=138,Timers=11), v: 2.7.27.49176 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts),
      "stacktrace": "   at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 2105\n   at StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/RedisBase.cs:line 62\n   at StackExchange.Redis.RedisServer.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server, T defaultValue) in /_/src/StackExchange.Redis/RedisServer.cs:line 738\n   at StackExchange.Redis.RedisServer.Role(CommandFlags flags) in /_/src/StackExchange.Redis/RedisServer.cs:line 424\n   at StackExchange.Redis.ConnectionMultiplexer.GetSentinelMasterConnection(ConfigurationOptions config, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs:line 237\n   at StackExchange.Redis.ConnectionMultiplexer.SentinelPrimaryConnect(ConfigurationOptions configuration, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.Sentinel.cs:line 134\n   at StackExchange.Redis.ConnectionMultiplexer.Connect(ConfigurationOptions configuration, TextWriter log) in /_/src/StackExchange.Redis/ConnectionMultiplexer.cs:line 685\n   at 
	  
	  ....

From what I can tell, the culpit is the ROLE command issued in GetSentinelMasterConnection which sometimes times out and throws a RedisTimeoutException which is not handled, and thus escapes the retry logic that should run until connectTimeout.

Proposed fix

I have a PR in my fork which reproduces and fixes the issue.

I found it hard to write a test that reliably reproduced the issue, so I simulated it by hardcoding a random delay for ROLE commands in Message.Complete().

The fix seems pretty simple - just catch the RedisTimeoutException in GetSentinelMasterConnection and allow the while loop to continue.

Let me know if you want me to create a PR here for the fix (without a test), or if you have any ideas on how to write a test for the scenario I can have a look at including that :)

Metadata

Metadata

Assignees

No one assigned

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions