-
Notifications
You must be signed in to change notification settings - Fork 889
Description
Describe the bug
In environments where YARP has to (re-)establish a lot of connections to the same domain and DNS resolution times are non-zero, DNS resolution times can become excessive leading to pathological behavior.
This is triggered by dotnet serializing all DNS requests for a domain under the assumption there is a local resolver cache that will be fast after the first query completes. This is generally not true in Linux/Kubernetes.
To Reproduce
- Get a DNS resolver with some latency (~5ms or so are sufficient)
- Add a backend forces reconnects
- Put sufficient load on the instance (e.g. >200rps @ 5ms)
- Observe DNS metrics. The time for resolution of the domain will keep climbing.
Further technical details
Found with YARP 2.3.0, dotnet 8 on AKS (Kubernetes, Linux). Probably not a problem on systems with built-in resolver caches, but should be true for any system where DNS has some latency. Especially in Kubernetes many queries are "slow" due to the way they setup the default search behavior for local domains. Ich checked an the serialization behavior seems to still be in place in dotnet 10 preview.
In our case the pathological behavior was triggered by a relatively short term backend overload triggering lots of reconnects. Due to the high rps this was sufficient to drive DNS resolution times over the connection timeout making it pretty much impossible for the instance to recover.
Imo it might make sense to provide some form of mitigation,workaround or guidance in YARP for this. I am also considering reporting this to dotnet runtime as I think it was written under the assumption of being in a windows environment with very fast local caching.