Issue Description
Currently, when a region entry in the TiDB region cache is marked as needSync but the client fails to fetch updated region information from PD, the client falls back to returning the stale region information instead of retrying or preserving the sync intent.
Relevant code:
client-go/internal/locate/region_cache.go
Lines 1541 to 1551 in e84f1a7

    } else if flags := r.resetSyncFlags(needReloadOnAccess | needDelayedReloadReady); flags > 0 {
        // load region when it be marked as need reload.
        observeLoadRegion(tag, r, expired, flags)
        // NOTE: we can NOT use c.loadRegionByID(bo, r.GetID()) here because the new region (loaded by id) is not
        // guaranteed to contain the key. (ref: https://github.com/tikv/client-go/pull/1299)
        lr, err := c.loadRegion(bo, key, isEndKey)
        if err != nil {
            // ignore error and use old region info.
            logutil.Logger(bo.GetCtx()).Error("load region failure",
                zap.String("key", redact.Key(key)), zap.Error(err),
                zap.String("encode-key", redact.Key(c.codec.EncodeRegionKey(key))))

This fallback mechanism works well in many scenarios: it helps ensure availability and allows some queries to proceed during transient PD issues.
However, it can cause higher error rates under certain conditions, especially for workloads with:
- Strict max_execution_time settings.
- TiKV store issues (e.g., EBS latency spikes or partial inaccessibility).
- Many regions marked as needSync.
In these cases, requests may use stale region information, hit TiKV RPC errors, and then time out or fail without triggering region cache invalidation.
For example, see:
client-go/internal/locate/replica_selector.go
Lines 488 to 492 in e84f1a7

    if leader == nil {
        // The region may be during transferring leader.
        err = bo.Backoff(retry.BoRegionScheduling, errors.Errorf("no leader, ctx: %v", ctx))
        return err == nil, err
    }

If the region cache entry was already marked as needSync and the reload failed, returning the old region info without setting the needSync flag again means that subsequent requests keep using the stale region info without attempting another reload. This results in unnecessarily elevated error rates and latency.
Proposed Enhancement
If a region cache entry is already marked as needSync, and the reload fails, the cache should retain or re-set the needSync flag. This ensures that subsequent requests are aware the region info is stale.
Impact
This change maintains the benefit of availability during PD glitches, while also improving stability and staleness visibility when store-level issues exist.
- Reduces prolonged exposure to stale region data during store issues.
- Improves query reliability under constrained timeout conditions.
- Helps avoid unnecessary latency spikes and error accumulation.
- Keeps existing behavior intact for transient PD unavailability.