Preserve needSync Flag on Region Cache Reload Failure to Reduce Error Spikes During Store Issues #1647

### Issue Description
Currently, when a region entry in the TiDB region cache is marked as needSync and the attempt to fetch updated region information from PD fails, the client falls back to returning the stale region information. It neither retries the reload nor preserves the sync intent.
Relevant code:
https://github.com/tikv/client-go/blob/e84f1a780fa63c25b76b8813eee4d587904b8221/internal/locate/region_cache.go#L1541-L1551

This fallback mechanism works well in many scenarios: it preserves availability and allows some queries to proceed during transient PD issues.

However, it can cause higher error rates under certain conditions, especially for workloads with:

- Strict max_execution_time settings.
- TiKV store issues (e.g., EBS latency spikes or partial inaccessibility).
- Many regions marked as needSync.

In these cases, requests may access stale region information, hit TiKV RPC errors, and then **time out** or **fail without triggering region cache invalidation**.
For example, see:
https://github.com/tikv/client-go/blob/e84f1a780fa63c25b76b8813eee4d587904b8221/internal/locate/replica_selector.go#L488-L492

If the region cache entry was already marked as needSync and the reload failed, returning the old region info **without setting the needSync flag again** can lead to **repeated access to stale region info for subsequent requests**. This unnecessarily raises error rates and latency.

### Proposed Enhancement
If a region cache entry is **already marked as needSync**, and the reload fails, the cache should **retain or re-set the needSync flag**. This ensures that subsequent requests are aware the region info is stale.
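
The proposal can be sketched in a few lines of Go. This is a minimal, self-contained illustration and not the client-go implementation: the `region` type, `lookup`, `setSyncFlags`, `flags`, and `errPDUnavailable` are hypothetical, and while `resetSyncFlags`, `needReloadOnAccess`, and `needDelayedReloadReady` borrow names from the linked code, their bodies and values here are assumptions.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Flag bits named after the constants in the linked snippet; the values are assumed.
const (
	needReloadOnAccess int32 = 1 << iota
	needDelayedReloadReady
)

// errPDUnavailable is a hypothetical error standing in for a failed PD lookup.
var errPDUnavailable = errors.New("pd unavailable")

// region is a simplified stand-in for a cached region entry.
type region struct {
	syncFlags int32 // accessed atomically
	version   int
}

// flags returns the current sync flags.
func (r *region) flags() int32 { return atomic.LoadInt32(&r.syncFlags) }

// resetSyncFlags atomically clears the bits in mask and returns those that were set.
func (r *region) resetSyncFlags(mask int32) int32 {
	for {
		old := atomic.LoadInt32(&r.syncFlags)
		if old&mask == 0 {
			return 0
		}
		if atomic.CompareAndSwapInt32(&r.syncFlags, old, old&^mask) {
			return old & mask
		}
	}
}

// setSyncFlags atomically sets the given bits.
func (r *region) setSyncFlags(bits int32) {
	for {
		old := atomic.LoadInt32(&r.syncFlags)
		if old&bits == bits || atomic.CompareAndSwapInt32(&r.syncFlags, old, old|bits) {
			return
		}
	}
}

// lookup returns the region to use. load is a hypothetical stand-in for the PD fetch.
func lookup(r *region, load func() (*region, error)) *region {
	flags := r.resetSyncFlags(needReloadOnAccess | needDelayedReloadReady)
	if flags == 0 {
		return r // nothing to reload
	}
	fresh, err := load()
	if err != nil {
		// Proposed behavior: still fall back to the old entry, but restore
		// the cleared bits so the next access attempts the reload again.
		r.setSyncFlags(flags)
		return r
	}
	return fresh
}

func main() {
	stale := &region{version: 1}
	stale.setSyncFlags(needReloadOnAccess)

	// Reload fails: the old entry is returned and the flag stays set.
	got := lookup(stale, func() (*region, error) { return nil, errPDUnavailable })
	fmt.Println(got.version, stale.flags() != 0)

	// Reload succeeds: the fresh entry is returned and the flag is cleared.
	got = lookup(stale, func() (*region, error) { return &region{version: 2}, nil })
	fmt.Println(got.version, stale.flags() != 0)
}
```

Restoring exactly the bits that were cleared, rather than setting a fixed flag, keeps the entry in the state it had before the failed reload, so the fallback path for transient PD unavailability is unchanged apart from the retained sync intent.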


### Impact
This change maintains the benefit of availability during PD glitches, while also improving stability and staleness visibility when store-level issues exist.

- Reduces prolonged exposure to stale region data during store issues.
- Improves query reliability under constrained timeout conditions.
- Helps avoid unnecessary latency spikes and error accumulation.
- Keeps existing behavior intact for transient PD unavailability.


Snippet referenced under "Relevant code" (`internal/locate/region_cache.go#L1541-L1551`):

    } else if flags := r.resetSyncFlags(needReloadOnAccess | needDelayedReloadReady); flags > 0 {
        // load region when it be marked as need reload.
        observeLoadRegion(tag, r, expired, flags)
        // NOTE: we can NOT use c.loadRegionByID(bo, r.GetID()) here because the new region (loaded by id) is not
        // guaranteed to contain the key. (ref: https://github.com/tikv/client-go/pull/1299)
        lr, err := c.loadRegion(bo, key, isEndKey)
        if err != nil {
            // ignore error and use old region info.
            logutil.Logger(bo.GetCtx()).Error("load region failure",
                zap.String("key", redact.Key(key)), zap.Error(err),
                zap.String("encode-key", redact.Key(c.codec.EncodeRegionKey(key))))

Snippet referenced as the example (`internal/locate/replica_selector.go#L488-L492`):

    if leader == nil {
        // The region may be during transferring leader.
        err = bo.Backoff(retry.BoRegionScheduling, errors.Errorf("no leader, ctx: %v", ctx))
        return err == nil, err
    }
