Preserve needSync Flag on Region Cache Reload Failure to Reduce Error Spikes During Store Issues #1647

@HaoW30

Issue Description

Currently, when a region entry in the TiDB region cache is marked as needSync but the client fails to fetch updated region information from PD, the client falls back to returning the stale region information. It neither retries the load nor preserves the sync intent, because the flags are already cleared by resetSyncFlags before the load is attempted.
Relevant code:

    } else if flags := r.resetSyncFlags(needReloadOnAccess | needDelayedReloadReady); flags > 0 {
        // load region when it be marked as need reload.
        observeLoadRegion(tag, r, expired, flags)
        // NOTE: we can NOT use c.loadRegionByID(bo, r.GetID()) here because the new region (loaded by id) is not
        // guaranteed to contain the key. (ref: https://github.com/tikv/client-go/pull/1299)
        lr, err := c.loadRegion(bo, key, isEndKey)
        if err != nil {
            // ignore error and use old region info.
            logutil.Logger(bo.GetCtx()).Error("load region failure",
                zap.String("key", redact.Key(key)), zap.Error(err),
                zap.String("encode-key", redact.Key(c.codec.EncodeRegionKey(key))))

This fallback mechanism works well in many scenarios: it preserves availability and lets some queries proceed during transient PD issues.

However, it can cause higher error rates under certain conditions, especially for workloads with:

  • Strict max_execution_time settings.
  • TiKV store issues (e.g., EBS latency spikes or partial inaccessibility).
  • Many regions marked as needSync.

In these cases, requests may use stale region information, hit TiKV RPC errors, and then time out or fail without triggering region cache invalidation.
For example, see:

    if leader == nil {
        // The region may be during transferring leader.
        err = bo.Backoff(retry.BoRegionScheduling, errors.Errorf("no leader, ctx: %v", ctx))
        return err == nil, err
    }

If the region cache entry was already marked as needSync and the reload failed, returning the old region info without setting the needSync flag again means subsequent requests keep using the stale entry and never attempt another reload. The result is an unnecessarily elevated error rate and increased latency.

Proposed Enhancement

If a region cache entry is already marked as needSync and the reload fails, the cache should retain or re-set the needSync flag. Subsequent requests then still see the entry as stale and attempt the reload again.
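
The idea can be sketched with a toy model. This is not client-go code: the region struct, the setSyncFlags helper, the locate function, and the load callback (standing in for the PD lookup) are all hypothetical names chosen for illustration, and only the flag names are borrowed from the snippet above.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Flag bits named after the ones in the quoted snippet; the values are illustrative.
const (
	needReloadOnAccess int32 = 1 << iota
	needDelayedReloadReady
)

// errPDUnavailable simulates a failed PD lookup (hypothetical).
var errPDUnavailable = errors.New("pd unavailable")

// region is a toy stand-in for a cached region entry.
type region struct {
	syncFlags int32
	version   int
}

// resetSyncFlags atomically clears the given bits and returns those that were set.
func (r *region) resetSyncFlags(flags int32) int32 {
	for {
		old := atomic.LoadInt32(&r.syncFlags)
		if old&flags == 0 {
			return 0
		}
		if atomic.CompareAndSwapInt32(&r.syncFlags, old, old&^flags) {
			return old & flags
		}
	}
}

// setSyncFlags atomically sets the given bits (hypothetical helper).
func (r *region) setSyncFlags(flags int32) {
	for {
		old := atomic.LoadInt32(&r.syncFlags)
		if old&flags == flags {
			return
		}
		if atomic.CompareAndSwapInt32(&r.syncFlags, old, old|flags) {
			return
		}
	}
}

// locate returns the region to use for a request.
// load is a hypothetical stand-in for fetching fresh region info from PD.
func locate(cached *region, load func() (*region, error)) *region {
	flags := cached.resetSyncFlags(needReloadOnAccess | needDelayedReloadReady)
	if flags == 0 {
		return cached // not marked stale, use as-is
	}
	fresh, err := load()
	if err != nil {
		// Proposed change: restore the cleared flags so the next access
		// retries the reload, while still serving the old entry for now.
		cached.setSyncFlags(flags)
		return cached
	}
	return fresh
}

func main() {
	r := &region{syncFlags: needReloadOnAccess, version: 1}

	// PD lookup fails: the stale entry is returned, but it stays marked.
	got := locate(r, func() (*region, error) { return nil, errPDUnavailable })
	fmt.Println(got.version, atomic.LoadInt32(&r.syncFlags) != 0) // 1 true

	// PD recovers: the next access reloads instead of reusing the stale entry.
	got = locate(r, func() (*region, error) { return &region{version: 2}, nil })
	fmt.Println(got.version) // 2
}
```

One trade-off to consider: with the flag restored on every failure, each access to the entry retries the PD lookup while PD is unavailable, so a real change might want to rate-limit or back off these retries.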

Impact

This change keeps the availability benefit during PD glitches while improving stability and making staleness visible to later requests when store-level issues exist.

  • Reduces prolonged exposure to stale region data during store issues.
  • Improves query reliability under constrained timeout conditions.
  • Helps avoid unnecessary latency spikes and error accumulation.
  • Keeps existing behavior intact for transient PD unavailability.

Labels: affects-8.5 (This bug affects the 8.5.x (LTS) versions.)
