Skip to content

Bug fix: Avoid race condition by checking for duplicate entries in bucket stores user scan #6863

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

justinjung04
Copy link
Contributor

@justinjung04 justinjung04 commented Jul 4, 2025

What this PR does:

We saw store gateways terminating due to map concurrent write:

fatal error: concurrent map writes

I was able to reproduce the race condition by forcing scanUsers to return duplicate user IDs in the list:

func TestBucketStores_RaceCheck(t *testing.T) {
	t.Parallel()

	type job struct {
		userID string
		store  *store.BucketStore
	}

	wg := &sync.WaitGroup{}
	jobs := make(chan job)

	const userID = "user-1"
	ctx := context.Background()
	cfg := prepareStorageConfig(t)
	storageDir := t.TempDir()

	bkt, err := filesystem.NewBucketClient(filesystem.Config{Directory: storageDir})
	require.NoError(t, err)

	reg := prometheus.NewPedanticRegistry()
	stores, err := NewBucketStores(cfg, NewNoShardingStrategy(log.NewNopLogger(), nil), objstore.WithNoopInstr(bkt), defaultLimitsOverrides(t), mockLoggingLevel(), log.NewNopLogger(), reg)
	require.NoError(t, err)

	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()

			for job := range jobs {
				job.store.SyncBlocks(ctx)
			}
		}()
	}

	for i := 0; i < 10; i++ {
		bs, _ := stores.getOrCreateStore(userID)
		jobs <- job{userID: userID, store: bs}
	}

	close(jobs)
	wg.Wait()
}
==================
WARNING: DATA RACE
Write at 0x00c0008962c0 by goroutine 80:
  github.com/cortexproject/cortex/pkg/storegateway.(*shardingMetadataFilterAdapter).Filter()
      cortex/pkg/storegateway/sharding_strategy.go:280 +0xfc
  github.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch()
      cortex/vendor/github.com/thanos-io/thanos/pkg/block/fetcher.go:617 +0x54c
  github.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch()
      cortex/vendor/github.com/thanos-io/thanos/pkg/block/fetcher.go:658 +0xd4
  github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks()
      cortex/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go:737 +0x60
  github.com/cortexproject/cortex/pkg/storegateway.TestBucketStores_RaceCheck.func1()
      cortex/pkg/storegateway/bucket_stores_test.go:487 +0xb8

Previous write at 0x00c0008962c0 by goroutine 78:
  github.com/cortexproject/cortex/pkg/storegateway.(*shardingMetadataFilterAdapter).Filter()
      cortex/pkg/storegateway/sharding_strategy.go:280 +0xfc
  github.com/thanos-io/thanos/pkg/block.(*BaseFetcher).fetch()
      cortex/vendor/github.com/thanos-io/thanos/pkg/block/fetcher.go:617 +0x54c
  github.com/thanos-io/thanos/pkg/block.(*MetaFetcher).Fetch()
      cortex/vendor/github.com/thanos-io/thanos/pkg/block/fetcher.go:658 +0xd4
  github.com/thanos-io/thanos/pkg/store.(*BucketStore).SyncBlocks()
      cortex/vendor/github.com/thanos-io/thanos/pkg/store/bucket.go:737 +0x60
  github.com/cortexproject/cortex/pkg/storegateway.TestBucketStores_RaceCheck.func1()
      cortex/pkg/storegateway/bucket_stores_test.go:487 +0xb8

Goroutine 80 (running) created at:
  github.com/cortexproject/cortex/pkg/storegateway.TestBucketStores_RaceCheck()
      cortex/pkg/storegateway/bucket_stores_test.go:483 +0x3cc
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.24.1/libexec/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /opt/homebrew/Cellar/go/1.24.1/libexec/src/testing/testing.go:1851 +0x40

Goroutine 78 (running) created at:
  github.com/cortexproject/cortex/pkg/storegateway.TestBucketStores_RaceCheck()
      cortex/pkg/storegateway/bucket_stores_test.go:483 +0x3cc
  testing.tRunner()
      /opt/homebrew/Cellar/go/1.24.1/libexec/src/testing/testing.go:1792 +0x180
  testing.(*T).Run.gowrap1()
      /opt/homebrew/Cellar/go/1.24.1/libexec/src/testing/testing.go:1851 +0x40
==================

To avoid such fatal error, I've added a check for duplicate users.

Which issue(s) this PR fixes:
n/a

Checklist

  • Tests updated
  • [n/a] Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@justinjung04 justinjung04 changed the title Make scanUser always return unique user IDs to avoid concurrent map w… Bug fix: Check for duplicate entries in bucket store user scan Jul 4, 2025
@justinjung04 justinjung04 changed the title Bug fix: Check for duplicate entries in bucket store user scan Bug fix: Check for duplicate entries in bucket stores user scan Jul 4, 2025
@justinjung04 justinjung04 changed the title Bug fix: Check for duplicate entries in bucket stores user scan Bug fix: Check for duplicate entries in bucket stores user scan to avoid race condition Jul 4, 2025
@justinjung04 justinjung04 changed the title Bug fix: Check for duplicate entries in bucket stores user scan to avoid race condition Bug fix: Avoid race condition by checking for duplicate entries in bucket stores user scan Jul 4, 2025
Signed-off-by: Justin Jung <[email protected]>
@SungJin1212
Copy link
Member

Thanks for catching!
How could this scenario arise?

@yeya24
Copy link
Contributor

yeya24 commented Jul 7, 2025

Umm, I think this becomes unnecessary now after refactoring the user scanner code. Before the new user scanner, in Store Gateway we simply gets all users by listing all prefixes in object store. There is no deduplication logic in the scanUsers function below.

func (u *BucketStores) scanUsers(ctx context.Context) ([]string, error) {
	var users []string

	// Iterate the bucket to find all users in the bucket. Due to how the bucket listing
	// caching works, it's more likely to have a cache hit if there's no delay while
	// iterating the bucket, so we do load all users in memory and later process them.
	err := u.bucket.Iter(ctx, "", func(s string) error {
		users = append(users, strings.TrimSuffix(s, "/"))
		return nil
	})
        return users, err
}

After #6780, we introduced maps to deduplicate users from object store List API response so seems it is impossible to happen anymore. https://github.com/cortexproject/cortex/blob/master/pkg/storage/tsdb/users/scanner.go#L62

@justinjung04
Copy link
Contributor Author

Thanks for catching! How could this scenario arise?

We saw this before #6780, when in certain edge case scanUsers were returning duplicate entries for the users.

@justinjung04
Copy link
Contributor Author

After #6780, we introduced maps to deduplicate users from object store List API response so seems it is impossible to happen anymore. https://github.com/cortexproject/cortex/blob/master/pkg/storage/tsdb/users/scanner.go#L62

Makes sense, but is there a way to force Iter to always return unique elements? For example, could someone extend Iter which causes store gateway to have runtime error again?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants