
Conversation

slin1237
Collaborator

@slin1237 slin1237 commented Oct 16, 2025

Summary

This PR introduces a two-level tokenizer caching system that significantly speeds up tokenization in the router. The implementation features an L0 exact-match cache and an L1 special-token-boundary prefix cache.

Motivation

Tokenization is a frequent operation in the router, and many requests share common patterns:

  • Repeated requests: Identical prompts or messages are tokenized multiple times
  • Common prefixes: System prompts, chat templates, and conversation history create shared prefix patterns
  • Performance opportunity: Caching tokenization results can significantly reduce redundant computation

The two-level cache design balances fast exact-match lookups (L0) with intelligent prefix reuse (L1).

Architecture

L0 Cache (Exact Match)

  • Purpose: Fast lookup for identical input strings
  • Implementation: DashMap-based concurrent hash map with lock-free reads
  • Configuration: l0_max_entries (default: 10,000 entries, ~22MB memory)
  • Use case: Repeated identical prompts, common system messages
  • Eviction: Simple arbitrary eviction when capacity is reached (see the sketch below)
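
A minimal sketch of the L0 layer described above, assuming the dashmap crate; the L0Cache name and the "evict whichever entry the iterator yields first" choice are illustrative, not the PR's actual implementation:

```rust
use dashmap::DashMap;

/// Illustrative L0 exact-match cache: full input string -> full token-ID sequence.
struct L0Cache {
    entries: DashMap<String, Vec<u32>>,
    max_entries: usize,
}

impl L0Cache {
    fn new(max_entries: usize) -> Self {
        Self { entries: DashMap::new(), max_entries }
    }

    /// Read path: return the cached token IDs for an identical input, if present.
    fn get(&self, input: &str) -> Option<Vec<u32>> {
        self.entries.get(input).map(|entry| entry.value().clone())
    }

    /// Insert a full tokenization result, evicting an arbitrary entry at capacity.
    fn insert(&self, input: String, tokens: Vec<u32>) {
        if self.entries.len() >= self.max_entries {
            // Pick any resident key, dropping the iterator guard before removing.
            let victim = self.entries.iter().next().map(|entry| entry.key().clone());
            if let Some(key) = victim {
                self.entries.remove(&key);
            }
        }
        self.entries.insert(input, tokens);
    }
}
```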

L1 Cache (Special Token Boundary Prefix Cache)

  • Purpose: Reuse tokenization results for shared prefixes at special token boundaries
  • Implementation: Re-tokenization approach - caches prefix tokens by re-tokenizing at ALL special token boundaries
  • Configuration: l1_max_memory (default: 50MB)
  • Use case: Chat conversations with shared system prompts, multi-turn interactions
  • Eviction: LRU eviction with memory tracking (see the sketch after this list)
  • Correctness: Guarantees 100% correctness by re-tokenizing prefixes (not storing raw strings)
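
A minimal sketch of a memory-budgeted LRU store for this level, assuming the lru crate and a rough byte-accounting scheme (the crate choice, the L1Store name, and the cost model are all assumptions); keys are Blake3 hashes of prefix strings, as described in the next section:

```rust
use lru::LruCache;

/// Illustrative L1 store: Blake3 hash of a prefix string -> prefix token IDs,
/// evicting least-recently-used prefixes against an approximate memory budget.
struct L1Store {
    entries: LruCache<[u8; 32], Vec<u32>>,
    current_bytes: usize,
    max_bytes: usize,
}

impl L1Store {
    fn new(max_bytes: usize) -> Self {
        Self { entries: LruCache::unbounded(), current_bytes: 0, max_bytes }
    }

    fn cost(tokens: &[u32]) -> usize {
        32 + tokens.len() * std::mem::size_of::<u32>() // key bytes + token bytes
    }

    fn insert(&mut self, key: [u8; 32], tokens: Vec<u32>) {
        self.current_bytes += Self::cost(&tokens);
        if let Some(old) = self.entries.put(key, tokens) {
            self.current_bytes -= Self::cost(&old);
        }
        // Evict LRU prefixes until we are back under the configured budget.
        while self.current_bytes > self.max_bytes {
            match self.entries.pop_lru() {
                Some((_key, old)) => self.current_bytes -= Self::cost(&old),
                None => break,
            }
        }
    }

    fn get(&mut self, key: &[u8; 32]) -> Option<&Vec<u32>> {
        self.entries.get(key) // also bumps recency
    }
}
```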

How L1 Cache Works

L1 cache is a special-token boundary prefix cache that caches tokenization results at every special token boundary. Special tokens (like <|im_start|>, <|im_end|>, <|eot_id|>) are atomic in BPE tokenizers (special: true, normalized: false), making them the ONLY safe split points that guarantee correctness.

Why Special Tokens

BPE tokenizers make context-dependent merge decisions, so tokenize(prefix + suffix) != tokenize(prefix) + tokenize(suffix) for arbitrary boundaries. However, special tokens are atomic and protected from normalization/merging, guaranteeing:

tokenize(prefix + suffix) == tokenize(prefix) + tokenize(suffix)

when splitting at special token boundaries.
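
A hedged way to sanity-check this property against a concrete tokenizer, assuming the Hugging Face tokenizers crate and a local tokenizer.json whose chat markers are special tokens (the path and the example strings are placeholders):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: any tokenizer.json with <|im_start|>/<|im_end|> as special tokens.
    let tok = Tokenizer::from_file("tokenizer.json")?;

    let prefix = "<|im_start|>system\nYou are helpful.<|im_end|>";
    let suffix = "<|im_start|>user\nHello!<|im_end|>";

    let full = tok.encode(format!("{prefix}{suffix}"), false)?.get_ids().to_vec();
    let mut stitched = tok.encode(prefix, false)?.get_ids().to_vec();
    stitched.extend_from_slice(tok.encode(suffix, false)?.get_ids());

    // Holds when the split point is a special-token boundary;
    // an arbitrary mid-text split would generally fail this check.
    assert_eq!(full, stitched);
    Ok(())
}
```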

Cache Strategy - Re-Tokenization with All-Boundaries Approach

The L1 cache uses a re-tokenization approach to guarantee correctness. When inserting into the cache after a full tokenization:

  1. Find all special token boundaries in the input string
  2. For each boundary: Extract the prefix substring up to that boundary
  3. Re-tokenize the prefix to get the exact token sequence (BPE-safe)
  4. Cache the prefix tokens with Blake3 hash of the prefix string as the key

This approach ensures 100% correctness because we never assume tokenize(prefix) + tokenize(suffix) == tokenize(prefix + suffix). Instead, we always re-tokenize the prefix to get the actual token sequence that would result from tokenizing prefix + suffix up to the boundary.

Input: "<|im_start|>system\nYou are helpful.<|im_end|><|im_start|>user\nHello!<|im_end|>"

Special token boundaries found:
1. After "<|im_start|>" at position 13
2. After "<|im_end|>" at position 45
3. After "<|im_start|>" at position 58
4. After "<|im_end|>" at position 72

For each boundary:
- Extract prefix string (e.g., "<|im_start|>system\nYou are helpful.<|im_end|>")
- Re-tokenize prefix → get exact token IDs
- Cache: hash(prefix_string) → prefix_tokens
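
A sketch of the insert path walked through above, assuming the special-token strings are known up front and that `encode` closes over the underlying tokenizer; `special_token_boundaries`, `populate_prefixes`, and `L1Store` (from the earlier sketch) are illustrative names, not the PR's actual API:

```rust
/// Byte offsets immediately after every occurrence of any special token.
fn special_token_boundaries(input: &str, special_tokens: &[String]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    for tok in special_tokens {
        let mut start = 0;
        while let Some(pos) = input[start..].find(tok.as_str()) {
            let end = start + pos + tok.len();
            boundaries.push(end);
            start = end;
        }
    }
    boundaries.sort_unstable();
    boundaries.dedup();
    boundaries
}

/// After a full (uncached) tokenization, re-tokenize and cache the prefix that
/// ends at each special-token boundary, keyed by the Blake3 hash of the prefix.
fn populate_prefixes(
    input: &str,
    special_tokens: &[String],
    encode: &impl Fn(&str) -> Vec<u32>,
    store: &mut L1Store,
) {
    for end in special_token_boundaries(input, special_tokens) {
        let prefix = &input[..end];
        let prefix_tokens = encode(prefix); // re-tokenize: never stitch assumptions
        let key = *blake3::hash(prefix.as_bytes()).as_bytes();
        store.insert(key, prefix_tokens);
    }
}
```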

Cache Lookup Example:

Input: "<|im_start|>system\nYou are helpful.<|im_end|><|im_start|>user\nWhat is 2+2?<|im_end|>"

1. Find all special token boundaries in input string
2. For each boundary (longest to shortest):
   - Extract prefix substring up to boundary
   - Hash prefix with Blake3
   - Look up hash in cache
   - If HIT: Use cached prefix tokens + tokenize remaining suffix → merge and return
   - If MISS: Try next shorter prefix
3. If no prefix match:
   - Tokenize full string
   - Re-tokenize and cache prefixes at ALL boundaries for future requests

Key Performance Insight: On cache hit, we avoid re-tokenizing the prefix. On cache miss, we pay the cost of re-tokenizing prefixes once during insertion, then all future requests with that prefix benefit from the cached tokens.
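
Putting the two halves together, a sketch of the lookup path using the same illustrative helpers as above (the real code paths and names in this PR may differ):

```rust
/// Probe cached prefixes from the longest boundary down to the shortest; on a
/// hit, tokenize only the remaining suffix and splice it onto the cached tokens.
fn encode_with_prefix_cache(
    input: &str,
    special_tokens: &[String],
    encode: &impl Fn(&str) -> Vec<u32>,
    store: &mut L1Store,
) -> Vec<u32> {
    let mut boundaries = special_token_boundaries(input, special_tokens);
    boundaries.reverse(); // longest prefix first

    for end in boundaries {
        let prefix = &input[..end];
        let key = *blake3::hash(prefix.as_bytes()).as_bytes();
        if let Some(prefix_tokens) = store.get(&key) {
            let mut tokens = prefix_tokens.clone();
            tokens.extend(encode(&input[end..])); // tokenize the suffix only
            return tokens;
        }
    }

    // Miss at every boundary: tokenize the full string once, then pay the
    // one-time cost of caching its prefixes for future requests.
    let tokens = encode(input);
    populate_prefixes(input, special_tokens, encode, store);
    tokens
}
```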

Opt-In

Both caches are disabled by default to maintain backward compatibility. The CachedTokenizer wrapper is only created when at least one cache is enabled; otherwise, the base tokenizer is used directly.

Changes Made

Tokenizer Factory (src/tokenizer/factory.rs)

  • Fixed special token detection in HuggingFace tokenizer
  • Properly extracts added_tokens with special: true property
  • Ensures special tokens are correctly identified for the L1 cache (see the sketch below)
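
A sketch of the kind of extraction this enables, assuming the tokenizers crate's `get_added_tokens_decoder()` accessor and public `AddedToken` fields (the exact API surface depends on the crate version):

```rust
use std::collections::HashSet;
use tokenizers::Tokenizer;

/// Collect the string contents of added tokens flagged `special: true`
/// (e.g. "<|im_start|>", "<|im_end|>", "<|eot_id|>"); these become the only
/// split points the L1 prefix cache is allowed to use.
fn special_token_strings(tokenizer: &Tokenizer) -> HashSet<String> {
    tokenizer
        .get_added_tokens_decoder()
        .values()
        .filter(|added| added.special)
        .map(|added| added.content.clone())
        .collect()
}
```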

CLI & Configuration

CLI Arguments (src/main.rs)

--tokenizer-cache-enable-l0              # Enable L0 exact-match cache
--tokenizer-cache-l0-max-entries <N>     # L0 cache size (default: 10000)
--tokenizer-cache-enable-l1              # Enable L1 prefix cache
--tokenizer-cache-l1-max-memory <BYTES>  # L1 max memory (default: 52428800)

Bug Fixes

Special Token Detection (src/tokenizer/factory.rs)

  • Fixed HuggingFace tokenizer special token extraction
  • Properly filters added_tokens for special: true property
  • Ensures L1 cache has correct special tokens to work with

Performance Results

Benchmarks using the Qwen3-4B-Instruct-2507 tokenizer across different workloads:

Benchmark Scenarios

| Scenario | Configuration | Throughput | Latency | L1 Hit Rate | L0 Hit Rate | Speedup vs Baseline |
|---|---|---|---|---|---|---|
| Realistic Chat (95%+ prefix reuse) | L0+L1 | 18,751 ops/sec | 53.3µs | 99.2% | 0.8% | 18.2x |
| Customer Service (100% prefix reuse) | L1-only | 21,359 ops/sec | 46.8µs | 88.9% | N/A | 22.7x |
| Customer Service (100% prefix reuse) | L0+L1 | 19,494 ops/sec | 51.2µs | 87.5% | 11.1% | 21.1x |
| Multi-turn Conversation | L0+L1 | 4,469 ops/sec | 223.8µs | 66.7% | 0.0% | 4.3x |
| Code Review | L0+L1 | 21,113 ops/sec | 47.4µs | 75.0% | 0.0% | 21.1x |
| Baseline (no cache) | None | 940 ops/sec | 1,064µs | N/A | N/A | 1.0x |

Future Work

Potential enhancements:

  • Metrics & observability: Expose Prometheus metrics for cache hit/miss rates
  • Per-model configuration: Different cache settings for different tokenizers
  • Cache warming: Pre-populate cache with common patterns at startup
  • Dynamic tuning: Auto-adjust cache sizes based on workload
  • Cross-request sharing: Share cache across multiple router instances

Checklist

Contributor

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and configurable caching layer for the tokenizer within the router, addressing the performance bottleneck of frequent tokenization operations. By implementing both an exact-match (L0) and a prefix-matching (L1) cache, the system can efficiently reuse tokenization results for identical prompts and common conversational prefixes. This enhancement aims to significantly reduce computational overhead and improve overall throughput, especially in workloads characterized by repetitive inputs or shared conversational contexts, while remaining opt-in for user control.

Highlights

  • Two-Level Tokenizer Caching: Introduced a configurable two-level tokenizer caching mechanism to significantly reduce redundant tokenization computation. This includes an L0 cache for exact string matches and an L1 cache for shared prefix matching.
  • L0 Exact Match Cache: Implemented an L0 cache that stores full tokenization results for identical input strings, providing fast lookups for repeated prompts and common system messages. It is configured by l0_max_entries (default: 10,000 entries).
  • L1 Fixed-Boundary Prefix Cache: Developed an L1 cache designed to reuse tokenization results for shared prefixes, particularly beneficial for chat conversations with common system prompts or history. It uses Blake3 hashing, fixed 128-byte boundaries, and LRU eviction, configurable by l1_max_memory (default: 50MB) and l1_granularity (default: 128 bytes).
  • Opt-In Design and Configuration: Both L0 and L1 caches are disabled by default to maintain backward compatibility. They can be enabled and configured via new CLI arguments, Python bindings, or configuration files, ensuring flexibility for users.
  • Performance Benchmarks: Included new benchmarks demonstrating substantial performance improvements, with L1 cache achieving 70-85% hit rates for chat templates and 2-6x speedup for requests with cached prefixes in production-scale scenarios.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a two-level tokenizer caching system (L0 for exact matches, L1 for prefixes) to enhance performance by reducing redundant tokenization. The implementation is well-structured, with new configurations, CLI arguments, and extensive benchmarks demonstrating significant speedups.

My review focuses on a critical correctness issue in the L1 cache logic that could lead to silent tokenization errors, along with suggestions for improving cache eviction performance and fixing minor issues in the benchmarks and Python argument help text. Overall, this is a valuable performance enhancement, but the L1 cache's correctness needs to be addressed before it's enabled in production.

@slin1237
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-designed two-level tokenizer caching system, which is a significant performance enhancement. The implementation is thorough, with new configurations, validation, and extensive benchmarks. The L0 exact-match cache and L1 prefix cache are both valuable additions. The code is generally high quality. My review focuses on a few areas for improvement regarding correctness, performance, and maintainability.

@slin1237 slin1237 force-pushed the tokenizer branch 2 times, most recently from 050fd78 to a526723 on October 17, 2025 02:07
@slin1237
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-designed and configurable two-level tokenizer caching system (L0 for exact matches, L1 for prefix matches), which promises significant performance improvements for tokenization operations. The implementation is robust, using concurrent data structures and sound caching strategies. The addition of comprehensive benchmarks is excellent for validating the performance gains. The configuration options are cleanly integrated into both the CLI and Python arguments.

I've identified a couple of minor opportunities for refactoring to improve code maintainability and reduce redundancy, particularly in the new benchmark suite and the cache implementation. Overall, this is a high-quality contribution that will bring tangible performance benefits.

@CatherineSue
Collaborator

Left: CachedTokenizer vs Right: Non-cached
Model: Llama-4-Maverick-FP8
Hardware: 8*H100
Benchmark: lm_eval mmlu_pro
[Screenshot: lm_eval mmlu_pro results, CachedTokenizer (left) vs non-cached (right), 2025-10-18]

@slin1237 slin1237 merged commit a7ae61e into main Oct 19, 2025
34 checks passed
@slin1237 slin1237 deleted the tokenizer branch October 19, 2025 01:33