[router] Add Configurable L0 and L1 Tokenizer Caching #11688
Conversation
Summary of Changes
Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a robust and configurable caching layer for the tokenizer within the router, addressing the performance bottleneck of frequent tokenization operations. By implementing both an exact-match (L0) and a prefix-matching (L1) cache, the system can efficiently reuse tokenization results for identical prompts and common conversational prefixes. This enhancement aims to significantly reduce computational overhead and improve overall throughput, especially in workloads characterized by repetitive inputs or shared conversational contexts, while remaining opt-in for user control.
Code Review
This pull request introduces a two-level tokenizer caching system (L0 for exact matches, L1 for prefixes) to enhance performance by reducing redundant tokenization. The implementation is well-structured, with new configurations, CLI arguments, and extensive benchmarks demonstrating significant speedups.
My review focuses on a critical correctness issue in the L1 cache logic that could lead to silent tokenization errors, along with suggestions for improving cache eviction performance and fixing minor issues in the benchmarks and Python argument help text. Overall, this is a valuable performance enhancement, but the L1 cache's correctness needs to be addressed before it's enabled in production.
/gemini review
Code Review
This pull request introduces a well-designed two-level tokenizer caching system, which is a significant performance enhancement. The implementation is thorough, with new configurations, validation, and extensive benchmarks. The L0 exact-match cache and L1 prefix cache are both valuable additions. The code is generally high quality. My review focuses on a few areas for improvement regarding correctness, performance, and maintainability.
/gemini review
Code Review
This pull request introduces a well-designed and configurable two-level tokenizer caching system (L0 for exact matches, L1 for prefix matches), which promises significant performance improvements for tokenization operations. The implementation is robust, using concurrent data structures and sound caching strategies. The addition of comprehensive benchmarks is excellent for validating the performance gains. The configuration options are cleanly integrated into both the CLI and Python arguments.
I've identified a couple of minor opportunities for refactoring to improve code maintainability and reduce redundancy, particularly in the new benchmark suite and the cache implementation. Overall, this is a high-quality contribution that will bring tangible performance benefits.
Summary
This PR introduces a two-level tokenizer caching system that provides significant performance improvements for tokenization operations. The implementation features an L0 exact-match cache and an L1 special-token boundary prefix cache.
Motivation
Tokenization is a frequent operation in the router, and many requests share common patterns, such as identical prompts and shared conversational prefixes.
The two-level cache design balances fast exact-match lookups (L0) with intelligent prefix reuse (L1).
Architecture
L0 Cache (Exact Match)
Capacity is controlled by l0_max_entries (default: 10,000 entries, ~22MB memory).
L1 Cache (Special Token Boundary Prefix Cache)
Memory use is capped by l1_max_memory (default: 50MB).
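As a rough sketch of how these two limits might be grouped into a configuration object (only the parameter names and defaults come from the PR description; the struct and everything else below is an assumption, not the PR's actual types):

```rust
/// Illustrative configuration sketch. Both caches are opt-in; these values
/// are the defaults that apply once a cache level is enabled.
#[derive(Clone, Debug)]
pub struct TokenizerCacheConfig {
    /// Maximum number of exact-match (L0) entries to keep.
    pub l0_max_entries: usize,
    /// Maximum memory, in bytes, for the prefix (L1) cache.
    pub l1_max_memory: usize,
}

impl Default for TokenizerCacheConfig {
    fn default() -> Self {
        Self {
            l0_max_entries: 10_000,          // ~22MB per the PR description
            l1_max_memory: 50 * 1024 * 1024, // 50MB
        }
    }
}
```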
How L1 Cache Works
The L1 cache is a special-token boundary prefix cache that caches tokenization results at every special token boundary. Special tokens (like <|im_start|>, <|im_end|>, <|eot_id|>) are atomic in BPE tokenizers (special: true, normalized: false), making them the ONLY safe split points that guarantee correctness.
Why Special Tokens
BPE tokenizers make context-dependent merge decisions, so tokenize(prefix + suffix) != tokenize(prefix) + tokenize(suffix) for arbitrary boundaries. However, special tokens are atomic and protected from normalization/merging, guaranteeing that tokenize(prefix + suffix) == tokenize(prefix) + tokenize(suffix) when splitting at special token boundaries.
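That guarantee can be phrased as a small property check. The sketch below is illustrative only; the Tokenize trait is an assumed stand-in for the router's actual tokenizer interface, not the PR's code:

```rust
/// Minimal stand-in for the router's tokenizer interface (an assumption,
/// not the PR's actual trait).
trait Tokenize {
    fn encode(&self, text: &str) -> Vec<u32>;
}

/// Returns true if splitting `text` at byte offset `split` (which must be a
/// char boundary) preserves the token sequence, i.e. tokenize(prefix) +
/// tokenize(suffix) equals tokenize(prefix + suffix). For arbitrary offsets
/// this frequently fails with BPE merges; at special-token boundaries it is
/// expected to hold.
fn is_safe_split<T: Tokenize>(tokenizer: &T, text: &str, split: usize) -> bool {
    let (prefix, suffix) = text.split_at(split);
    let mut concatenated = tokenizer.encode(prefix);
    concatenated.extend(tokenizer.encode(suffix));
    concatenated == tokenizer.encode(text)
}
```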
Cache Strategy - Re-Tokenization with All-Boundaries Approach
The L1 cache uses a re-tokenization approach to guarantee correctness. When inserting into the cache after a full tokenization, the text is scanned for special token boundaries, and the prefix ending at each boundary is re-tokenized and cached.
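As a rough sketch of that insertion path (hypothetical names, and a plain HashMap in place of the PR's concurrent, memory-bounded structures):

```rust
use std::collections::HashMap;

/// Find the byte offsets just past each occurrence of a special token.
/// These are the only split points the L1 cache uses.
fn special_boundaries(text: &str, special_tokens: &[&str]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    for tok in special_tokens {
        let mut start = 0;
        while let Some(pos) = text[start..].find(tok) {
            boundaries.push(start + pos + tok.len());
            start += pos + tok.len();
        }
    }
    boundaries.sort_unstable();
    boundaries.dedup();
    boundaries
}

/// Insert a prefix entry for every special-token boundary, re-tokenizing
/// each prefix so the cached tokens are exactly what a full tokenization
/// of that prefix would produce.
fn insert_all_boundaries(
    cache: &mut HashMap<String, Vec<u32>>,
    text: &str,
    special_tokens: &[&str],
    tokenize: impl Fn(&str) -> Vec<u32>,
) {
    for boundary in special_boundaries(text, special_tokens) {
        let prefix = &text[..boundary];
        cache
            .entry(prefix.to_string())
            .or_insert_with(|| tokenize(prefix));
    }
}
```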
This approach ensures 100% correctness because we never assume tokenize(prefix) + tokenize(suffix) == tokenize(prefix + suffix). Instead, we always re-tokenize the prefix to get the actual token sequence that would result from tokenizing prefix + suffix up to the boundary.
Cache Lookup Example:
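A sketch of the lookup path, under the same assumptions as the insertion sketch above (hypothetical names, plain HashMap in place of the real concurrent cache):

```rust
use std::collections::HashMap;

/// Encode `text`, reusing the longest cached prefix that ends at a
/// special-token boundary. On a hit, only the suffix after the cached
/// prefix is tokenized; on a miss, the whole text is tokenized.
fn encode_with_l1(
    cache: &HashMap<String, Vec<u32>>,
    text: &str,
    tokenize: impl Fn(&str) -> Vec<u32>,
) -> Vec<u32> {
    let longest_hit = cache
        .iter()
        .filter(|(prefix, _)| text.starts_with(prefix.as_str()))
        .max_by_key(|(prefix, _)| prefix.len());

    match longest_hit {
        Some((prefix, cached_tokens)) => {
            // Cache hit: reuse the prefix tokens, tokenize only the suffix.
            let mut tokens = cached_tokens.clone();
            tokens.extend(tokenize(&text[prefix.len()..]));
            tokens
        }
        // Cache miss: fall back to tokenizing the full text.
        None => tokenize(text),
    }
}
```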
Key Performance Insight: On cache hit, we avoid re-tokenizing the prefix. On cache miss, we pay the cost of re-tokenizing prefixes once during insertion, then all future requests with that prefix benefit from the cached tokens.
Opt-In
Both caches are disabled by default to maintain backward compatibility. The CachedTokenizer wrapper is only created when at least one cache is enabled; otherwise, the base tokenizer is used directly.
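A minimal sketch of that opt-in wiring, reusing the stand-in Tokenize trait from the earlier sketches; the CachedTokenizer body here is a stub, not the PR's implementation:

```rust
/// Stand-in tokenizer interface, same assumption as in the earlier sketches.
trait Tokenize {
    fn encode(&self, text: &str) -> Vec<u32>;
}

/// Stubbed wrapper; the real CachedTokenizer consults the L0/L1 caches
/// before delegating to the inner tokenizer.
struct CachedTokenizer<T: Tokenize> {
    inner: T,
}

impl<T: Tokenize> Tokenize for CachedTokenizer<T> {
    fn encode(&self, text: &str) -> Vec<u32> {
        // Cache lookups omitted in this sketch.
        self.inner.encode(text)
    }
}

/// Wrap the base tokenizer only when at least one cache level is enabled;
/// otherwise use the base tokenizer directly, adding zero overhead.
fn build_tokenizer<T: Tokenize + 'static>(
    base: T,
    l0_enabled: bool,
    l1_enabled: bool,
) -> Box<dyn Tokenize> {
    if l0_enabled || l1_enabled {
        Box::new(CachedTokenizer { inner: base })
    } else {
        Box::new(base)
    }
}
```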
Changes Made
Tokenizer Factory (src/tokenizer/factory.rs)
Detects special tokens from added_tokens entries with the special: true property.
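A hedged sketch of what such detection could look like when reading a parsed HuggingFace tokenizer.json (the added_tokens, special, and content fields come from the HF tokenizer format; the function itself is illustrative, not the PR's code):

```rust
use serde_json::Value;

/// Collect the content of every added token marked special: true from a
/// parsed tokenizer.json; returns an empty list if the section is absent.
fn collect_special_tokens(tokenizer_json: &Value) -> Vec<String> {
    tokenizer_json["added_tokens"]
        .as_array()
        .map(|tokens| {
            tokens
                .iter()
                .filter(|t| t["special"].as_bool() == Some(true))
                .filter_map(|t| t["content"].as_str().map(|s| s.to_string()))
                .collect()
        })
        .unwrap_or_default()
}
```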
CLI & Configuration
CLI Arguments (src/main.rs)
Bug Fixes
Special Token Detection (src/tokenizer/factory.rs)
Checks added_tokens for the special: true property.
Performance Results
Benchmarks using the Qwen3-4B-Instruct-2507 tokenizer on different workloads:
Benchmark Scenarios
Future Work
Potential enhancements:
Checklist