
Conversation

slin1237
Collaborator

@slin1237 slin1237 commented Oct 16, 2025

Summary

This PR introduces a two-level tokenizer caching system that significantly speeds up tokenization in the router. The implementation features an L0 exact-match cache and an L1 special-token-boundary prefix cache.

Motivation

Tokenization is a frequent operation in the router, and many requests share common patterns:

  • Repeated requests: Identical prompts or messages are tokenized multiple times
  • Common prefixes: System prompts, chat templates, and conversation history create shared prefix patterns
  • Performance opportunity: Caching tokenization results can significantly reduce redundant computation

The two-level cache design balances fast exact-match lookups (L0) with intelligent prefix reuse (L1).

Architecture

L0 Cache (Exact Match)

  • Purpose: Fast lookup for identical input strings
  • Implementation: DashMap-based concurrent hash map with lock-free reads
  • Configuration: l0_max_entries (default: 10,000 entries, ~22MB memory)
  • Use case: Repeated identical prompts, common system messages
  • Eviction: Simple arbitrary eviction when capacity is reached (see the sketch below)
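
A minimal sketch of the L0 layer described above, assuming the dashmap crate; the L0Cache name and the "evict whichever entry the iterator yields first" choice are illustrative, not the PR's actual implementation:

```rust
use dashmap::DashMap;

/// Illustrative L0 exact-match cache: full input string -> full token-ID sequence.
struct L0Cache {
    entries: DashMap<String, Vec<u32>>,
    max_entries: usize,
}

impl L0Cache {
    fn new(max_entries: usize) -> Self {
        Self { entries: DashMap::new(), max_entries }
    }

    /// Read path: return the cached token IDs for an identical input, if present.
    fn get(&self, input: &str) -> Option<Vec<u32>> {
        self.entries.get(input).map(|entry| entry.value().clone())
    }

    /// Insert a full tokenization result, evicting an arbitrary entry at capacity.
    fn insert(&self, input: String, tokens: Vec<u32>) {
        if self.entries.len() >= self.max_entries {
            // Pick any resident key, dropping the iterator guard before removing.
            let victim = self.entries.iter().next().map(|entry| entry.key().clone());
            if let Some(key) = victim {
                self.entries.remove(&key);
            }
        }
        self.entries.insert(input, tokens);
    }
}
```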

L1 Cache (Special Token Boundary Prefix Cache)

  • Purpose: Reuse tokenization results for shared prefixes at special token boundaries
  • Implementation: Re-tokenization approach - caches prefix tokens by re-tokenizing at ALL special token boundaries
  • Configuration: l1_max_memory (default: 50MB)
  • Use case: Chat conversations with shared system prompts, multi-turn interactions
  • Eviction: LRU eviction with memory tracking (see the sketch after this list)
  • Correctness: Guarantees 100% correctness by re-tokenizing prefixes (not storing raw strings)
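
A minimal sketch of a memory-budgeted LRU store for this level, assuming the lru crate and a rough byte-accounting scheme (the crate choice, the L1Store name, and the cost model are all assumptions); keys are Blake3 hashes of prefix strings, as described in the next section:

```rust
use lru::LruCache;

/// Illustrative L1 store: Blake3 hash of a prefix string -> prefix token IDs,
/// evicting least-recently-used prefixes against an approximate memory budget.
struct L1Store {
    entries: LruCache<[u8; 32], Vec<u32>>,
    current_bytes: usize,
    max_bytes: usize,
}

impl L1Store {
    fn new(max_bytes: usize) -> Self {
        Self { entries: LruCache::unbounded(), current_bytes: 0, max_bytes }
    }

    fn cost(tokens: &[u32]) -> usize {
        32 + tokens.len() * std::mem::size_of::<u32>() // key bytes + token bytes
    }

    fn insert(&mut self, key: [u8; 32], tokens: Vec<u32>) {
        self.current_bytes += Self::cost(&tokens);
        if let Some(old) = self.entries.put(key, tokens) {
            self.current_bytes -= Self::cost(&old);
        }
        // Evict LRU prefixes until we are back under the configured budget.
        while self.current_bytes > self.max_bytes {
            match self.entries.pop_lru() {
                Some((_key, old)) => self.current_bytes -= Self::cost(&old),
                None => break,
            }
        }
    }

    fn get(&mut self, key: &[u8; 32]) -> Option<&Vec<u32>> {
        self.entries.get(key) // also bumps recency
    }
}
```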

How L1 Cache Works

L1 cache is a special-token boundary prefix cache that caches tokenization results at every special token boundary. Special tokens (like <|im_start|>, <|im_end|>, <|eot_id|>) are atomic in BPE tokenizers (special: true, normalized: false), making them the ONLY safe split points that guarantee correctness.

Why Special Tokens

BPE tokenizers make context-dependent merge decisions, so tokenize(prefix + suffix) != tokenize(prefix) + tokenize(suffix) for arbitrary boundaries. However, special tokens are atomic and protected from normalization/merging, guaranteeing:

tokenize(prefix + suffix) == tokenize(prefix) + tokenize(suffix)

when splitting at special token boundaries.
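
A hedged way to sanity-check this property against a concrete tokenizer, assuming the Hugging Face tokenizers crate and a local tokenizer.json whose chat markers are special tokens (the path and the example strings are placeholders):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Placeholder path: any tokenizer.json with <|im_start|>/<|im_end|> as special tokens.
    let tok = Tokenizer::from_file("tokenizer.json")?;

    let prefix = "<|im_start|>system\nYou are helpful.<|im_end|>";
    let suffix = "<|im_start|>user\nHello!<|im_end|>";

    let full = tok.encode(format!("{prefix}{suffix}"), false)?.get_ids().to_vec();
    let mut stitched = tok.encode(prefix, false)?.get_ids().to_vec();
    stitched.extend_from_slice(tok.encode(suffix, false)?.get_ids());

    // Holds when the split point is a special-token boundary;
    // an arbitrary mid-text split would generally fail this check.
    assert_eq!(full, stitched);
    Ok(())
}
```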

Cache Strategy - Re-Tokenization with All-Boundaries Approach

The L1 cache uses a re-tokenization approach to guarantee correctness. When inserting into the cache after a full tokenization:

  1. Find all special token boundaries in the input string
  2. For each boundary: Extract the prefix substring up to that boundary
  3. Re-tokenize the prefix to get the exact token sequence (BPE-safe)
  4. Cache the prefix tokens with Blake3 hash of the prefix string as the key

This approach ensures 100% correctness because we never assume tokenize(prefix) + tokenize(suffix) == tokenize(prefix + suffix). Instead, we always re-tokenize the prefix to get the actual token sequence that would result from tokenizing prefix + suffix up to the boundary.

Input: "<|im_start|>system\nYou are helpful.<|im_end|><|im_start|>user\nHello!<|im_end|>"

Special token boundaries found:
1. After "<|im_start|>" at position 13
2. After "<|im_end|>" at position 45
3. After "<|im_start|>" at position 58
4. After "<|im_end|>" at position 72

For each boundary:
- Extract prefix string (e.g., "<|im_start|>system\nYou are helpful.<|im_end|>")
- Re-tokenize prefix → get exact token IDs
- Cache: hash(prefix_string) → prefix_tokens
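
A sketch of the insert path walked through above, assuming the special-token strings are known up front and that `encode` closes over the underlying tokenizer; `special_token_boundaries`, `populate_prefixes`, and `L1Store` (from the earlier sketch) are illustrative names, not the PR's actual API:

```rust
/// Byte offsets immediately after every occurrence of any special token.
fn special_token_boundaries(input: &str, special_tokens: &[String]) -> Vec<usize> {
    let mut boundaries = Vec::new();
    for tok in special_tokens {
        let mut start = 0;
        while let Some(pos) = input[start..].find(tok.as_str()) {
            let end = start + pos + tok.len();
            boundaries.push(end);
            start = end;
        }
    }
    boundaries.sort_unstable();
    boundaries.dedup();
    boundaries
}

/// After a full (uncached) tokenization, re-tokenize and cache the prefix that
/// ends at each special-token boundary, keyed by the Blake3 hash of the prefix.
fn populate_prefixes(
    input: &str,
    special_tokens: &[String],
    encode: &impl Fn(&str) -> Vec<u32>,
    store: &mut L1Store,
) {
    for end in special_token_boundaries(input, special_tokens) {
        let prefix = &input[..end];
        let prefix_tokens = encode(prefix); // re-tokenize: never stitch assumptions
        let key = *blake3::hash(prefix.as_bytes()).as_bytes();
        store.insert(key, prefix_tokens);
    }
}
```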

Cache Lookup Example:

Input: "<|im_start|>system\nYou are helpful.<|im_end|><|im_start|>user\nWhat is 2+2?<|im_end|>"

1. Find all special token boundaries in input string
2. For each boundary (longest to shortest):
   - Extract prefix substring up to boundary
   - Hash prefix with Blake3
   - Look up hash in cache
   - If HIT: Use cached prefix tokens + tokenize remaining suffix → merge and return
   - If MISS: Try next shorter prefix
3. If no prefix match:
   - Tokenize full string
   - Re-tokenize and cache prefixes at ALL boundaries for future requests

Key Performance Insight: On cache hit, we avoid re-tokenizing the prefix. On cache miss, we pay the cost of re-tokenizing prefixes once during insertion, then all future requests with that prefix benefit from the cached tokens.
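
Putting the two halves together, a sketch of the lookup path using the same illustrative helpers as above (the real code paths and names in this PR may differ):

```rust
/// Probe cached prefixes from the longest boundary down to the shortest; on a
/// hit, tokenize only the remaining suffix and splice it onto the cached tokens.
fn encode_with_prefix_cache(
    input: &str,
    special_tokens: &[String],
    encode: &impl Fn(&str) -> Vec<u32>,
    store: &mut L1Store,
) -> Vec<u32> {
    let mut boundaries = special_token_boundaries(input, special_tokens);
    boundaries.reverse(); // longest prefix first

    for end in boundaries {
        let prefix = &input[..end];
        let key = *blake3::hash(prefix.as_bytes()).as_bytes();
        if let Some(prefix_tokens) = store.get(&key) {
            let mut tokens = prefix_tokens.clone();
            tokens.extend(encode(&input[end..])); // tokenize the suffix only
            return tokens;
        }
    }

    // Miss at every boundary: tokenize the full string once, then pay the
    // one-time cost of caching its prefixes for future requests.
    let tokens = encode(input);
    populate_prefixes(input, special_tokens, encode, store);
    tokens
}
```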

Opt-In

Both caches are disabled by default to maintain backward compatibility. The CachedTokenizer wrapper is only created when at least one cache is enabled; otherwise, the base tokenizer is used directly.

Changes Made

Tokenizer Factory (src/tokenizer/factory.rs)

  • Fixed special token detection in HuggingFace tokenizer
  • Properly extracts added_tokens with special: true property
  • Ensures special tokens are correctly identified for the L1 cache (see the sketch below)
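
A sketch of the kind of extraction this enables, assuming the tokenizers crate's `get_added_tokens_decoder()` accessor and public `AddedToken` fields (the exact API surface depends on the crate version):

```rust
use std::collections::HashSet;
use tokenizers::Tokenizer;

/// Collect the string contents of added tokens flagged `special: true`
/// (e.g. "<|im_start|>", "<|im_end|>", "<|eot_id|>"); these become the only
/// split points the L1 prefix cache is allowed to use.
fn special_token_strings(tokenizer: &Tokenizer) -> HashSet<String> {
    tokenizer
        .get_added_tokens_decoder()
        .values()
        .filter(|added| added.special)
        .map(|added| added.content.clone())
        .collect()
}
```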

CLI & Configuration

CLI Arguments (src/main.rs)

--tokenizer-cache-enable-l0              # Enable L0 exact-match cache
--tokenizer-cache-l0-max-entries <N>     # L0 cache size (default: 10000)
--tokenizer-cache-enable-l1              # Enable L1 prefix cache
--tokenizer-cache-l1-max-memory <BYTES>  # L1 max memory (default: 52428800)

Bug Fixes

Special Token Detection (src/tokenizer/factory.rs)

  • Fixed HuggingFace tokenizer special token extraction
  • Properly filters added_tokens for special: true property
  • Ensures L1 cache has correct special tokens to work with

Performance Results

Benchmarks using the Qwen3-4B-Instruct-2507 tokenizer across different workloads:

Benchmark Scenarios

| Scenario | Configuration | Throughput | Latency | L1 Hit Rate | L0 Hit Rate | Speedup vs Baseline |
|---|---|---|---|---|---|---|
| Realistic Chat (95%+ prefix reuse) | L0+L1 | 18,751 ops/sec | 53.3µs | 99.2% | 0.8% | 18.2x |
| Customer Service (100% prefix reuse) | L1-only | 21,359 ops/sec | 46.8µs | 88.9% | N/A | 22.7x |
| Customer Service (100% prefix reuse) | L0+L1 | 19,494 ops/sec | 51.2µs | 87.5% | 11.1% | 21.1x |
| Multi-turn Conversation | L0+L1 | 4,469 ops/sec | 223.8µs | 66.7% | 0.0% | 4.3x |
| Code Review | L0+L1 | 21,113 ops/sec | 47.4µs | 75.0% | 0.0% | 21.1x |
| Baseline (no cache) | None | 940 ops/sec | 1,064µs | N/A | N/A | 1.0x |

Future Work

Potential enhancements:

  • Metrics & observability: Expose Prometheus metrics for cache hit/miss rates
  • Per-model configuration: Different cache settings for different tokenizers
  • Cache warming: Pre-populate cache with common patterns at startup
  • Dynamic tuning: Auto-adjust cache sizes based on workload
  • Cross-request sharing: Share cache across multiple router instances

Checklist

Contributor

Summary of Changes

Hello @slin1237, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a robust and configurable caching layer for the tokenizer within the router, addressing the performance bottleneck of frequent tokenization operations. By implementing both an exact-match (L0) and a prefix-matching (L1) cache, the system can efficiently reuse tokenization results for identical prompts and common conversational prefixes. This enhancement aims to significantly reduce computational overhead and improve overall throughput, especially in workloads characterized by repetitive inputs or shared conversational contexts, while remaining opt-in for user control.

Highlights

  • Two-Level Tokenizer Caching: Introduced a configurable two-level tokenizer caching mechanism to significantly reduce redundant tokenization computation. This includes an L0 cache for exact string matches and an L1 cache for shared prefix matching.
  • L0 Exact Match Cache: Implemented an L0 cache that stores full tokenization results for identical input strings, providing fast lookups for repeated prompts and common system messages. It is configured by l0_max_entries (default: 10,000 entries).
  • L1 Fixed-Boundary Prefix Cache: Developed an L1 cache designed to reuse tokenization results for shared prefixes, particularly beneficial for chat conversations with common system prompts or history. It uses Blake3 hashing, fixed 128-byte boundaries, and LRU eviction, configurable by l1_max_memory (default: 50MB) and l1_granularity (default: 128 bytes).
  • Opt-In Design and Configuration: Both L0 and L1 caches are disabled by default to maintain backward compatibility. They can be enabled and configured via new CLI arguments, Python bindings, or configuration files, ensuring flexibility for users.
  • Performance Benchmarks: Included new benchmarks demonstrating substantial performance improvements, with L1 cache achieving 70-85% hit rates for chat templates and 2-6x speedup for requests with cached prefixes in production-scale scenarios.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a two-level tokenizer caching system (L0 for exact matches, L1 for prefixes) to enhance performance by reducing redundant tokenization. The implementation is well-structured, with new configurations, CLI arguments, and extensive benchmarks demonstrating significant speedups.

My review focuses on a critical correctness issue in the L1 cache logic that could lead to silent tokenization errors, along with suggestions for improving cache eviction performance and fixing minor issues in the benchmarks and Python argument help text. Overall, this is a valuable performance enhancement, but the L1 cache's correctness needs to be addressed before it's enabled in production.

@slin1237
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-designed two-level tokenizer caching system, which is a significant performance enhancement. The implementation is thorough, with new configurations, validation, and extensive benchmarks. The L0 exact-match cache and L1 prefix cache are both valuable additions. The code is generally high quality. My review focuses on a few areas for improvement regarding correctness, performance, and maintainability.

@slin1237 slin1237 force-pushed the tokenizer branch 2 times, most recently from 050fd78 to a526723 on October 17, 2025 02:07
@slin1237
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a well-designed and configurable two-level tokenizer caching system (L0 for exact matches, L1 for prefix matches), which promises significant performance improvements for tokenization operations. The implementation is robust, using concurrent data structures and sound caching strategies. The addition of comprehensive benchmarks is excellent for validating the performance gains. The configuration options are cleanly integrated into both the CLI and Python arguments.

I've identified a couple of minor opportunities for refactoring to improve code maintainability and reduce redundancy, particularly in the new benchmark suite and the cache implementation. Overall, this is a high-quality contribution that will bring tangible performance benefits.

@CatherineSue
Collaborator

Left: CachedTokenizer vs Right: Non-cached
Model: Llama-4-Maverick-FP8
Hardware: 8*H100
Benchmark: lm_eval mmlu_pro
[Screenshot: lm_eval mmlu_pro results, CachedTokenizer (left) vs non-cached (right), 2025-10-18]

@slin1237 slin1237 merged commit a7ae61e into main Oct 19, 2025
34 checks passed
@slin1237 slin1237 deleted the tokenizer branch October 19, 2025 01:33