[NPUW]Implement prefix caching. #31669


Open

intelgaoxiong wants to merge 35 commits into openvinotoolkit:master from intelgaoxiong:xiong/prefix_caching

+1,328 −155

Contributor

intelgaoxiong commented Aug 11, 2025 (edited)

Details:

This PR implements prefix caching in NPUW.
In existing inference engines, the KV cache of a request is discarded once processing completes, so it cannot be reused across calls and shared prompt prefixes are recomputed every time, which significantly slows down execution.
https://arxiv.org/pdf/2312.07104 proposes a technique to reuse the KV cache automatically across multiple generation calls.
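
For readers unfamiliar with the idea, here is a minimal sketch of block-based prefix matching. It is not the code in this PR: `KVBlock` here is a placeholder that only stores token ids, and `PrefixCache`, `hash_block`, `store`, and `match` are hypothetical names chosen for illustration. The sketch assumes the prompt is split into fixed-size blocks, each block is hashed together with its parent block's hash into a `uint64_t`, and a lookup walks the chain until the first miss. Eviction and the actual KV tensors are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for a cached block; a real one would hold key/value tensors.
struct KVBlock {
    std::vector<int64_t> tokens;  // token ids covered by this block
};

// FNV-1a style hash of one block of token ids, seeded with the parent block's hash
// so that a block's key depends on the whole prefix before it.
uint64_t hash_block(uint64_t parent_hash, const int64_t* tokens, std::size_t count) {
    uint64_t h = 14695981039346656037ull ^ parent_hash;
    for (std::size_t i = 0; i < count; ++i) {
        h ^= static_cast<uint64_t>(tokens[i]);
        h *= 1099511628211ull;
    }
    return h;
}

class PrefixCache {
public:
    explicit PrefixCache(std::size_t block_size) : m_block_size(block_size) {
        assert(block_size > 0);
    }

    // Record every full block of the prompt; a trailing partial block is skipped.
    void store(const std::vector<int64_t>& prompt) {
        uint64_t parent = 0;
        for (std::size_t pos = 0; pos + m_block_size <= prompt.size(); pos += m_block_size) {
            parent = hash_block(parent, prompt.data() + pos, m_block_size);
            KVBlock block;
            block.tokens.assign(prompt.begin() + pos, prompt.begin() + pos + m_block_size);
            m_blocks.emplace(parent, std::move(block));  // no-op if the key already exists
        }
    }

    // Number of leading prompt tokens covered by cached blocks.
    std::size_t match(const std::vector<int64_t>& prompt) const {
        uint64_t parent = 0;
        std::size_t matched = 0;
        for (std::size_t pos = 0; pos + m_block_size <= prompt.size(); pos += m_block_size) {
            parent = hash_block(parent, prompt.data() + pos, m_block_size);
            auto it = m_blocks.find(parent);
            // Compare the block's own tokens as a cheap guard against hash collisions.
            if (it == m_blocks.end() ||
                !std::equal(it->second.tokens.begin(), it->second.tokens.end(),
                            prompt.begin() + pos)) {
                break;
            }
            matched = pos + m_block_size;
        }
        return matched;
    }

    std::size_t size() const {
        return m_blocks.size();
    }

private:
    std::size_t m_block_size;
    std::unordered_map<uint64_t, KVBlock> m_blocks;
};
```

With a block size of 4, storing a 10-token prompt records two blocks, and a later prompt that shares its first 8 tokens matches 8 tokens, so only the remainder would need to be prefilled. Chaining the parent hash into each block's key is one common way to make a block hit imply a hit on the whole prefix before it; whether this PR uses the same scheme is not stated in the description.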

Tickets:

EISW-177126

github-actions bot added category: NPU category: NPUW labels

intelgaoxiong force-pushed the xiong/prefix_caching branch 4 times, most recently from a7a59db to 48f060e Compare

August 13, 2025 07:03

Contributor Author

intelgaoxiong commented Aug 13, 2025

build_jenkins

intelgaoxiong marked this pull request as ready for review

August 13, 2025 09:11

intelgaoxiong requested review from a team as code owners

August 13, 2025 09:11

intelgaoxiong requested a review from dmatveev

August 13, 2025 09:12

intelgaoxiong force-pushed the xiong/prefix_caching branch 5 times, most recently from 719eac4 to c9ee5f6 Compare

August 22, 2025 02:08

smirnov-alexey reviewed

src/plugins/intel_npu/src/al/include/intel_npu/config/npuw.hpp (resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model.cpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_compiled_model.cpp (resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/serialization.hpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.hpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_infer_request.cpp (resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_prefix_caching.hpp (outdated, resolved)

smirnov-alexey reviewed

src/plugins/intel_npu/src/plugin/npuw/llm_prefix_caching.hpp (outdated, resolved)

intelgaoxiong added 30 commits

September 5, 2025 04:47


          Clang format.

24053c6

Signed-off-by: intelgaoxiong <[email protected]>


          Optimize algo: decrease hash calculation.

e0fc886

Signed-off-by: intelgaoxiong <[email protected]>


          Optimize algo: Create output to input name map.

af25f0c

Signed-off-by: intelgaoxiong <[email protected]>


          Add functions to restore and store cache.

ffa5b11

Signed-off-by: intelgaoxiong <[email protected]>


          Refine classes and functions.

20e49fd

Signed-off-by: intelgaoxiong <[email protected]>


          Refine cache print.

Signed-off-by: intelgaoxiong <[email protected]>


          Add option to toggle prefix caching.

32e202b

Signed-off-by: intelgaoxiong <[email protected]>


          Calculate and print block memory size.

a2b8a63

Signed-off-by: intelgaoxiong <[email protected]>


          Modify KVBlock to a class type.

Signed-off-by: intelgaoxiong <[email protected]>


          Fixed cache eviction.

64302d1

Signed-off-by: intelgaoxiong <[email protected]>


          Add get_block_unsafe.

0dcf30e

Signed-off-by: intelgaoxiong <[email protected]>


          Use uint64_t for hash.

36748c2

Signed-off-by: intelgaoxiong <[email protected]>


          Add options for block size and max block number.

a470090

Signed-off-by: intelgaoxiong <[email protected]>


          Variable rename in class.

d8971af

Signed-off-by: intelgaoxiong <[email protected]>


          Bug fix.

9ae0852

Signed-off-by: intelgaoxiong <[email protected]>


          Fixed cache eviction.

71e5f5a

Signed-off-by: intelgaoxiong <[email protected]>


          Efficient cache eviction.

d12c8e1

Signed-off-by: intelgaoxiong <[email protected]>


          More fix for cache eviction.

6081bc2

Signed-off-by: intelgaoxiong <[email protected]>


          Add unit test and bug fix.

319885b

Signed-off-by: intelgaoxiong <[email protected]>


          Make private variable.

be09f72

Signed-off-by: intelgaoxiong <[email protected]>


          Disable cache print.

3b6b2fc

Signed-off-by: intelgaoxiong <[email protected]>


          Fixed for CI check.

e7c1903

Signed-off-by: intelgaoxiong <[email protected]>


          Fixed some accuracy issues.

4facf02

Signed-off-by: intelgaoxiong <[email protected]>


          Solved review comments.

02080d6

Signed-off-by: intelgaoxiong <[email protected]>


          Register properties for prefix caching.

5cc8a13

Signed-off-by: intelgaoxiong <[email protected]>


          Refine log.

c58281d

Signed-off-by: intelgaoxiong <[email protected]>


          Fixed typo.

4bfd0e2

Signed-off-by: intelgaoxiong <[email protected]>


          Changed to LOG_VERB and const methods.

b4be394

Signed-off-by: intelgaoxiong <[email protected]>


          Decouple functions with LLMInferRequest and add function description.

18877de

Signed-off-by: intelgaoxiong <[email protected]>


          Remove default cache size.

3259df6

Signed-off-by: intelgaoxiong <[email protected]>


Labels

category: build category: NPU category: NPUW