57 commits
b2eb2b5
[Kernel] Apply torch.Tag.needs_fixed_stride_order only for torch==2.6…
zou3519 Jul 18, 2025
0f199f1
[Core] Avoid KVCacheBlock.__eq__ invocations in FreeKVCacheBlockQueue…
JialinOuyang-Meta Jul 18, 2025
5782581
[Bugfix] Voxtral on Blackwell GPUs (RTX 50 series) (#21077)
hax0r31337 Jul 18, 2025
2179372
Elastic Expert Parallel Initial Support (#20775)
ruisearch42 Jul 19, 2025
466e878
[Quantization] Enable BNB support for more MoE models (#21100)
jeejeelee Jul 19, 2025
9a9fda1
[Core] Support Local Chunked Attention for Hybrid KV Cache (#19351)
luccafong Jul 19, 2025
9ffe905
[Bugfix][Model] Fix LoRA for Mistral-Small-3.1-24B-Instruct-2503 (#21…
varun-sundar-rabindranath Jul 19, 2025
dd572c0
[V0 Deprecation] Remove V0 Spec Decode workers (#21152)
WoosukKwon Jul 19, 2025
dcc6cfb
[Kernel][Performance] Tweak MoE Batched silu_mul_fp8_quant_deep_gemm …
varun-sundar-rabindranath Jul 19, 2025
468e240
[BugFix][CPU] Fix `TorchSDPABackendImpl` doesn't have `use_irope` (#…
LucasWilkinson Jul 19, 2025
37bd8d6
[Bug] DeepGemm: Fix TypeError: per_block_cast_to_fp8() missing 1 requ…
yewentao256 Jul 19, 2025
3e04107
[Model] EXAONE 4.0 model support (#21060)
Deepfocused Jul 19, 2025
3a2cb26
[Misc][Tools][Benchmark] Add readme file for auto_tune script (#20779)
Chenyaaang Jul 19, 2025
cf8cc32
Fix a couple of Voxtral tests (#21218)
huydhn Jul 19, 2025
1eaff27
[V0 deprecation] Remove long context LoRA (#21169)
jeejeelee Jul 19, 2025
18e519e
[Bugfix] Fix ndarray video color from VideoAsset (#21064)
Isotr0py Jul 19, 2025
59f9353
[BugFix] Fix potential cuda-graph IMA (#21196)
LucasWilkinson Jul 19, 2025
7d94577
Add torch golden impl for moe_align_block_size kernel test (#20653)
shixianc Jul 19, 2025
6d0734c
[NVIDIA] Add SM100 Flashinfer MoE blockscale fp8 backend for low late…
kaixih Jul 19, 2025
b3d8210
[Bugfix][Frontend] Fix openai CLI arg `middleware` (#21220)
22quinn Jul 19, 2025
e3a0e43
[bugfix] Fix auto thread-binding when world_size > 1 in CPU backend a…
bigPYJ1151 Jul 19, 2025
c81259d
Fix/remove some broken model executor tests (#21224)
rabi Jul 19, 2025
da6579b
[CI/CD][bugfix]fix: error argument to loads has incompatible type (#2…
llsj14 Jul 19, 2025
6a971ed
[Docs] Update the link to the 'Prometheus/Grafana' example (#21225)
1195343015 Jul 19, 2025
9f414a1
[BugFix] Make PD work with Ray (#21072)
kouroshHakha Jul 19, 2025
881e3cb
[V1] [Hybrid] Enable piecewise CUDA Graph for mamba layers (#21194)
tdoublep Jul 19, 2025
752c6ad
[V0 Deprecation] Deprecate BlockSparse Attention & Phi3-Small (#21217)
WoosukKwon Jul 19, 2025
2e8cbb5
[BugFix] Fix full cuda graph slot_mapping (#21228)
fhl2000 Jul 19, 2025
10eb24c
GLM-4 Update (#20736)
zRzRzRzRzRzRzR Jul 19, 2025
2b504eb
[Docs] [V1] Update docs to remove enforce_eager limitation for hybrid…
tdoublep Jul 19, 2025
3a1d894
[TPU] support fp8 kv cache quantization (#19292)
yaochengji Jul 20, 2025
d1fb65b
Enable v1 metrics tests (#20953)
eicherseiji Jul 20, 2025
51ba839
[Model] use AutoWeightsLoader for bart (#18299)
calvin0327 Jul 20, 2025
9499e26
[Model] Support VLMs with transformers backend (#20543)
zucchini-nlp Jul 20, 2025
7ba34b1
[bugfix] fix syntax warning caused by backslash (#21251)
1195343015 Jul 20, 2025
8188196
[CI] Cleanup modelscope version constraint in Dockerfile (#21243)
yankay Jul 21, 2025
92615d7
[Docs] Add RFC Meeting to Issue Template (#21279)
simon-mo Jul 21, 2025
940af1f
Add the instruction to run e2e validation manually before release (#2…
huydhn Jul 21, 2025
378d33c
[Bugfix] Fix missing placeholder in logger debug (#21280)
DarkLight1337 Jul 21, 2025
042af0c
[Model][1/N] Support multiple poolers at model level (#21227)
DarkLight1337 Jul 21, 2025
be54a95
[Docs] Fix hardcoded links in docs (#21287)
hmellor Jul 21, 2025
e6b90a2
[Docs] Make tables more space efficient in `supported_models.md` (#21…
hmellor Jul 21, 2025
d978410
[Misc] unify variable for LLM instance (#20996)
andyxning Jul 21, 2025
6b46c4b
Add Nvidia ModelOpt config adaptation (#19815)
Edwardf0t1 Jul 21, 2025
6dda13c
[Misc] Add sliding window to flashinfer test (#21282)
WoosukKwon Jul 21, 2025
a15a50f
[CPU] Enable shared-memory based pipeline parallel for CPU backend (#…
bigPYJ1151 Jul 21, 2025
a0e827e
[BugFix] make utils.current_stream thread-safety (#21252) (#21253)
simpx Jul 21, 2025
6ece16c
[Misc] Add dummy maverick test (#21199)
minosfuture Jul 21, 2025
304dce7
[Attention] Clean up iRoPE in V1 (#21188)
LucasWilkinson Jul 21, 2025
29d1ffc
[DP] Fix Prometheus Logging (#21257)
robertgshaw2-redhat Jul 21, 2025
8b296c3
docs: Update docs article with usage patterns
sangstar Jul 10, 2025
110a6fd
fix: Rename headings and move content around
willgoldby Jul 11, 2025
480e84e
fix: Add title for Tensorizer configuration
willgoldby Jul 15, 2025
b2efb9f
docs: Update example file docstring
sangstar Jul 18, 2025
c1ba86c
docs: Revert original markdown title
sangstar Jul 18, 2025
dfe2850
style: Run linter
sangstar Jul 21, 2025
c67e148
docs: Resolve suggested changes from review
sangstar Jul 23, 2025
1 change: 0 additions & 1 deletion .buildkite/scripts/hardware_ci/run-amd-test.sh
Original file line number Diff line number Diff line change
@@ -108,7 +108,6 @@ fi
if [[ $commands == *" kernels/attention"* ]]; then
commands="${commands} \
--ignore=kernels/attention/test_attention_selector.py \
--ignore=kernels/attention/test_blocksparse_attention.py \
--ignore=kernels/attention/test_encoder_decoder_attn.py \
--ignore=kernels/attention/test_flash_attn.py \
--ignore=kernels/attention/test_flashinfer.py \
18 changes: 9 additions & 9 deletions .buildkite/scripts/hardware_ci/run-cpu-test.sh
@@ -6,6 +6,7 @@ set -ex

# allow to bind to different cores
CORE_RANGE=${CORE_RANGE:-48-95}
# used for TP/PP E2E test
OMP_CORE_RANGE=${OMP_CORE_RANGE:-48-95}
NUMA_NODE=${NUMA_NODE:-1}

@@ -24,8 +25,8 @@ numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --tag cpu-test-"$NUMA_NODE
numactl -C "$CORE_RANGE" -N "$NUMA_NODE" docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" --tag cpu-test-"$NUMA_NODE"-avx2 --target vllm-test -f docker/Dockerfile.cpu .

# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_OMP_THREADS_BIND="$OMP_CORE_RANGE" --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_OMP_THREADS_BIND="$OMP_CORE_RANGE" --env VLLM_CPU_CI_ENV=1 --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE" cpu-test-"$NUMA_NODE"
docker run -itd --cpuset-cpus="$CORE_RANGE" --cpuset-mems="$NUMA_NODE" --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --privileged=true -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --env VLLM_CPU_CI_ENV=1 -e E2E_OMP_THREADS="$OMP_CORE_RANGE" --shm-size=4g --name cpu-test-"$NUMA_NODE"-avx2 cpu-test-"$NUMA_NODE"-avx2

function cpu_tests() {
set -e
@@ -78,17 +79,16 @@ function cpu_tests() {
# tests/quantization/test_ipex_quant.py"

# online serving
docker exec cpu-test-"$NUMA_NODE" bash -c "
docker exec cpu-test-"$NUMA_NODE" bash -c '
set -e
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m --dtype half &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
VLLM_CPU_CI_ENV=0 python3 benchmarks/benchmark_serving.py \
VLLM_CPU_OMP_THREADS_BIND=$E2E_OMP_THREADS VLLM_CPU_SGL_KERNEL=1 vllm serve meta-llama/Llama-3.2-3B-Instruct -tp=2 -pp=2 &
timeout 600 bash -c "until curl localhost:8000/v1/models; do sleep 1; done" || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--model meta-llama/Llama-3.2-3B-Instruct \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"
--endpoint /v1/completions'

# Run multi-lora tests
docker exec cpu-test-"$NUMA_NODE" bash -c "
15 changes: 1 addition & 14 deletions .buildkite/test-pipeline.yaml
@@ -159,7 +159,6 @@ steps:
- tests/distributed/test_utils
- tests/distributed/test_pynccl
- tests/distributed/test_events
- tests/spec_decode/e2e/test_integration_dist_tp4
- tests/compile/test_basic_correctness
- examples/offline_inference/rlhf.py
- examples/offline_inference/rlhf_colocate.py
@@ -182,7 +181,6 @@ steps:
- pytest -v -s compile/test_basic_correctness.py
- pytest -v -s distributed/test_pynccl.py
- pytest -v -s distributed/test_events.py
- pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py
# TODO: create a dedicated test section for multi-GPU example tests
# when we have multiple distributed example tests
- pushd ../examples/offline_inference
@@ -266,6 +264,7 @@ steps:
- pytest -v -s v1/structured_output
- pytest -v -s v1/spec_decode
- pytest -v -s v1/kv_connector/unit
- pytest -v -s v1/metrics
- pytest -v -s v1/test_serial_utils.py
- pytest -v -s v1/test_utils.py
- pytest -v -s v1/test_oracle.py
@@ -330,17 +329,6 @@ steps:
- pytest -v -s samplers
- VLLM_USE_FLASHINFER_SAMPLER=1 pytest -v -s samplers

- label: Speculative decoding tests # 40min
mirror_hardwares: [amdexperimental]
source_file_dependencies:
- vllm/spec_decode
- tests/spec_decode
- vllm/model_executor/models/eagle.py
commands:
- pytest -v -s spec_decode/e2e/test_multistep_correctness.py
- VLLM_ATTENTION_BACKEND=FLASH_ATTN pytest -v -s spec_decode --ignore=spec_decode/e2e/test_multistep_correctness.py --ignore=spec_decode/e2e/test_mtp_correctness.py
- pytest -v -s spec_decode/e2e/test_eagle_correctness.py

- label: LoRA Test %N # 15min each
mirror_hardwares: [amdexperimental, amdproduction]
source_file_dependencies:
@@ -726,7 +714,6 @@ steps:
- pytest -v -s distributed/test_sequence_parallel.py
# this test fails consistently.
# TODO: investigate and fix
# - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
- VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
- VLLM_USE_V1=0 CUDA_VISIBLE_DEVICES=0,1 pytest -v -s kv_transfer/test_disagg.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s v1/shutdown
1 change: 0 additions & 1 deletion .github/CODEOWNERS
@@ -43,7 +43,6 @@ CMakeLists.txt @tlrmchlsmth @LucasWilkinson
/tests/multimodal @DarkLight1337 @ywang96
/tests/prefix_caching @comaniac @KuntaiDu
/tests/quantization @mgoin @robertgshaw2-redhat
/tests/spec_decode @njhill @LiuXiaoxuanPKU
/tests/test_inputs.py @DarkLight1337 @ywang96
/tests/v1/entrypoints/llm/test_struct_output_generate.py @mgoin @russellb @aarnphm
/tests/v1/structured_output @mgoin @russellb @aarnphm
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/750-RFC.yml
@@ -46,7 +46,7 @@ body:
- type: markdown
attributes:
value: >
Thanks for contributing 🎉!
Thanks for contributing 🎉! The vLLM core team hosts a biweekly RFC review session at 9:30 AM Pacific Time. While most RFCs can be discussed online, you can optionally sign up for a slot to discuss your RFC online [here](https://docs.google.com/document/d/1CiLVBZeIVfR7_PNAKVSusxpceywkoOOB78qoWqHvSZc/edit).
- type: checkboxes
id: askllm
attributes:
3 changes: 0 additions & 3 deletions .github/mergify.yml
@@ -164,10 +164,7 @@ pull_request_rules:
description: Automatically apply speculative-decoding label
conditions:
- or:
- files~=^vllm/spec_decode/
- files~=^vllm/v1/spec_decode/
- files=vllm/model_executor/layers/spec_decode_base_sampler.py
- files~=^tests/spec_decode/
- files~=^tests/v1/spec_decode/
- files~=^examples/.*(spec_decode|mlpspeculator|eagle|speculation).*\.py
- files~=^vllm/model_executor/models/.*eagle.*\.py
33 changes: 33 additions & 0 deletions RELEASE.md
@@ -52,3 +52,36 @@ After branch cut, we approach finalizing the release branch with clear criteria
* Release branch specific changes (e.g. change version identifiers or CI fixes)

Please note: **No feature work allowed for cherry picks**. All PRs that are considered for cherry-picks need to be merged on trunk; the only exception is Release branch specific changes.

## Manual validations

### E2E Performance Validation

Before each release, we perform end-to-end performance validation to ensure no regressions are introduced. This validation uses the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) on PyTorch CI.

**Current Coverage:**
* Models: Llama3, Llama4, and Mixtral
* Hardware: NVIDIA H100 and AMD MI300x
* *Note: Coverage may change based on new model releases and hardware availability*

**Performance Validation Process:**

**Step 1: Get Access**
Request write access to the [pytorch/pytorch-integration-testing](https://github.com/pytorch/pytorch-integration-testing) repository to run the benchmark workflow.

**Step 2: Review Benchmark Setup**
Familiarize yourself with the benchmark configurations:
* [CUDA setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/cuda)
* [ROCm setup](https://github.com/pytorch/pytorch-integration-testing/tree/main/vllm-benchmarks/benchmarks/rocm)

**Step 3: Run the Benchmark**
Navigate to the [vllm-benchmark workflow](https://github.com/pytorch/pytorch-integration-testing/actions/workflows/vllm-benchmark.yml) and configure:
* **vLLM branch**: Set to the release branch (e.g., `releases/v0.9.2`)
* **vLLM commit**: Set to the RC commit hash

**Step 4: Review Results**
Once the workflow completes, benchmark results will be available on the [vLLM benchmark dashboard](https://hud.pytorch.org/benchmark/llms?repoName=vllm-project%2Fvllm) under the corresponding branch and commit.

**Step 5: Performance Comparison**
Compare the current results against the previous release to verify no performance regressions have occurred. Here is an
example of [v0.9.1 vs v0.9.2](https://hud.pytorch.org/benchmark/llms?startTime=Thu%2C%2017%20Apr%202025%2021%3A43%3A50%20GMT&stopTime=Wed%2C%2016%20Jul%202025%2021%3A43%3A50%20GMT&granularity=week&lBranch=releases/v0.9.1&lCommit=b6553be1bc75f046b00046a4ad7576364d03c835&rBranch=releases/v0.9.2&rCommit=a5dd03c1ebc5e4f56f3c9d3dc0436e9c582c978f&repoName=vllm-project%2Fvllm&benchmarkName=&modelName=All%20Models&backendName=All%20Backends&modeName=All%20Modes&dtypeName=All%20DType&deviceName=All%20Devices&archName=All%20Platforms).
137 changes: 137 additions & 0 deletions benchmarks/auto_tune/README.md
@@ -0,0 +1,137 @@
# Automated vLLM Server Parameter Tuning

This script automates the process of finding the optimal server parameter combination (`max-num-seqs` and `max-num-batched-tokens`) to maximize throughput for a vLLM server. It also supports additional constraints such as E2E latency and prefix cache hit rate.

## Table of Contents
- [Prerequisites](#prerequisites)
- [Configuration](#configuration)
- [How to Run](#how-to-run)
- [Example Use Cases](#example-use-cases)
- [Output](#output)
- [How It Works](#how-it-works)

## Prerequisites

Before running the script, please ensure the following steps are completed:

1. **Clone vLLM & Set Up Branch**: Clone the vLLM repository and check out your desired branch.

```bash
git clone https://github.com/vllm-project/vllm.git
cd vllm
# git checkout <your-branch>
```

2. **Install Environment**: Install or update the correct running environment. For TPU usage, activate your `conda` environment and install the corresponding `torch` and `torch_xla` versions.

3. **Model Configuration**: If you are using a customized model, ensure its configuration files are correctly placed and accessible.

## Configuration

You must set the following variables at the top of the script before execution.

| Variable | Description | Example Value |
| --- | --- | --- |
| `BASE` | **Required.** The absolute path to the parent directory of your vLLM repository directory. | `"$HOME"` |
| `MODEL` | **Required.** The Hugging Face model identifier to be served by vLLM. | `"meta-llama/Llama-3.1-8B-Instruct"` |
| `SYSTEM`| **Required.** The hardware you are running on. Choices: `TPU` or `GPU`. (Saving profiles might not be supported on other systems.) | `"TPU"` |
| `TP` | **Required.** The tensor-parallelism size. | `1` |
| `DOWNLOAD_DIR` | **Required.** Directory to download and load model weights from. | `""` (default download path) |
| `INPUT_LEN` | **Required.** Request input length. | `4000` |
| `OUTPUT_LEN` | **Required.** Request output length. | `16` |
| `MIN_CACHE_HIT_PCT` | Prefix cache hit rate in percentage (0-100). Set to `0` to disable. | `60` |
| `MAX_LATENCY_ALLOWED_MS` | The maximum allowed P99 end-to-end latency in milliseconds. Set to a very large number (e.g., `100000000000`) to effectively ignore the latency constraint. | `500` |
| `NUM_SEQS_LIST` | A space-separated string of `max-num-seqs` values to test. | `"128 256"` |
| `NUM_BATCHED_TOKENS_LIST` | A space-separated string of `max-num-batched-tokens` values to test. | `"1024 2048 4096"` |

**Note**: The default `NUM_SEQS_LIST` and `NUM_BATCHED_TOKENS_LIST` are set for medium-sized inputs/outputs. For very short contexts (e.g., 20 input, 20 output tokens), you may need to test larger values for `max-num-seqs`.

## How to Run

1. **Configure**: Edit the script and set the variables in the [Configuration](#configuration) section.
2. **Execute**: Run the script. Since the process can take a long time, it is highly recommended to use a terminal multiplexer like `tmux` or `screen` to prevent the script from stopping if your connection is lost.

```bash
cd <FOLDER_OF_THIS_SCRIPT>
bash auto_tune.sh
```

Please note that the command used to launch the script must not contain the keyword `vllm` anywhere in its full or partial path. Otherwise, the script's `pkill -f vllm` command will also kill the script itself.
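
The reason is that `pkill -f` matches its pattern against each process's full command line rather than only the process name. The sketch below approximates that rule in Python; `would_be_killed` is a hypothetical helper written for illustration only (it is not part of `auto_tune.sh`), and `re.search` merely stands in for `pkill`'s own pattern matching.

```python
import re


def would_be_killed(cmdline: str, pattern: str = "vllm") -> bool:
    """Approximate `pkill -f <pattern>`: search the pattern anywhere
    in the process's full command line."""
    return re.search(pattern, cmdline) is not None


# Launched through a path that contains "vllm": the script matches itself.
print(would_be_killed("bash /home/user/vllm/benchmarks/auto_tune/auto_tune.sh"))  # True

# Launched from inside the script's folder: no match.
print(would_be_killed("bash auto_tune.sh"))  # False
```

This is why the instructions above `cd` into the script's folder first and then call `bash auto_tune.sh` without any path.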

## Example Use Cases

Here are a few examples of how to configure the script for different goals:

### 1. Maximize Throughput (No Latency Constraint)
- **Goal**: Find the best `max-num-seqs` and `max-num-batched-tokens` to get the highest possible throughput for 1800 input tokens and 20 output tokens.
- **Configuration**:

```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=100000000000 # A very large number
```

### 2. Maximize Throughput with a Latency Requirement
- **Goal**: Find the best server parameters when P99 end-to-end latency must be below 500ms.
- **Configuration**:

```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MIN_CACHE_HIT_PCT=0
MAX_LATENCY_ALLOWED_MS=500
```

### 3. Maximize Throughput with Prefix Caching and Latency Requirements
- **Goal**: Find the best server parameters assuming a 60% prefix cache hit rate and a latency requirement of 500ms.
- **Configuration**:

```bash
INPUT_LEN=1800
OUTPUT_LEN=20
MIN_CACHE_HIT_PCT=60
MAX_LATENCY_ALLOWED_MS=500
```

## Output

After the script finishes, you will find the results in a new, timestamped directory created inside `$BASE/auto-benchmark/`.

- **Log Files**: The directory (`$BASE/auto-benchmark/YYYY_MM_DD_HH_MM/`) contains detailed logs for each run:
- `vllm_log_...txt`: The log output from the vLLM server for each parameter combination.
- `bm_log_...txt`: The log output from the `benchmark_serving.py` script for each benchmark run.

- **Final Result Summary**: A file named `result.txt` is created in the log directory. It contains a summary of each tested combination and concludes with the overall best parameters found.

```
# Example result.txt content
hash:a1b2c3d4...
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8
max_num_seqs: 128, max_num_batched_tokens: 4096 does not meet latency requirement 500
...
best_max_num_seqs: 256, best_num_batched_tokens: 2048, best_throughput: 12.5, profile saved in: /home/user/vllm/auto-benchmark/2024_08_01_10_30/profile
```

If the script cannot find a valid parameter combination, the final row will be `best_max_num_seqs: 0, best_num_batched_tokens: 0, best_throughput: 0`. This can happen either because the server did not start properly or because the latency requirement is too strict.

- **Profiler Trace**: A directory named `profile` is created inside the log directory. It contains the profiler trace file (e.g., `.xplane.pb` for TPU or a `.json` trace for GPU) from the single best-performing run.
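
If you want to consume `result.txt` from another tool, a small parser along the following lines may be enough. This is a sketch that assumes the final line keeps the exact field names and order shown in the example above; `parse_best` and the sample text are illustrative and not part of the script.

```python
import re

# Field names and order are taken from the example result.txt content.
BEST_LINE = re.compile(
    r"best_max_num_seqs: (\d+), "
    r"best_num_batched_tokens: (\d+), "
    r"best_throughput: ([\d.]+)"
)


def parse_best(text: str):
    """Return (max_num_seqs, num_batched_tokens, throughput), or None
    when no best line exists or it reports the all-zero result."""
    match = None
    for line in text.splitlines():
        found = BEST_LINE.search(line)
        if found:
            match = found  # keep the last matching line
    if match is None:
        return None
    seqs, tokens, throughput = int(match[1]), int(match[2]), float(match[3])
    if seqs == 0 and tokens == 0:
        return None  # the script found no valid combination
    return seqs, tokens, throughput


sample = """hash:a1b2c3d4
max_num_seqs: 128, max_num_batched_tokens: 2048, request_rate: 10.0, e2el: 450.5, throughput: 9.8, goodput: 9.8
best_max_num_seqs: 256, best_num_batched_tokens: 2048, best_throughput: 12.5, profile saved in: /tmp/profile
"""
print(parse_best(sample))  # (256, 2048, 12.5)
```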

## How It Works

The script follows a systematic process to find the optimal parameters:

1. **Find Max GPU Memory Utilization**: The script first determines the highest safe `gpu-memory-utilization` (starting from 0.98 and decreasing) that does not cause an Out-Of-Memory (OOM) error when launching the server. This ensures the benchmark runs use the maximum available memory without crashing.

2. **Iterate and Benchmark**: It then enters a nested loop, iterating through every combination of `max-num-seqs` and `max-num-batched-tokens` provided in the configuration lists.

3. **Latency-Aware Throughput Search**: For each parameter combination:
- The vLLM server is started.
- A benchmark is first run with an infinite request rate (`--request-rate inf`).
- If the resulting P99 E2E latency is within the `MAX_LATENCY_ALLOWED_MS` limit, this throughput is considered the maximum for this configuration.
- If the latency is too high, the script performs a search by iteratively decreasing the request rate until the latency constraint is met. This finds the highest sustainable throughput for the given parameters and latency requirement.

4. **Track Best Result**: Throughout the process, the script tracks the parameter combination that has yielded the highest valid throughput so far.

5. **Profile Collection**: For the best-performing run, the script saves the vLLM profiler output, which can be used for deep-dive performance analysis with tools like TensorBoard.
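
Steps 2 to 4 can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the script's actual implementation: the `run` callback is a hypothetical stand-in for "start the server and run `benchmark_serving.py`", the request rate is stepped down by 1 per iteration (the real step size may differ), and `toy_benchmark` uses invented numbers purely to make the example runnable.

```python
from itertools import product
from typing import Callable, Optional, Tuple

# Hypothetical hook: (max_num_seqs, max_num_batched_tokens, request_rate)
# -> (p99_e2e_latency_ms, throughput).
Benchmark = Callable[[int, int, float], Tuple[float, float]]


def best_for_combo(run: Benchmark, seqs: int, tokens: int,
                   max_latency_ms: float) -> Optional[float]:
    """Highest throughput for one combination that meets the latency limit."""
    latency, throughput = run(seqs, tokens, float("inf"))
    if latency <= max_latency_ms:
        return throughput
    # Assumption: start at the unconstrained throughput and lower the
    # request rate by 1 until the latency constraint is met.
    rate = float(int(throughput))
    while rate > 0:
        latency, throughput = run(seqs, tokens, rate)
        if latency <= max_latency_ms:
            return throughput
        rate -= 1
    return None


def tune(run: Benchmark, num_seqs_list: str, num_batched_tokens_list: str,
         max_latency_ms: float) -> Tuple[int, int, float]:
    """Nested loop over both space-separated lists, tracking the best result."""
    best = (0, 0, 0.0)  # mirrors the all-zero "nothing found" row
    grid = product(map(int, num_seqs_list.split()),
                   map(int, num_batched_tokens_list.split()))
    for seqs, tokens in grid:
        throughput = best_for_combo(run, seqs, tokens, max_latency_ms)
        if throughput is not None and throughput > best[2]:
            best = (seqs, tokens, throughput)
    return best


def toy_benchmark(seqs: int, tokens: int, rate: float) -> Tuple[float, float]:
    # Invented model: latency jumps once the rate exceeds 80% of capacity.
    capacity = min(seqs / 16, tokens / 256)
    served = min(rate, capacity)
    latency = 400.0 if rate <= capacity * 0.8 else 900.0
    return latency, served


print(tune(toy_benchmark, "128 256", "1024 2048 4096", 500))  # (256, 4096, 12.0)
```

Step 1 (probing `gpu-memory-utilization`) and step 5 (profile collection) are left out of the sketch because they depend on launching a real server.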
31 changes: 1 addition & 30 deletions benchmarks/auto_tune.sh → benchmarks/auto_tune/auto_tune.sh
@@ -1,36 +1,7 @@
#!/bin/bash

# This script aims to tune the best server parameter combinations to maximize throughput for given requirement.
# The current server parameter combination is max_num_seqs and max_num_batched_tokens
# It also supports additional requirement: e2e latency and prefix cache.

# Pre-requisite:
# 1. Checkout to your branch, install/ update the correct running env. For TPU, activate conda env and install the corresponding torch, xla version.
# 2. If the model is customized, replace the MODEL's config with the customized config.
# 3. Set variables (ALL REQUIRED)
# BASE: your directory for vllm repo
# MODEL: the model served by vllm
# SYSTEM: the hardware, choice TPU or GPU, for other systems, "get best profile" might not support.
# TP: ways of tensor parallelism
# DOWNLOAD_DIR: directory to download and load model weights.
# INPUT_LEN: request input len
# OUTPUT_LEN: request output len
# MIN_CACHE_HIT_PCT: prefix cache rate
# MAX_LATENCY_ALLOWED_MS: (e2e) latency requirement. If there's no latency requirement, set it to a large number like 1000000000
# NUM_SEQS_LIST: a list of `max-num-seqs` you want to loop with.
# NUM_BATCHED_TOKENS_LIST: a list of `max-num-batched-tokens` you want to loop with.
# Note that the default NUM_SEQS_LIST and NUM_BATCHED_TOKENS_LIST are set for medium size input/output len, for extra short context (such as 20:20), you might need to include larger numbers in NUM_SEQS_LIST.
# 4. Run the script, it might take a long time, you can use tmux to avoid the script stop if disconnection happens.
# 5. The final result will be saved in RESULT file.


# Example use cases
# 1. Given input_len=1800, output_len=20, what's the best max_num_seqs and max_num_batched_tokens to get highest throughput?
# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=0, MAX_LATENCY_ALLOWED_MS=100000000000
# 2. If we have latency requirement to be lower than 500ms, what's the best server parameter?
# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=0, MAX_LATENCY_ALLOWED_MS=500
# 3. If we want to reach 60% prefix cache, what's the best server parameter?
# Use INPUT_LEN=1800, OUTPUT_LEN=20, MIN_CACHE_HIT_PCT=60, MAX_LATENCY_ALLOWED_MS=500
# See details in README (benchmarks/auto_tune/README.md).

TAG=$(date +"%Y_%m_%d_%H_%M")
BASE=""
6 changes: 5 additions & 1 deletion benchmarks/kernels/benchmark_moe.py
@@ -576,7 +576,11 @@ def main(args: argparse.Namespace):
topk = config.num_experts_per_tok
intermediate_size = config.intermediate_size
shard_intermediate_size = 2 * intermediate_size // args.tp_size
elif config.architectures[0] in ("DeepseekV3ForCausalLM", "DeepseekV2ForCausalLM"):
elif config.architectures[0] in (
"DeepseekV3ForCausalLM",
"DeepseekV2ForCausalLM",
"Glm4MoeForCausalLM",
):
E = config.n_routed_experts
topk = config.num_experts_per_tok
intermediate_size = config.moe_intermediate_size
1 change: 1 addition & 0 deletions benchmarks/kernels/benchmark_moe_permute_unpermute.py
@@ -318,6 +318,7 @@ def main(args: argparse.Namespace):
elif (
config.architectures[0] == "DeepseekV3ForCausalLM"
or config.architectures[0] == "DeepseekV2ForCausalLM"
or config.architectures[0] == "Glm4MoeForCausalLM"
):
E = config.n_routed_experts
topk = config.num_experts_per_tok