
Conversation

@hjh0119 (Collaborator) commented on Sep 11, 2025

Optimize weight synchronization between the training model and the inference engine (vLLM):

LoRA

  1. Synchronize and load only the trained adapter weights (in both colocate and server modes).
  2. In server mode, transmit the adapter weights as a single flattened tensor to reduce the overhead of transmitting the parameters (a sketch of the packing step follows below).
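
The packing step can be illustrated with a minimal sketch (the helper names and metadata layout below are illustrative assumptions, not the PR's actual implementation): all adapter tensors are concatenated into one contiguous buffer that can be sent in a single communication call, alongside lightweight metadata used to rebuild the named tensors on the other side.

```python
from typing import Dict, List, Tuple

import torch


def flatten_adapter_weights(
    adapter_state: Dict[str, torch.Tensor],
) -> Tuple[torch.Tensor, List[Tuple[str, torch.Size, int]]]:
    """Pack all LoRA adapter tensors into one contiguous 1-D buffer.

    Assumes every adapter tensor shares the same dtype (typical for LoRA).
    Returns the flat buffer plus per-tensor metadata (name, shape, numel)
    so the receiver can slice and reshape it back into a state dict.
    """
    metadata = [(name, t.shape, t.numel()) for name, t in adapter_state.items()]
    flat = torch.cat([t.detach().reshape(-1) for t in adapter_state.values()])
    return flat, metadata


def unflatten_adapter_weights(
    flat: torch.Tensor,
    metadata: List[Tuple[str, torch.Size, int]],
) -> Dict[str, torch.Tensor]:
    """Rebuild the named adapter tensors from the flat buffer."""
    state: Dict[str, torch.Tensor] = {}
    offset = 0
    for name, shape, numel in metadata:
        state[name] = flat[offset:offset + numel].view(shape)
        offset += numel
    return state
```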

FULL

  1. Remove the original per-tensor synchronization logic and adopt a bucketing strategy that groups parameters into size-bounded buckets, reducing the number of communication requests and their overhead, especially for MoE models, which have many more tensors than dense models (a bucketing sketch follows below).
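
For the full-parameter path, a minimal bucketing sketch (the bucket size and helper name are illustrative assumptions, not the PR's actual implementation): parameters are grouped into size-bounded buckets so that each synchronization request carries many tensors at once instead of one request per tensor.

```python
from typing import Dict, Iterator, List, Tuple

import torch


def bucket_parameters(
    named_params: Dict[str, torch.Tensor],
    bucket_size_mb: int = 512,
) -> Iterator[List[Tuple[str, torch.Tensor]]]:
    """Group tensors into buckets of roughly `bucket_size_mb` megabytes.

    Instead of issuing one synchronization request per tensor (costly for MoE
    models with thousands of expert tensors), the sender transmits one request
    per bucket.
    """
    limit = bucket_size_mb * 1024 * 1024
    bucket: List[Tuple[str, torch.Tensor]] = []
    bucket_bytes = 0
    for name, tensor in named_params.items():
        nbytes = tensor.numel() * tensor.element_size()
        if bucket and bucket_bytes + nbytes > limit:
            yield bucket
            bucket, bucket_bytes = [], 0
        bucket.append((name, tensor))
        bucket_bytes += nbytes
    if bucket:
        yield bucket
```

Each bucket would then be flattened and broadcast in a single call, analogous to the adapter packing above.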

@hjh0119 changed the title from "[grpo] Optimize LoRA training vLLM weight synchronization" to "[WIP] Optimize LoRA training vLLM weight synchronization" on Sep 11, 2025
@hjh0119 marked this pull request as ready for review on September 11, 2025, 09:27
@hjh0119 (Collaborator, Author) commented on Sep 11, 2025

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces an optimization for LoRA training with vLLM by enabling in-memory weight synchronization using flattened tensors. This avoids disk I/O and should improve training speed. The changes involve adding new arguments, new protocol definitions, and new methods in the rollout engine and GRPO trainer. A key part of the implementation is monkey-patching vLLM to support loading LoRA adapters from tensors. The overall approach is sound, but there are a few areas that need attention, such as ensuring deterministic adapter selection, cleaning up commented-out code, and addressing TODO comments.
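
To make this concrete, here is a minimal sketch of the receiving side under the same illustrative assumptions as the sketches above (the metadata layout, the bfloat16 dtype, and `engine.load_lora_from_tensors` are placeholders; the PR patches vLLM internals rather than exposing such a public method, and the metadata is assumed to have been shared beforehand):

```python
from typing import Dict, List, Tuple

import torch
import torch.distributed as dist


def receive_and_load_adapter(
    engine,
    metadata: List[Tuple[str, torch.Size, int]],
    src_rank: int = 0,
    device: str = "cuda",
) -> None:
    """Receive one flattened adapter buffer and load it without touching disk."""
    # One broadcast for the whole adapter instead of one call per tensor.
    total_numel = sum(numel for _, _, numel in metadata)
    flat = torch.empty(total_numel, dtype=torch.bfloat16, device=device)
    dist.broadcast(flat, src=src_rank)

    # Slice the buffer back into named tensors using the shared metadata.
    adapter_state: Dict[str, torch.Tensor] = {}
    offset = 0
    for name, shape, numel in metadata:
        adapter_state[name] = flat[offset:offset + numel].view(shape)
        offset += numel

    # Register the in-memory adapter with the inference engine
    # (placeholder hook; not a public vLLM API).
    engine.load_lora_from_tensors(adapter_state)
```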

@hjh0119 changed the title from "[WIP] Optimize LoRA training vLLM weight synchronization" to "Optimize LoRA training vLLM weight synchronization" on Sep 12, 2025
@hjh0119 (Collaborator, Author) commented on Sep 12, 2025

Qwen2.5-VL-7B-Instruct, server mode, tp=2, dp=2 → 10× speed-up

[benchmark screenshot]

@hjh0119 (Collaborator, Author) commented on Sep 12, 2025

/gemini review

@gemini-code-assist (bot) left a comment

Code Review

This pull request introduces a significant optimization for LoRA training with vLLM by synchronizing only the adapter weights instead of the full model. This is achieved by patching vLLM to load adapters from in-memory tensors and flattening the weights for more efficient communication. The changes are well-implemented across the trainer, rollout logic, and communication protocols. The documentation has also been updated to reflect these new features. My review includes a few suggestions to improve code quality and fix a minor typo in the documentation.

@hjh0119 changed the title from "Optimize LoRA training vLLM weight synchronization" to "[grpo] Optimize LoRA training vLLM weight synchronization" on Sep 12, 2025
@hjh0119 changed the title from "[grpo] Optimize LoRA training vLLM weight synchronization" to "[grpo] Optimize vLLM weight synchronization" on Oct 10, 2025
@hjh0119 changed the title from "[grpo] Optimize vLLM weight synchronization" to "[grpo] Optimize vLLM weight synchronization for server mode" on Oct 10, 2025
@hjh0119 changed the title from "[grpo] Optimize vLLM weight synchronization for server mode" to "[grpo] Optimize vLLM weight synchronization" on Oct 10, 2025