@bigximik bigximik commented Jul 30, 2025

✨ Description

Add tensor parallelism support for HF wrapper forward and lm_eval integration

Closes #334

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

Key updates introduced in this PR:

  1. Fixed a bug where the batch config was read from the wrong place.
  2. Added additional broadcast primitives and optimized `_object_to_tensor` for speed, following the upstream PyTorch implementation.
  3. Added tensor-parallel logits collection in the model head.
  4. Added tensor-parallel support to `forward`.
  5. Added a coordinator-forward mode, which lets `generate` run only on data-parallel leader ranks while tensor-parallel workers participate through `worker_forward`.
  6. Added model and pipeline parallelism support to the lm_eval wrapper.
  7. Added wait barriers in critical places, since the default 60 s timeout on distributed primitives was insufficient in cases such as slow post-processing for some lm_eval tasks, or incomplete batches that leave some data-parallel ranks with no data.
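The `_object_to_tensor` optimization in item 2 can be sketched as follows: an arbitrary Python object is pickled into a flat byte buffer so it can be broadcast as a tensor, with the length sent alongside it. This is a minimal illustration using NumPy in place of torch tensors so it is self-contained; the function names are illustrative, not the PR's actual API.

```python
import io
import pickle

import numpy as np


def object_to_buffer(obj):
    """Pickle `obj` into a flat uint8 array plus its byte length.

    In the real code the array would be a torch tensor that is
    broadcast to the other ranks (size first, then data).
    """
    buf = io.BytesIO()
    pickle.dump(obj, buf)
    # frombuffer avoids an extra copy compared to building the
    # array element by element (the gist of the upstream optimization).
    data = np.frombuffer(buf.getbuffer(), dtype=np.uint8)
    return data, data.size


def buffer_to_object(data, size):
    """Inverse: reconstruct the object from the first `size` bytes."""
    return pickle.loads(data[:size].tobytes())
```

Receivers only need the size to slice a possibly over-allocated buffer back down before unpickling.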
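For item 3, tensor-parallel logits collection amounts to a gather across ranks followed by a concatenation along the vocabulary dimension, since each TP rank holds the logits for its own vocabulary shard. A minimal sketch with NumPy arrays standing in for per-rank tensors (the actual code would use a distributed all-gather):

```python
import numpy as np


def gather_logits(shards):
    """Reassemble full logits from per-rank vocabulary shards.

    shards: list of [batch, seq, vocab // tp_size] arrays, one per
    tensor-parallel rank, in rank order.
    """
    return np.concatenate(shards, axis=-1)
```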
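The coordinator-forward mode of item 5 can be thought of as a simple control protocol: the data-parallel leader drives generation and broadcasts each batch, while tensor-parallel workers sit in a receive loop until a stop message releases them. The sketch below uses an in-process `Comm` object as a toy stand-in for a real broadcast primitive; all names here are illustrative assumptions, not the PR's actual interface.

```python
FORWARD, STOP = "forward", "stop"


class Comm:
    """In-process stand-in for a broadcast channel between ranks."""

    def __init__(self):
        self.messages = []

    def broadcast(self, msg):
        self.messages.append(msg)

    def receive(self, i):
        return self.messages[i]


def coordinator_forward(comm, model, batches):
    """Leader rank: broadcast each batch, run the model, then stop workers."""
    outputs = []
    for batch in batches:
        comm.broadcast((FORWARD, batch))  # wake the TP workers
        outputs.append(model(batch))      # leader participates in the forward
    comm.broadcast((STOP, None))          # release the workers' loop
    return outputs


def worker_forward(comm, model):
    """Worker rank: run forwards on broadcast batches until told to stop."""
    i = 0
    while True:
        op, batch = comm.receive(i)
        i += 1
        if op == STOP:
            break
        model(batch)  # contribute this rank's shard of the computation
```

The point of the protocol is that only leader ranks need the generation logic; workers never see `generate`, only a stream of forward requests.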

🗒️ Notes and Known Issues

  • Manually tested with DP and TP on 2 GPUs, DP+TP on 4 GPUs, and on a single GPU.
  • There is a problem with CUDA memory fragmentation, potentially caused by scattering and broadcasting tensors of different sizes.
  • For some tasks (e.g., Wikitext, which uses sliding-window log-likelihood), processing is very slow in data- and model-parallel setups. This is likely because logits are sent to rank 0 and offloaded to CPU before softmax is applied; the problem worsens with larger batch sizes.
  • High memory usage was observed in general; for example, with the Qwen 1.5B model and batch size 3 per GPU, memory spikes to nearly 100% during evaluation.

@bigximik bigximik changed the title [WIP] Add tensor parallelism (and general model/sequence parallelism) support for HF wrapper forward and lm_eval integration [WIP] Add tensor parallelism support for HF wrapper forward and lm_eval integration Aug 6, 2025
@bigximik bigximik changed the title [WIP] Add tensor parallelism support for HF wrapper forward and lm_eval integration Add tensor parallelism support for HF wrapper forward and lm_eval integration Aug 20, 2025
@bigximik bigximik marked this pull request as ready for review August 20, 2025 12:18
Successfully merging this pull request may close these issues.

Support Tensor Parallelism in inference