@bigximik bigximik commented Jul 30, 2025

✨ Description

Add tensor parallelism support for HF wrapper forward and lm_eval integration

Closes #334

🔍 Type of change

Select all that apply:

  • 🐛 Bug fix (non-breaking change that addresses a specific issue)
  • 🚀 New feature (non-breaking change that adds functionality)
  • ⚠️ Breaking change (a change that could affect existing functionality)
  • 📈 Performance improvement/optimization (improves speed, memory usage, or efficiency)
  • 🛠️ Code refactor (non-functional changes that improve code readability, structure, etc.)
  • 📦 Dependency bump (updates dependencies, including Dockerfile or package changes)
  • 📝 Documentation change (updates documentation, including new content or typo fixes)
  • 🔧 Infrastructure/Build change (affects build process, CI/CD, or dependencies)

📝 Changes

Key updates introduced in this PR:

  1. Fixed a bug where the batch config was read from the wrong place.
  2. Added additional broadcast primitives and optimized `_object_to_tensor` for speed, following the upstream PyTorch implementation.
  3. Added tensor-parallel logits collection in the model head.
  4. Added tensor-parallel support to `forward`.
  5. Added a coordinator-forward mode, which lets `generate` run only on data-parallel leader ranks while tensor-parallel workers participate through `worker_forward`.
  6. Added model and pipeline parallelism support to the lm_eval wrapper.
  7. Added wait barriers in critical places, since the default 60 s timeout on distributed primitives was insufficient in cases such as slow post-processing for some lm_eval tasks, or incomplete batches that leave some data-parallel ranks with no data.
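The `_object_to_tensor` optimization in item 2 can be sketched as follows: an arbitrary Python object is pickled into a flat byte buffer so it can be broadcast as a tensor, with the length sent alongside it. This is a minimal illustration using NumPy in place of torch tensors so it is self-contained; the function names are illustrative, not the PR's actual API.

```python
import io
import pickle

import numpy as np


def object_to_buffer(obj):
    """Pickle `obj` into a flat uint8 array plus its byte length.

    In the real code the array would be a torch tensor that is
    broadcast to the other ranks (size first, then data).
    """
    buf = io.BytesIO()
    pickle.dump(obj, buf)
    # frombuffer avoids an extra copy compared to building the
    # array element by element (the gist of the upstream optimization).
    data = np.frombuffer(buf.getbuffer(), dtype=np.uint8)
    return data, data.size


def buffer_to_object(data, size):
    """Inverse: reconstruct the object from the first `size` bytes."""
    return pickle.loads(data[:size].tobytes())
```

Receivers only need the size to slice a possibly over-allocated buffer back down before unpickling.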
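For item 3, tensor-parallel logits collection amounts to a gather across ranks followed by a concatenation along the vocabulary dimension, since each TP rank holds the logits for its own vocabulary shard. A minimal sketch with NumPy arrays standing in for per-rank tensors (the actual code would use a distributed all-gather):

```python
import numpy as np


def gather_logits(shards):
    """Reassemble full logits from per-rank vocabulary shards.

    shards: list of [batch, seq, vocab // tp_size] arrays, one per
    tensor-parallel rank, in rank order.
    """
    return np.concatenate(shards, axis=-1)
```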
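The coordinator-forward mode of item 5 can be thought of as a simple control protocol: the data-parallel leader drives generation and broadcasts each batch, while tensor-parallel workers sit in a receive loop until a stop message releases them. The sketch below uses an in-process `Comm` object as a toy stand-in for a real broadcast primitive; all names here are illustrative assumptions, not the PR's actual interface.

```python
FORWARD, STOP = "forward", "stop"


class Comm:
    """In-process stand-in for a broadcast channel between ranks."""

    def __init__(self):
        self.messages = []

    def broadcast(self, msg):
        self.messages.append(msg)

    def receive(self, i):
        return self.messages[i]


def coordinator_forward(comm, model, batches):
    """Leader rank: broadcast each batch, run the model, then stop workers."""
    outputs = []
    for batch in batches:
        comm.broadcast((FORWARD, batch))  # wake the TP workers
        outputs.append(model(batch))      # leader participates in the forward
    comm.broadcast((STOP, None))          # release the workers' loop
    return outputs


def worker_forward(comm, model):
    """Worker rank: run forwards on broadcast batches until told to stop."""
    i = 0
    while True:
        op, batch = comm.receive(i)
        i += 1
        if op == STOP:
            break
        model(batch)  # contribute this rank's shard of the computation
```

The point of the protocol is that only leader ranks need the generation logic; workers never see `generate`, only a stream of forward requests.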

🗒️ Notes and Known Issues

  • Manually tested with DP and TP on 2 GPUs, DP+TP on 4 GPUs, and on a single GPU.
  • There is a problem with CUDA memory fragmentation, potentially caused by scattering and broadcasting tensors of different sizes.
  • For some tasks (e.g., Wikitext, which uses sliding-window log-likelihood), processing is very slow in data- and model-parallel setups. This is likely because logits are sent to rank 0 and offloaded to CPU before softmax is applied; the problem worsens with larger batch sizes.
  • High memory usage was observed in general; for example, with the Qwen 1.5B model and batch size 3 per GPU, memory spikes to nearly 100% during evaluation.

@bigximik bigximik changed the title [WIP] Add tensor parallelism (and general model/sequence parallelism) support for HF wrapper forward and lm_eval integration [WIP] Add tensor parallelism support for HF wrapper forward and lm_eval integration Aug 6, 2025
@bigximik bigximik changed the title [WIP] Add tensor parallelism support for HF wrapper forward and lm_eval integration Add tensor parallelism support for HF wrapper forward and lm_eval integration Aug 20, 2025
@bigximik bigximik marked this pull request as ready for review August 20, 2025 12:18
Successfully merging this pull request may close these issues.

Support Tensor Parallelism in inference