[aoti-et] Enable multimodal runner for Voxtral on CUDA #14980
base: main
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/14980
Note: Links to docs will display an error until the docs builds have been completed.
❌ 4 New Failures, 17 Pending as of commit be5d187 with merge base 66c3dea.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for your great work!
The size/stride change seems strange to me: I can't imagine a case where the tensor pointer stays the same while its size/stride changes.
std::vector<int64_t> strides(tensor->dim());
auto tensor_strides = tensor->strides();
for (ssize_t i = 0; i < tensor->dim(); i++) {
  strides[i] = static_cast<int64_t>(tensor_strides[i]);
}
auto it =
    internal::tensor_to_strides.insert_or_assign(tensor, std::move(strides))
        .first;
why are we now allocating a vector unconditionally? this seems less efficient than the old code.
I just think the original logic is too complex.
We could also do an in-place update: if the tensor is already in the map, we can reuse the existing vector instead of creating a new one.
That keeps the same order of memory consumption while making the logic cleaner.
the branch on whether the tensor is already present in the map before creating and filling out a new vector is very important; it's the difference between doing a heap allocation once and doing it every time.
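A minimal sketch of the check-before-allocate pattern being discussed, reusing the map and loop from the diff above; the exact surrounding function in common_shims.cpp is assumed, not quoted:

// Look up the cache entry first; only pay for a heap allocation on a miss.
auto it = internal::tensor_to_strides.find(tensor);
if (it == internal::tensor_to_strides.end()) {
  std::vector<int64_t> strides(tensor->dim());
  auto tensor_strides = tensor->strides();
  for (ssize_t i = 0; i < tensor->dim(); i++) {
    strides[i] = static_cast<int64_t>(tensor_strides[i]);
  }
  it = internal::tensor_to_strides.emplace(tensor, std::move(strides)).first;
} else {
  // Cache hit: refresh the existing vector in place instead of allocating a new one.
  auto& strides = it->second;
  strides.resize(tensor->dim());
  auto tensor_strides = tensor->strides();
  for (ssize_t i = 0; i < tensor->dim(); i++) {
    strides[i] = static_cast<int64_t>(tensor_strides[i]);
  }
}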
  sizes[i] = tensor_sizes[i];
}
it = internal::tensor_to_sizes.emplace(tensor, std::move(sizes)).first;
std::vector<int64_t> sizes(tensor->dim());
ditto
// bfloat16 is the upper 16 bits of float32
uint32_t float_bits;
std::memcpy(&float_bits, &float_data[i], sizeof(float));

// Rounding: add 0x7FFF to round to nearest even
uint32_t rounding_bias = 0x7FFF + ((float_bits >> 16) & 1);
bf16_data[i] = static_cast<uint16_t>((float_bits + rounding_bias) >> 16);
why can't we use ExecuTorch's BFloat16 class (which is c10::BFloat16 underneath) for this?
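For reference, a rough sketch of what the suggested alternative could look like in this loop. The executorch::aten::BFloat16 alias (c10::BFloat16 underneath in ATen mode) and the numel variable are assumptions; float_data and bf16_data are the buffers from the diff above. The class's float constructor performs the same round-to-nearest-even conversion.

#include <executorch/runtime/core/exec_aten/exec_aten.h>
#include <cstring>

// numel is assumed to be the number of elements being converted.
for (size_t i = 0; i < numel; ++i) {
  executorch::aten::BFloat16 b(float_data[i]);        // round-to-nearest-even conversion
  std::memcpy(&bf16_data[i], &b, sizeof(bf16_data[i]));  // store the raw 16-bit pattern
}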
This pull request introduces changes to the CUDA workflow, model artifact handling, and multimodal runner logic. The main changes include restructuring the GitHub Actions workflow to separate model export, benchmarking, and end-to-end testing for the Voxtral CUDA pipeline, improving artifact management and reproducibility. Additionally, the multimodal runner now supports automatic conversion of audio tensors to bfloat16, ensuring compatibility with expected input types. There are also enhancements to caching and symbol registration in the CUDA backend, and build system updates to support linking the CUDA backend.
Workflow and Artifact Management Improvements:
- Restructured .github/workflows/cuda.yml to split the Voxtral CUDA pipeline into three jobs: export-voxtral-cuda-artifact (exports and stores model artifacts), benchmark-voxtral-cuda (benchmarks using the exported artifacts), and test-voxtral-cuda-e2e (runs full end-to-end tests with artifact download and audio input). Improved artifact handling and reproducibility, and added explicit checks for required files. [1] [2] [3] [4] [5]

Multimodal Runner Logic:
- Added automatic conversion of audio input tensors to bfloat16 in MultimodalPrefiller::prefill, with a helper function convert_to_bfloat16 in util.h to support this (a sketch of such a helper follows this list). This ensures that audio inputs match the expected dtype for the encoder, improving robustness for multimodal inference. [1] [2]

CUDA Backend and Caching Enhancements:
- Improved caching in common_shims.cpp for tensor strides and sizes by validating cached values and updating them when necessary. This prevents stale cache issues and ensures correct tensor metadata. [1] [2]
- Updated CudaBackend to handle multiple shared objects in the same process, ensuring correct execution when switching between models.

Build System Updates:
- Updated CMakeLists.txt and executorch-config.cmake to include and link the CUDA backend (aoti_cuda) when building Voxtral and other components, improving build flexibility and CUDA support. [1] [2]

Debugging and Tuning Options:
- Added a debug option in cuda_backend.py, controlled by the DEBUG environment variable, allowing easier troubleshooting and development.
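As referenced above, a rough sketch of what the convert_to_bfloat16 helper could look like, mirroring the rounding logic shown in the diff; the exact signature, tensor types, and allocation strategy used in util.h are assumptions for illustration:

#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: convert a float32 buffer to bfloat16 (stored as uint16_t)
// by taking the upper 16 bits of each float with round-to-nearest-even.
std::vector<uint16_t> convert_to_bfloat16(const float* float_data, size_t numel) {
  std::vector<uint16_t> bf16_data(numel);
  for (size_t i = 0; i < numel; ++i) {
    uint32_t float_bits;
    std::memcpy(&float_bits, &float_data[i], sizeof(float));
    // Add 0x7FFF plus the lowest kept bit so ties round to even.
    uint32_t rounding_bias = 0x7FFF + ((float_bits >> 16) & 1);
    bf16_data[i] = static_cast<uint16_t>((float_bits + rounding_bias) >> 16);
  }
  return bf16_data;
}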