
Conversation

zhaozheng09

We have a host-bound model on A100.
I found that embedding key preprocessing incurs a long wait, so the subsequent CPU scheduling cannot keep up with the GPU work, which results in low GPU utilization.
So we moved the unique and wait ops from the main thread to a separate thread.
before: [trace screenshot]
after: [trace screenshot]

I have only roughly sketched a multi-threaded solution and am not sure whether it is feasible. If it is, I will provide the complete code.
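
Roughly, the idea looks like the sketch below. It is only an illustration of the threading pattern, not the actual patch: the function and variable names (`preprocess_keys`, `train_loop`, `batch["keys"]`) are hypothetical, and the real change would live inside the training pipeline code.

```python
# Hypothetical sketch: run the key preprocessing (torch.unique + the H2D copy
# wait) on a single worker thread so the main thread can keep scheduling GPU
# work for the current batch. Names are illustrative only, not from this PR.
from concurrent.futures import ThreadPoolExecutor

import torch

_preproc_pool = ThreadPoolExecutor(max_workers=1)


def preprocess_keys(keys_cpu: torch.Tensor) -> torch.Tensor:
    """Dedup embedding keys on CPU and copy them to the GPU on a side stream.

    This is the part that previously blocked the main thread.
    """
    unique_keys = torch.unique(keys_cpu)
    side_stream = torch.cuda.Stream()
    with torch.cuda.stream(side_stream):
        unique_keys_gpu = unique_keys.to("cuda", non_blocking=True)
    side_stream.synchronize()  # the "wait" now happens off the main thread
    return unique_keys_gpu


def train_loop(batches, model, optimizer):
    """Overlap preprocessing of the next batch with GPU work on the current one.

    Assumes `model(keys, dense)` returns the loss directly, for brevity.
    """
    it = iter(batches)
    batch = next(it)
    future = _preproc_pool.submit(preprocess_keys, batch["keys"])
    for next_batch in it:
        unique_keys_gpu = future.result()  # usually ready by now
        # Kick off preprocessing for the *next* batch before running this one.
        future = _preproc_pool.submit(preprocess_keys, next_batch["keys"])
        loss = model(unique_keys_gpu, batch["dense"].to("cuda", non_blocking=True))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        batch = next_batch
    # Flush the final batch.
    loss = model(future.result(), batch["dense"].to("cuda", non_blocking=True))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

A single worker thread keeps the preprocessing ops ordered, which avoids having to synchronize multiple preprocessing threads against each other.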

meta-cla bot added the CLA Signed label on Sep 11, 2025.
@TroyGarden (Contributor)

Hi @zhaozheng09, thanks for the PR. I'm wondering if you can share the trace files (before vs. after) so that we have better context. Usually the data preprocessing (PreProc) is done in a remote worker, but we'll consider your use case as well.

@zhaozheng09 (Author)

> Hi @zhaozheng09, thanks for the PR. I'm wondering if you can share the trace files (before vs. after) so that we have better context. Usually the data preprocessing (PreProc) is done in a remote worker, but we'll consider your use case as well.

I tried to upload the traces but got: "File size too big: 25 MB are allowed, 1066 MB were attempted to upload." Is there another way to share them, or do I need to provide any additional information?
