
Is the embedding weight for models that use weight tying duplicated for "offloading"? #6451

Answered by ggerganov
qnixsynapse asked this question in Q&A

It is stored once in CPU RAM for the input token embeddings (ctx_input) and once more in GPU RAM for the output (ctx_output). So yes, it is duplicated: one copy is in RAM and the other is in VRAM.
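To get a rough feel for what this duplication costs, here is a minimal Python sketch. It is not llama.cpp code: the function name `tied_embedding_footprint` and the example numbers (a 32000-token vocabulary, embedding size 4096, 2 bytes per weight) are assumptions chosen purely for illustration, and quantized formats will deviate somewhat from a flat bytes-per-weight estimate.

```python
def tied_embedding_footprint(n_vocab: int, n_embd: int, bytes_per_weight: float) -> dict:
    """Estimate memory used by a tied embedding tensor that is kept twice."""
    # Size of one copy of the [n_vocab x n_embd] embedding matrix.
    one_copy = int(n_vocab * n_embd * bytes_per_weight)
    return {
        "ram_bytes": one_copy,       # copy used for the input token embeddings (CPU)
        "vram_bytes": one_copy,      # copy used for the output projection (GPU)
        "total_bytes": 2 * one_copy,
    }


if __name__ == "__main__":
    # Hypothetical model dimensions, not taken from the discussion.
    sizes = tied_embedding_footprint(n_vocab=32000, n_embd=4096, bytes_per_weight=2.0)
    for name, value in sizes.items():
        print(f"{name}: {value / 2**20:.1f} MiB")
```

Under these assumed dimensions each copy is about 250 MiB, so the extra cost of the duplication is one additional copy of that size in VRAM.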

qnixsynapse Apr 3, 2024
Collaborator Author

Answer selected by qnixsynapse