Is the embedding weight duplicated for "offloading" in models that use weight tying? #6451
-
Hello. I am not an expert in C++, but from what I understand of the code, the embedding weight is duplicated for models that use weight tying, such as Gemma 7B (actually 8.5B):

```cpp
model.output = ml.create_tensor(ctx_output, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab}); // same as tok_embd, duplicated to allow offloading
```

If this is really the case, I think it increases the model size significantly for no quality benefit. I was wondering why I always get OOM when trying to load Gemma 7B on my GPU, and this might be the reason: the shape of this tensor is (256000 x 3072)! I am not entirely sure, so I thought I would ask here before opening an issue.
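For a rough sense of scale, here is a back-of-the-envelope sketch (plain C++, not llama.cpp code; the per-element sizes are assumptions that depend on the quantization type actually used for this tensor):

```cpp
// Back-of-the-envelope sketch (not llama.cpp code): estimates the extra memory
// taken by a second copy of the (n_vocab x n_embd) embedding tensor.
#include <cstdio>

int main() {
    const long long n_vocab = 256000;  // Gemma vocabulary size (from the shape above)
    const long long n_embd  = 3072;    // Gemma 7B embedding width (from the shape above)
    const long long n_elem  = n_vocab * n_embd;

    // Bytes per element depend on the storage type: F16 is 2 bytes,
    // Q4_0 is roughly 4.5 bits (18 bytes per block of 32 weights).
    const double gib = 1024.0 * 1024.0 * 1024.0;
    std::printf("elements        : %lld\n", n_elem);
    std::printf("extra copy, F16 : %.2f GiB\n", n_elem * 2.0 / gib);
    std::printf("extra copy, Q4_0: %.2f GiB\n", n_elem * (18.0 / 32.0) / gib);
    return 0;
}
```

At F16 the duplicated tensor alone would account for roughly 1.5 GiB of extra memory.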
-
It is stored one time in CPU RAM for the input token embeddings (`ctx_input`) and one more time here in GPU RAM for the output (`ctx_output`). So the answer is that it is duplicated: one of the copies is in RAM and the other is in VRAM.
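To make the trade-off concrete, here is a minimal sketch (hypothetical structs, not llama.cpp's actual types) of tying the tensor versus duplicating it for offloading:

```cpp
// Minimal conceptual sketch (not the actual llama.cpp implementation): with
// weight tying the output projection simply reuses the embedding buffer, so
// both must live in the same memory. To run the output matmul on the GPU, a
// separate copy of the same values is kept in a device buffer instead.
#include <vector>

struct Tensor {
    std::vector<float> data;  // stand-in for a real backend buffer (RAM or VRAM)
};

struct Model {
    Tensor  tok_embd;         // input embeddings, kept in CPU RAM
    Tensor* output_tied;      // weight tying: just a pointer to tok_embd, no extra memory
    Tensor  output_offloaded; // offloading: a duplicate that would live in VRAM
};

int main() {
    Model m;
    // Real shape would be 256000 x 3072; kept tiny here so the sketch runs anywhere.
    m.tok_embd.data.assign(16 * 8, 0.5f);

    // Tied: zero extra memory, but the output matmul has to run where tok_embd lives.
    m.output_tied = &m.tok_embd;

    // Offloaded: copy the same values into a second buffer (here just another
    // host vector), trading extra memory for running the matmul on the device.
    m.output_offloaded.data = m.tok_embd.data;
    return 0;
}
```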