-
I agree - this is a very interesting area for experiments. User @xaedes has laid the foundation for training with the baby-llama example and is also making very interesting progress at full-text training: ggml-org/ggml#8 (comment)

CPU-based training / fine-tuning with quantization support could be very useful, since it is much easier to afford a machine with >128GB of RAM than the equivalent amount of GPU memory. We also have the mechanism to offload part of the computation to the GPU if necessary to get a bit of extra performance.

In general, it looks like we have a good opportunity for demonstrating
The author actually acknowledged that GPTQ quantization is superior to NF4: https://twitter.com/Tim_Dettmers/status/1661482614811918338
-
I am getting the error "No GPU found. A GPU is needed for quantization." with the following code snippet on an M2 macOS machine that has 12 CPU cores and a 38-core GPU. How will QLoRA/quantization work on M2 macOS systems that only expose "mps"?

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bnb_config (a BitsAndBytesConfig) is defined earlier and not shown in this snippet
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)  # , device_map={"": 0})
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device=torch.device('cpu'))
```
-
It would be interesting to compare this approach to the quantization in llama.cpp:
https://huggingface.co/blog/4bit-transformers-bitsandbytes
As I understand it, the main idea is to fine-tune the model with a LoRA on each layer after 4-bit quantization, to restore performance to pre-quantization levels.
This could probably be applied to a GGML-quantized model as well - either by doing the actual fine-tuning in GGML or by training in Python and exporting the LoRA.
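For reference, a minimal sketch of what that setup looks like on the Python side with transformers, bitsandbytes and peft. The model name and LoRA hyperparameters below are placeholders, not values from the paper or from this repo:

```python
# QLoRA-style setup: load the base model quantized to 4-bit NF4 with bitsandbytes,
# freeze it, and attach trainable 16-bit LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "EleutherAI/gpt-neox-20b"  # placeholder model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat quantization
    bnb_4bit_use_double_quant=True,      # double quantization of the scales
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Only the LoRA weights are trained; the 4-bit base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # module names depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The resulting adapter tensors could then, in principle, be exported and converted for use with a GGML model, as suggested above.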
Some additional techniques they claim help generation quality:
**NormalFloat quantization**
This sounds similar to what @MarcioPais experimented with in #397 (comment), where they said:
It is interesting that the paper calls this out as a clear improvement. Some possibilities I can think of:
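For illustration, here is a rough sketch of the idea behind NormalFloat quantization: place the 4-bit code points at quantiles of a standard normal distribution (weights are assumed to be roughly normally distributed), then quantize each block against its absmax. This only approximates the construction; the actual NF4 table in bitsandbytes is fixed and pins an exact zero code point:

```python
import numpy as np
from scipy.stats import norm

def normalfloat_codebook(bits=4):
    # Evenly spaced probabilities, avoiding 0 and 1 where the quantile is infinite,
    # mapped through the inverse normal CDF and normalized into [-1, 1].
    n = 2 ** bits
    p = np.linspace(0.5 / n, 1 - 0.5 / n, n)
    q = norm.ppf(p)
    return q / np.abs(q).max()

def quantize_block(w, codebook):
    scale = np.abs(w).max()                     # per-block absmax scale
    idx = np.abs(w[:, None] / scale - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_block(idx, scale, codebook):
    return codebook[idx] * scale

codebook = normalfloat_codebook()
w = np.random.randn(64).astype(np.float32)      # one block of 64 weights
idx, scale = quantize_block(w, codebook)
w_hat = dequantize_block(idx, scale, codebook)
print("rmse:", np.sqrt(np.mean((w - w_hat) ** 2)))
```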
**Double Quantization**
Very similar to the super blocks @ikawrakow uses in #1256. The paper uses an 8-bit scale value for every 64 4-bit weights, and a 32-bit scale for every 256 8-bit scales.
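A small numerical sketch of that layout, using the block sizes described above (this is only an illustration of the memory trade-off, not the bitsandbytes code, which quantizes the scales to an 8-bit float format after centering them):

```python
# Double quantization: the fp32 absmax scale of every 64-weight block is itself
# quantized to 8 bits in groups of 256, with one fp32 "super" scale per group.
import numpy as np

weights = np.random.randn(64 * 256).astype(np.float32)
blocks = weights.reshape(-1, 64)

# First level: one absmax scale per block of 64 weights (fp32 so far).
scales = np.abs(blocks).max(axis=1)

# Second level: quantize every 256 scales to 8 bits, keeping one fp32 super-scale.
scale_groups = scales.reshape(-1, 256)
super_scales = scale_groups.max(axis=1, keepdims=True)
scales_q8 = np.round(scale_groups / super_scales * 255).astype(np.uint8)

# Scales reconstructed at dequantization time.
scales_hat = (scales_q8.astype(np.float32) / 255) * super_scales
print("max relative error on the scales:",
      (np.abs(scales - scales_hat.reshape(-1)) / scales).max())

# Overhead per 64-weight block drops from 32 bits to about 8 + 32/256 ≈ 8.125 bits.
```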
**Other notes**
They don't show any results for 3-bit quantization, which seems like an obvious next step.