-
I agree - this is a very interesting area for experiments. User @xaedes has laid the foundation for training with the baby-llama example and is also making very interesting progress at full-text training: ggml-org/ggml#8 (comment)

CPU-based training / fine-tuning with quantization support could be very useful, since it is much easier to afford a machine with >128GB of RAM than the equivalent amount of GPU memory. We also have the mechanism to offload part of the computation to the GPU if necessary to get a bit of extra performance.

In general, it looks like we have a good opportunity for demonstrating
The author actually acknowledged that GPTQ quantization is superior to NF4: https://twitter.com/Tim_Dettmers/status/1661482614811918338
-
I am getting the error "No GPU found. A GPU is needed for quantization." with the following code snippet on an M2 macOS machine that has 12 CPU cores and a 38-core GPU. How will QLoRA/quantization work on M2 macOS systems that only expose "mps"?

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# bnb_config (a BitsAndBytesConfig) is defined earlier and not shown in this snippet
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)  # , device_map={"": 0})
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config, device=torch.device('cpu'))
```
-
It would be interesting to compare this approach to the quantization in llama.cpp:
https://huggingface.co/blog/4bit-transformers-bitsandbytes
As I understand it, the main idea is to fine-tune the model with a LoRA on each layer after 4-bit quantization, to restore performance to pre-quantization levels.
This could probably be applied to a GGML-quantized model as well - either by doing the actual fine-tuning in GGML or by training in Python and exporting the LoRA.
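For reference, a minimal sketch of what that setup looks like on the Python side with transformers, bitsandbytes and peft. The model name and LoRA hyperparameters below are placeholders, not values from the paper or from this repo:

```python
# QLoRA-style setup: load the base model quantized to 4-bit NF4 with bitsandbytes,
# freeze it, and attach trainable 16-bit LoRA adapters on top.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "EleutherAI/gpt-neox-20b"  # placeholder model

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NormalFloat quantization
    bnb_4bit_use_double_quant=True,      # double quantization of the scales
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, quantization_config=bnb_config)

# Only the LoRA weights are trained; the 4-bit base weights stay frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["query_key_value"],  # module names depend on the architecture
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```

The resulting adapter tensors could then, in principle, be exported and converted for use with a GGML model, as suggested above.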
Some additional techniques they claim help generation quality:
**NormalFloat quantization**
This sounds similar to what @MarcioPais experimented with in #397 (comment), where they said:
It is interesting that the paper calls this out as a clear improvement. Some possibilities I can think of:
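For illustration, here is a rough sketch of the idea behind NormalFloat quantization: place the 4-bit code points at quantiles of a standard normal distribution (weights are assumed to be roughly normally distributed), then quantize each block against its absmax. This only approximates the construction; the actual NF4 table in bitsandbytes is fixed and pins an exact zero code point:

```python
import numpy as np
from scipy.stats import norm

def normalfloat_codebook(bits=4):
    # Evenly spaced probabilities, avoiding 0 and 1 where the quantile is infinite,
    # mapped through the inverse normal CDF and normalized into [-1, 1].
    n = 2 ** bits
    p = np.linspace(0.5 / n, 1 - 0.5 / n, n)
    q = norm.ppf(p)
    return q / np.abs(q).max()

def quantize_block(w, codebook):
    scale = np.abs(w).max()                     # per-block absmax scale
    idx = np.abs(w[:, None] / scale - codebook[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), scale

def dequantize_block(idx, scale, codebook):
    return codebook[idx] * scale

codebook = normalfloat_codebook()
w = np.random.randn(64).astype(np.float32)      # one block of 64 weights
idx, scale = quantize_block(w, codebook)
w_hat = dequantize_block(idx, scale, codebook)
print("rmse:", np.sqrt(np.mean((w - w_hat) ** 2)))
```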
**Double Quantization**
Very similar to the super blocks @ikawrakow uses in #1256. The paper uses an 8-bit scale value for every 64 4-bit weights, and a 32-bit scale for every 256 8-bit scales.
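A small numerical sketch of that layout, using the block sizes described above (this is only an illustration of the memory trade-off, not the bitsandbytes code, which quantizes the scales to an 8-bit float format after centering them):

```python
# Double quantization: the fp32 absmax scale of every 64-weight block is itself
# quantized to 8 bits in groups of 256, with one fp32 "super" scale per group.
import numpy as np

weights = np.random.randn(64 * 256).astype(np.float32)
blocks = weights.reshape(-1, 64)

# First level: one absmax scale per block of 64 weights (fp32 so far).
scales = np.abs(blocks).max(axis=1)

# Second level: quantize every 256 scales to 8 bits, keeping one fp32 super-scale.
scale_groups = scales.reshape(-1, 256)
super_scales = scale_groups.max(axis=1, keepdims=True)
scales_q8 = np.round(scale_groups / super_scales * 255).astype(np.uint8)

# Scales reconstructed at dequantization time.
scales_hat = (scales_q8.astype(np.float32) / 255) * super_scales
print("max relative error on the scales:",
      (np.abs(scales - scales_hat.reshape(-1)) / scales).max())

# Overhead per 64-weight block drops from 32 bits to about 8 + 32/256 ≈ 8.125 bits.
```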
**Other notes**
They don't show any results for 3-bit quantization, which seems like an obvious next step.