support `hf_quantizer` in cache warmup. #12043

sayakpaul · 2025-08-01T10:01:57Z

What does this PR do?

Takes the warmup function close to https://github.com/huggingface/transformers/blob/d3b8627b56caa7ca8fac113c9f28d0256db0194d/src/transformers/modeling_utils.py#L5969

I have gone ahead and also run a snippet from #11904 (comment) and noticed similar timings. So, running the snippet main and this PR branch should yield similar results (not identical because we cannot control it but the difference should be negligible).

HuggingFaceDocBuilderDev · 2025-08-01T10:09:05Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

a-r-r-o-w

Thanks, changes look good!

a-r-r-o-w · 2025-08-06T10:06:13Z

src/diffusers/quantizers/torchao/torchao_quantizer.py

+        - Use a division factor of 4 for int8 weights
+        """
+        # Original mapping for non-AOBaseConfig types
+        map_to_target_dtype = {"int4_*": 8, "int8_*": 4, "float8*": 4}


Need to handle more of these exhaustively:

https://github.com/huggingface/diffusers/tree/1082c46afa4a15c49833d67c7f1c0f3cfd7b0570/src/diffusers/quantizers/quantization_config.py#L633

Took a best guess of "8" for the unsigned int types. I think we can tackle more of these nuanced / lesser-used types as they become a bit more used. I think the int8 and fp8 types are far more common for now 👀

So, I have added a comment as well.

src/diffusers/models/modeling_utils.py

JoeGaffney · 2025-08-17T13:59:18Z

Hey, I'm guessing this is a good optimization. But probably needs documenting for the main release.

I ran into suddenly seeing my memory max out using 4bit Bnb.

2025-08-17 13:34:00,571 - INFO - GPU Memory Usage: 23.63GB / 25.77GB,  Reserved: 24.01GB, Allocated: 12.62GB, Usage: 91.72%

Which confused me for a bit and took a while to track down if something changed in bits and bytes or elsewhere. I was using total GPU usage as a metric and possibly others may use this as saturation guide. As its what's reported on OS and container level.

Cheers,
Joe

sayakpaul · 2025-08-17T14:17:11Z

This PR didn't really add any logging for what you're commenting about.

JoeGaffney · 2025-08-17T14:17:42Z

Thinking about this bit more is does not really make sense to fully saturate the reserved buffer should be something like 1.1 1.05 x of what is needed. Usually the total Allocated would sit pretty close to the model size.

For example here the actually usage was 12GB no where near the reserved.

This is sort of the point of people using the quantization they don't want to fully fill there memory. I know allocated and reserved is different could we run into problems with other processes on the machine needs some gpu memory would it cause a problem.

JoeGaffney · 2025-08-17T14:19:49Z

This PR didn't really add any logging for what you're commenting about.

Hey, this was from my own code using the latest main whit is this is now merged into.

Basically what im seeing is as soon as i load a quantized model with BnB 4bit my gpu memory gets fully reserved.

sayakpaul · 2025-08-17T14:22:59Z

I don't think reserved will cause any problem TBH. Can you check with a commit earlier than this PR and report the reserved memory?

JoeGaffney · 2025-08-17T14:27:01Z

I think it may also be spiking re-size bar I am seeing really large memory usage.

This is from the os level I am just running one flux test loading a transformer and encoder quantized. Which is in docker container.

Sure i can run from an early commit be interesting to see.

sayakpaul · 2025-08-17T14:32:56Z

And let's please try keeping things as minimal as possible so that we, maintainers, can work with a minimal snippet to reproduce the potential bug.

JoeGaffney · 2025-08-17T14:51:51Z

Sure as users we can provide examples this can take some time as often peoples code is split into more modular components and in some cases can't be publicly shared as is.

Will aim to get you something as minimal as possible.

FYI something like this changing memory allocation is prime candidate for a unit test.

I am not seeing this bahaviour one up from this
git+https://github.com/huggingface/diffusers.git@1b48db4c8fe76ffffa7382fd74d9f04d54aa5a16

4:35:02,059 - INFO - GPU Memory Usage: 16.97GB / 25.77GB,  Reserved: 15.99GB, Allocated: 12.70GB, Usage: 65.84%

And no funkiness at the OS level after the test run.

sayakpaul · 2025-08-17T15:17:01Z

Oh indeed, thanks for confirming. Please provide a snippet when you can so that we can reproduce it minimally.

sayakpaul · 2025-08-17T15:24:54Z

Cc: @asomoza as well. Could you check if you see a similar behaviour?

JoeGaffney · 2025-08-17T15:29:06Z

Minimal example just the loading

import pytest
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel


@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA required")
def test_bnb_quantized_model_warmup():
    model_id = "black-forest-labs/FLUX.1-dev"
    torch_dtype = torch.bfloat16

    # Quantization config for 4-bit BNB
    quant_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch_dtype)

    # Load model (actual warmup path triggered internally)
    model = FluxTransformer2DModel.from_pretrained(
        model_id, subfolder="transformer", quantization_config=quant_config, torch_dtype=torch_dtype
    )

    # Check memory stats
    torch.cuda.reset_peak_memory_stats()
    mem_alloc = torch.cuda.memory_allocated()
    mem_reserved = torch.cuda.memory_reserved()
    print(f"Allocated: {mem_alloc/1e6:.1f} MB, Reserved: {mem_reserved/1e6:.1f} MB")

    # Assert some reasonable range 
    assert mem_alloc > 0, "Model should allocate some GPU memory"
    assert mem_reserved > 0, "Warmup should reserve some GPU memory"

git+https://github.com/huggingface/diffusers.git@1b48db4c8fe76ffffa7382fd74d9f04d54aa5a16

docker-compose exec gpu-workers pytest tests/general/test_memory_warmup.py -vs
=========================================================================== test session starts ============================================================================platform linux -- Python 3.11.12, pytest-8.4.1, pluggy-1.5.0 -- /opt/conda/bin/python3.11
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/app/workers/.hypothesis/examples'))
rootdir: /app/workers
configfile: pytest.ini
plugins: hypothesis-6.131.7, hydra-core-1.3.2, anyio-4.10.0
collected 1 item

Fetching 3 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 44462.59it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00,  2.39s/it]

Allocated: 6702.5 MB, Reserved: 6790.6 MB
PASSED

main

docker-compose exec gpu-workers pytest tests/general/test_memory_warmup.py -vs
=========================================================================== test session starts ============================================================================platform linux -- Python 3.11.12, pytest-8.4.1, pluggy-1.5.0 -- /opt/conda/bin/python3.11
cachedir: .pytest_cache
hypothesis profile 'default' -> database=DirectoryBasedExampleDatabase(PosixPath('/app/workers/.hypothesis/examples'))
rootdir: /app/workers
configfile: pytest.ini
plugins: hypothesis-6.131.7, hydra-core-1.3.2, anyio-4.10.0
collected 1 item

Fetching 3 files: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 36472.21it/s]
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:07<00:00,  2.36s/it]

Allocated: 6704.4 MB, Reserved: 23995.6 MB
PASSED

JoeGaffney · 2025-08-17T16:00:28Z

Testing a bit more with torchAO also.

BitsAndBytesConfig {
  "_load_in_4bit": true,
  "_load_in_8bit": false,
  "bnb_4bit_compute_dtype": "bfloat16",
  "bnb_4bit_quant_storage": "uint8",
  "bnb_4bit_quant_type": "nf4",
  "bnb_4bit_use_double_quant": false,
  "llm_int8_enable_fp32_cpu_offload": false,
  "llm_int8_has_fp16_weight": false,
  "llm_int8_skip_modules": null,
  "llm_int8_threshold": 6.0,
  "load_in_4bit": true,
  "load_in_8bit": false,
  "quant_method": "bitsandbytes"
}

Before moving to GPU Allocated: 6704.4 MB, Reserved: 23995.6 MB
After moving to GPU Allocated: 6704.4 MB, Reserved: 23995.6 MB
After moving to CPU Allocated: 0.0 MB, Reserved: 0.0 MB
PASSED

TorchAoConfig {
  "modules_to_not_convert": null,
  "quant_method": "torchao",
  "quant_type": "int8_weight_only",
  "quant_type_kwargs": {}
}

Before moving to GPU Allocated: 0.0 MB, Reserved: 0.0 MB
After moving to GPU Allocated: 12014.9 MB, Reserved: 12306.1 MB
After moving to CPU Allocated: 0.0 MB, Reserved: 0.0 MB
PASSED

import gc

import pytest
import torch
from diffusers import BitsAndBytesConfig, FluxTransformer2DModel, TorchAoConfig


@pytest.fixture(
    params=[
        BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_compute_dtype=torch.bfloat16),
        TorchAoConfig("int8_weight_only"),
    ]
)
def quant_config(request):
    return request.param


def print_gpu_memory_usage(prefix=""):
    torch.cuda.reset_peak_memory_stats()
    mem_alloc = torch.cuda.memory_allocated()
    mem_reserved = torch.cuda.memory_reserved()
    print(f"{prefix} Allocated: {mem_alloc / 1e6:.1f} MB, Reserved: {mem_reserved / 1e6:.1f} MB")


@pytest.mark.skipif(not torch.cuda.is_available(), reason="CUDA required")
def test_quantized_model_warmup(quant_config):
    model_id = "black-forest-labs/FLUX.1-dev"
    torch_dtype = torch.bfloat16

    model = FluxTransformer2DModel.from_pretrained(
        model_id, subfolder="transformer", quantization_config=quant_config, torch_dtype=torch_dtype
    )
    print(str(quant_config))
    print_gpu_memory_usage("Before moving to GPU")

    model.to("cuda")
    print_gpu_memory_usage("After moving to GPU")

    mem_alloc = torch.cuda.memory_allocated()
    mem_reserved = torch.cuda.memory_reserved()
    assert mem_alloc > 0
    assert mem_reserved > 0

    model.to("cpu")
    gc.collect()
    torch.cuda.empty_cache()
    torch.cuda.ipc_collect()

    print_gpu_memory_usage("After moving to CPU")

support hf_quantizer in cache warmup.

48209da

sayakpaul requested a review from a-r-r-o-w August 1, 2025 10:01

Merge branch 'main' into respect-hf-quantizer-cache-warmup

223bb6a

a-r-r-o-w reviewed Aug 6, 2025

View reviewed changes

sayakpaul added 4 commits August 6, 2025 19:21

Merge branch 'main' into respect-hf-quantizer-cache-warmup

5cbf07e

reviewer feedback

1e4cb0b

up

53240c6

up

f1c5093

sayakpaul requested a review from a-r-r-o-w August 6, 2025 14:00

Merge branch 'main' into respect-hf-quantizer-cache-warmup

68431f9

a-r-r-o-w approved these changes Aug 7, 2025

View reviewed changes

sayakpaul added 2 commits August 7, 2025 11:04

Merge branch 'main' into respect-hf-quantizer-cache-warmup

39c3849

Merge branch 'main' into respect-hf-quantizer-cache-warmup

631165d

sayakpaul merged commit 58bf268 into main Aug 14, 2025
34 of 35 checks passed

sayakpaul deleted the respect-hf-quantizer-cache-warmup branch August 14, 2025 13:27

support hf_quantizer in cache warmup. #12043

support hf_quantizer in cache warmup. #12043

Conversation

sayakpaul commented Aug 1, 2025

What does this PR do?

Uh oh!

HuggingFaceDocBuilderDev commented Aug 1, 2025

Uh oh!

a-r-r-o-w left a comment

Choose a reason for hiding this comment

Uh oh!

a-r-r-o-w Aug 6, 2025

Choose a reason for hiding this comment

Uh oh!

sayakpaul Aug 6, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

JoeGaffney commented Aug 17, 2025

Uh oh!

sayakpaul commented Aug 17, 2025

Uh oh!

JoeGaffney commented Aug 17, 2025

Uh oh!

JoeGaffney commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sayakpaul commented Aug 17, 2025

Uh oh!

JoeGaffney commented Aug 17, 2025

Uh oh!

sayakpaul commented Aug 17, 2025

Uh oh!

JoeGaffney commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

sayakpaul commented Aug 17, 2025

Uh oh!

sayakpaul commented Aug 17, 2025

Uh oh!

JoeGaffney commented Aug 17, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

JoeGaffney commented Aug 17, 2025

Uh oh!

Uh oh!

support `hf_quantizer` in cache warmup. #12043

support `hf_quantizer` in cache warmup. #12043

sayakpaul Aug 6, 2025 •

edited

Loading

JoeGaffney commented Aug 17, 2025 •

edited

Loading

JoeGaffney commented Aug 17, 2025 •

edited

Loading

JoeGaffney commented Aug 17, 2025 •

edited

Loading