Releases: bitsandbytes-foundation/bitsandbytes
0.48.0: Intel GPU & Gaudi support, CUDA 13, performance improvements, and more!
Highlights
🎉 Intel GPU Support
We now officially support Intel GPUs on Linux and Windows! All major features (LLM.int8(), QLoRA, 8bit optimizers) are supported, with the exception of paged optimizers.
This support includes the following hardware:
- Intel® Arc™ B-Series Graphics
- Intel® Arc™ A-Series Graphics
- Intel® Data Center GPU Max Series
A compatible PyTorch version with Intel XPU support is required. The current minimum is PyTorch 2.6.0. It is recommended to use the latest stable release. See Getting Started on Intel GPU for guidance.
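For illustration, the snippet below sketches 4bit (QLoRA-style) model loading on an Intel GPU through the Hugging Face transformers integration. It assumes an XPU-enabled PyTorch plus `transformers` and `accelerate` are installed; the checkpoint name and the `device_map="xpu"` placement are illustrative, not prescriptive:

```python
# Minimal sketch: NF4 4bit loading on an Intel GPU (XPU) via transformers.
# Assumes an XPU-enabled PyTorch (2.6+) plus transformers and accelerate.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

assert torch.xpu.is_available()  # requires PyTorch built with Intel XPU support

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",          # illustrative checkpoint; any causal LM works
    quantization_config=quant_config,
    device_map="xpu",             # place the quantized weights on the Intel GPU
)
```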
🎉 Intel Gaudi Support
We now officially support Intel Gaudi2 and Gaudi3 accelerators. This support includes LLM.int8() and QLoRA with the NF4 data type. At this time optimizers are not implemented.
A compatible PyTorch version with Intel Gaudi support is required. The current minimum is Gaudi v1.21 with PyTorch 2.6.0. It is recommended to use the latest stable release. See the Gaudi software installation guide for guidance.
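For illustration, here is a minimal sketch of an NF4 `Linear4bit` layer on a Gaudi device. It assumes the Intel Gaudi PyTorch bridge (`habana_frameworks`) is installed, since importing it is what registers the `hpu` device; the layer shape is arbitrary:

```python
# Minimal sketch: an NF4 Linear4bit layer on a Gaudi accelerator ("hpu").
# Assumes the Gaudi PyTorch bridge (habana_frameworks) is installed.
import torch
import habana_frameworks.torch  # noqa: F401 -- registers the "hpu" device
import bitsandbytes as bnb

layer = bnb.nn.Linear4bit(
    4096, 4096,
    quant_type="nf4",               # Gaudi support covers the NF4 data type
    compute_dtype=torch.bfloat16,
)
layer = layer.to("hpu")             # weights are quantized on device transfer

x = torch.randn(1, 4096, dtype=torch.bfloat16, device="hpu")
out = layer(x)
```

This is the same `Linear4bit` module that the transformers integration swaps in for QLoRA-style loading.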
NVIDIA CUDA
- The 4bit dequantization kernel was improved by @Mhmd-Hisham in #1746. This change brings noticeable speed improvements for prefill, batched token generation, and training; the improvement is particularly prominent on A100, H100, and B200. A rough timing sketch follows this list.
- We've added CUDA 13.0 compatibility across Linux x86-64, Linux aarch64, and Windows x86-64 platforms.
- Hardware support for CUDA 13.0 is limited to Turing generation and newer.
- Support for Thor (SM110) is available in the Linux aarch64 build.
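To get a rough feel for the kernel improvement, the NF4 dequantize path can be timed directly with the public `bitsandbytes.functional` API; the matrix shape and iteration count below are arbitrary:

```python
# Rough timing sketch for NF4 dequantization on a CUDA device.
# quantize_4bit/dequantize_4bit are public bitsandbytes.functional APIs;
# the 4096x4096 shape and 100 iterations are arbitrary choices.
import torch
import bitsandbytes.functional as F

W = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
W4, state = F.quantize_4bit(W, quant_type="nf4")

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record()
for _ in range(100):
    F.dequantize_4bit(W4, state)
end.record()
torch.cuda.synchronize()
print(f"avg NF4 dequantize: {start.elapsed_time(end) / 100:.3f} ms")
```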
🚨 Breaking Changes
- Dropped support for PyTorch 2.2. The new minimum requirement is 2.3.0.
- Removed Maxwell GPU support for all CUDA builds.
What's Changed
- add py.typed by @cyyever in #1726
- Enable F841 by @cyyever in #1727
- add int mm for xpu after torch 2.9 by @jiqing-feng in #1736
- for intel xpu case, use MatMul8bitFp even not use ipex by @kaixuanliu in #1728
- 4bit quantization for arbitrary `nn.Parameter` by @matthewdouglas in #1720
- Adjust 4bit test tolerance on CPU for larger blocksizes by @matthewdouglas in #1749
- Test improvements by @matthewdouglas in #1750
- [XPU] Implemented 32bit optimizers in triton by @YangKai0616 in #1710
- Add SYCL Kernels for XPU backend by @xiaolil1 in #1679
- [XPU] Implemented 8bit optimizers in triton by @Egor-Krivov in #1692
- Drop Maxwell (sm50) build from distribution by @matthewdouglas in #1755
- Bump minimum PyTorch to 2.3 by @matthewdouglas in #1754
- [CUDA] Branchless NF4/FP4 kDequantizeBlockwise kernel for faster dequantization by @Mhmd-Hisham in #1746
- Update log by @YangKai0616 in #1758
- Add function to reverse 4bit weights for HPU by @vivekgoe in #1757
- Add CUDA 13.0 Support by @matthewdouglas in #1761
- Fix for warpSize deprecation in ROCm 7.0 by @pnunna93 in #1762
- Build/Package Intel XPU binary for Linux by @matthewdouglas in #1763
- Update workflow for packaging by @matthewdouglas in #1766
- Add Thor support by @jasl in #1764
- ROCm: Add 6.4 and 7.0 builds by @matthewdouglas in #1767
- Linear8bitLt: support device movement after forward() by @matthewdouglas in #1769
New Contributors
- @cyyever made their first contribution in #1726
- @kaixuanliu made their first contribution in #1728
- @YangKai0616 made their first contribution in #1710
- @xiaolil1 made their first contribution in #1679
- @vivekgoe made their first contribution in #1757
- @jasl made their first contribution in #1764
Full Changelog: 0.47.0...0.48.0
Latest `main` wheel
This pre-release contains the latest development wheels for all supported platforms, rebuilt automatically on every commit to the `main` branch.
How to install:
Pick the correct command for your platform and run it in your terminal:
Linux (ARM/aarch64)
pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_aarch64.whl
Linux (x86_64)
pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-manylinux_2_24_x86_64.whl
Windows (x86_64)
pip install --force-reinstall https://github.com/bitsandbytes-foundation/bitsandbytes/releases/download/continuous-release_main/bitsandbytes-1.33.7.preview-py3-none-win_amd64.whl
Note:
These wheels are updated automatically with every commit to `main` and become available as soon as the python-package.yml workflow finishes.
The version number is replaced with 1.33.7-preview to keep the link stable; this does not affect the installed version at all:
> pip install https://.../bitsandbytes-1.33.7-preview-py3-none-manylinux_2_24_x86_64.whl
Collecting bitsandbytes==1.33.7rc0
...
Successfully installed bitsandbytes-0.46.0.dev0
0.47.0
Highlights:
- FSDP2 compatibility for Params4bit (#1719)
- Bugfix for 4bit quantization with large block sizes (#1721)
- Further removal of previously deprecated code (#1669)
- Improved CPU coverage (#1628)
- Include NVIDIA Volta support in CUDA 12.8 and 12.9 builds (#1715)
What's Changed
- Enable CPU/XPU native and ipex path by @jiqing-feng in #1628
- Fix CI regression by @matthewdouglas in #1666
- Add CPU + IPEX to nightly CI by @matthewdouglas in #1667
- Fix params4bit passing bnb quantized by @mklabunde in #1665
- Deprecation cleanup by @matthewdouglas in #1669
- CI workflow: bump torch 2.7.0 to 2.7.1 by @matthewdouglas in #1670
- Improvement for torch.compile support on Params4bit by @matthewdouglas in #1673
- Fixed a bug in test_fw_bit_quant testing on CPU by @Egor-Krivov in #1675
- doc fix signature for 8-bit optim by @ved1beta in #1660
- Apply clang-format rules by @matthewdouglas in #1678
- Add clang-format by @matthewdouglas in #1677
- HPU (Intel gaudi) support for bnb unit tests by @ckvermaAI in #1680
- CI: Setup HPU nightly tests by @matthewdouglas in #1681
- Update test_kbit_backprop unit test by @ckvermaAI in #1682
- Update README.md by @matthewdouglas in #1684
- Enable ROCm backend with custom ops integration by @pnunna93 in #1683
- Fix AdamW documentation by @agupta2304 in #1686
- Make minor improvements to optimizer.py by @agupta2304 in #1687
- Add CUDA 12.9 build by @matthewdouglas in #1689
- CI: Test with PyTorch 2.8.0 RC by @matthewdouglas in #1693
- Automatically call CMake as part of PEP 517 build by @mgorny in #1512
- fix log by @jiqing-feng in #1697
- [XPU] Add inference benchmark for XPU by @Egor-Krivov in #1696
- Add kernel registration for 8bit and 32bit optimizers by @Egor-Krivov in #1706
- Create FUNDING.yml by @matthewdouglas in #1714
- Add Volta support in cu128/cu129 builds by @matthewdouglas in #1715
- Fix Params4bit tensor subclass handling by @ved1beta in #1719
- [CUDA] Fixing quantization uint8 packing bug for NF4 and FP4 by @Mhmd-Hisham in #1721
New Contributors
- @mklabunde made their first contribution in #1665
- @agupta2304 made their first contribution in #1686
- @mgorny made their first contribution in #1512
- @Mhmd-Hisham made their first contribution in #1721
Full Changelog: 0.46.0...0.47.0
0.46.1
What's Changed
- Fix params4bit passing bnb quantized by @mklabunde in #1665
- Improvement for torch.compile support on Params4bit by @matthewdouglas in #1673
- doc fix signature for 8-bit optim by @ved1beta in #1660
- Fix AdamW documentation by @agupta2304 in #1686
- Make minor improvements to optimizer.py by @agupta2304 in #1687
- Add CUDA 12.9 build by @matthewdouglas in #1689
- Automatically call CMake as part of PEP 517 build by @mgorny in #1512
New Contributors
- @mklabunde made their first contribution in #1665
- @agupta2304 made their first contribution in #1686
- @mgorny made their first contribution in #1512
Full Changelog: 0.46.0...0.46.1
0.46.0: torch.compile() support; custom ops refactor; Linux aarch64 wheels
Highlights
- Support for `torch.compile` without graph breaks for LLM.int8() (see the sketch after this list).
  - Compatible with PyTorch 2.4+, but PyTorch 2.6+ is recommended.
  - Experimental CPU support is included.
- Support for `torch.compile` without graph breaks for 4bit.
  - Compatible with PyTorch 2.4+ for `fullgraph=False`.
  - Requires PyTorch 2.8 nightly for `fullgraph=True`.
- We are now publishing wheels for CUDA Linux aarch64 (sbsa)!
  - Targets are Turing generation and newer: sm75, sm80, sm90, and sm100.
- PyTorch Custom Operators refactoring and integration:
  - We have refactored most of the library code to integrate better with PyTorch via the `torch.library` and custom ops APIs. This helps enable our `torch.compile` and additional hardware compatibility efforts.
  - End-users do not need to change the way they use `bitsandbytes`.
- Unit tests have been cleaned up for increased determinism and most are now device-agnostic.
- A new nightly CI runs unit tests for CPU (Windows x86-64, Linux x86-64/aarch64) and CUDA (Linux/Windows x86-64).
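As a minimal sketch of the LLM.int8() compile path highlighted above (the layer shape and input are arbitrary, and PyTorch 2.6+ is recommended):

```python
# Minimal sketch: compiling an LLM.int8() linear layer with torch.compile.
# Layer shape and input batch are arbitrary; PyTorch 2.6+ recommended.
import torch
import bitsandbytes as bnb

layer = bnb.nn.Linear8bitLt(4096, 4096, has_fp16_weights=False)
layer = layer.half().to("cuda")    # int8 quantization happens on device transfer

compiled = torch.compile(layer)    # compiles without graph breaks

x = torch.randn(8, 4096, device="cuda", dtype=torch.float16)
out = compiled(x)
```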
Compatibility Changes
- Support for Python 3.8 is dropped.
- Support for PyTorch < 2.2.0 is dropped.
- CUDA 12.6 and 12.8 builds are now compatible with `manylinux_2_24` (previously `manylinux_2_34`).
- Many APIs that were previously marked as deprecated have now been removed.
- New deprecations:
  - `bnb.autograd.get_inverse_transform_indices()`
  - `bnb.autograd.undo_layout()`
  - `bnb.functional.create_quantile_map()`
  - `bnb.functional.estimate_quantiles()`
  - `bnb.functional.get_colrow_absmax()`
  - `bnb.functional.get_row_absmax()`
  - `bnb.functional.histogram_scatter_add_2d()`
What's Changed
- PyTorch Custom Operator Integration by @matthewdouglas in #1544
- Bump CUDA 12.8.0 build to CUDA 12.8.1 by @matthewdouglas in #1575
- Drop Python 3.8 support. by @matthewdouglas in #1574
- Test cleanup by @matthewdouglas in #1576
- Fix: Return tuple in get_cuda_version_tuple by @DevKimbob in #1580
- Fix torch.compile issue for LLM.int8() with threshold=0 by @matthewdouglas in #1581
- fix for missing cpu lib by @Titus-von-Koeller in #1585
- Fix #1588 - torch compatability for <=2.4 by @matthewdouglas in #1590
- Add autoloading for backend packages by @matthewdouglas in #1593
- Specify blocksize by @cyr0930 in #1586
- fix typo getitem by @ved1beta in #1597
- fix: Improve CUDA version detection and error handling by @ved1beta in #1599
- Support LLM.int8() inference with torch.compile by @matthewdouglas in #1594
- Updates for device agnosticism by @matthewdouglas in #1601
- Stop building for CUDA toolkit < 11.8 by @matthewdouglas in #1605
- fix intel cpu/xpu installation by @jiqing-feng in #1613
- Support 4bit torch.compile fullgraph with PyTorch nightly by @matthewdouglas in #1616
- Improve torch.compile support for int8 with torch>=2.8 nightly by @matthewdouglas in #1617
- Add simple op implementations for CPU by @matthewdouglas in #1602
- Set up nightly CI for unit tests by @matthewdouglas in #1619
- point to correct latest continuous release main by @winglian in #1621
- ARM runners (faster than cross compilation qemu) by @johnnynunez in #1539
- Linux aarch64 CI updates by @matthewdouglas in #1622
- Moved int8_mm_dequant from CPU to default backend by @Egor-Krivov in #1626
- Refresh content for README.md by @matthewdouglas in #1620
- C lib loading: add fallback with sensible error msg by @Titus-von-Koeller in #1615
- Switch CUDA builds to use Rocky Linux 8 container by @matthewdouglas in #1638
- Improvements to test suite by @matthewdouglas in #1636
- Additional CI runners by @matthewdouglas in #1639
- CI runner updates by @matthewdouglas in #1643
- Optimizer backwards compatibility fix by @matthewdouglas in #1647
- General cleanup & test improvements by @matthewdouglas in #1646
- Add torch.compile tests by @matthewdouglas in #1648
- Documentation Cleanup by @matthewdouglas in #1644
- simplified non_sign_bits by @ved1beta in #1649
New Contributors
- @DevKimbob made their first contribution in #1580
- @cyr0930 made their first contribution in #1586
- @ved1beta made their first contribution in #1597
- @winglian made their first contribution in #1621
- @Egor-Krivov made their first contribution in #1626
Full Changelog: 0.45.4...0.46.0
Multi-Backend Preview
continuous-release_multi-backend-refactor: update compute_type_is_set attr (#1623)
0.45.5
This is a minor release that affects CPU-only usage of bitsandbytes. The CPU build of the library was inadvertently omitted from the v0.45.4 wheels.
Full Changelog: 0.45.4...0.45.5
0.45.4
This is a minor release that affects CPU-only usage of bitsandbytes. There is one bugfix and improved system compatibility on Linux.
What's Changed
- Build: use ubuntu-22.04 instead of 24.04 for CPU build (glibc compat) by @matthewdouglas in #1538
- Fix CPU dequantization to use nested dequantized scaling constant by @zyklotomic in #1549
New Contributors
- @zyklotomic made their first contribution in #1549
Full Changelog: 0.45.3...0.45.4
0.45.3
Overview
This is a small patch release containing a few bug fixes.
Additionally, this release contains a CUDA 12.8 build which adds the sm100 and sm120 targets for NVIDIA Blackwell GPUs.
What's Changed
- Fix #1490 by @matthewdouglas in #1496
- Blackwell binaries! by @johnnynunez in #1491
- Bug fix: Update create_dynamic_map to always return a float32 tensor by @mitchellgoffpc in #1521
- Update cuda versions in error messages by @FxMorin in #1520
- QuantState.to(): move code tensor with others to correct device by @matthewdouglas in #1528
- Installation doc updates by @matthewdouglas in #1529
New Contributors
- @mitchellgoffpc made their first contribution in #1521
- @FxMorin made their first contribution in #1520
Full Changelog: 0.45.2...0.45.3
0.45.2
This patch release fixes a compatibility issue with Triton 3.2 in PyTorch 2.6. When importing bitsandbytes without any GPUs visible in an environment with Triton installed, a RuntimeError may be raised:
RuntimeError: 0 active drivers ([]). There should only be one.
Full Changelog: 0.45.1...0.45.2