Lightweight codebase for training a Block Diffusion model from scratch in a few hours. Includes model definition, training loop, and inference script—designed as a foundation for further experimentation.
Requirements: CUDA 12.6 and an NVIDIA GPU with ≥24 GB VRAM. Works on consumer hardware (e.g. an RTX 3090) as well as server clusters (e.g. 8×H100).
Executive summary of how Block Diffusion differs from a GPT-style transformer:
- Block Diffusion: Sample a noise level per block of `block_size` tokens, independently mask tokens at that probability, then reconstruct the entire block from its context in one shot rather than token-by-token (see the sketch after this list).
- Attention Rules:
- Within a noisy block, tokens attend to one another fully.
- Noisy blocks attend to earlier clean blocks.
- Clean blocks follow standard causal masking.
- Loss Function: Apply cross-entropy only on masked tokens, weighting each block’s loss by the inverse of its noise level before averaging.
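The training step below is a minimal PyTorch sketch of how these three rules combine. The `model(x, attn_mask=...)` call, `mask_token_id`, and the concatenated [noisy | clean] input layout are illustrative assumptions rather than this repo's actual API; see `model.py` and `train.py` for the real implementation.

```python
import torch
import torch.nn.functional as F

def block_diffusion_loss(model, tokens, block_size=16, mask_token_id=0):
    """Sketch of one training step: noise every block, then reconstruct it from clean context."""
    B, T = tokens.shape
    device = tokens.device
    n_blocks = T // block_size
    blk = torch.arange(T, device=device) // block_size            # block index of each position
    pos = torch.arange(T, device=device)

    # 1) Sample a noise level per block and mask tokens independently at that probability.
    noise = torch.rand(B, n_blocks, device=device).clamp_min(1e-3)        # (B, n_blocks)
    p = noise.repeat_interleave(block_size, dim=1)                        # (B, T) per-token rate
    mask = torch.rand(B, T, device=device) < p
    noisy = torch.where(mask, torch.full_like(tokens, mask_token_id), tokens)

    # 2) Feed [noisy | clean] so each noisy block can condition on the clean prefix.
    x = torch.cat([noisy, tokens], dim=1)                                 # (B, 2T)

    # Attention rules (True = may attend):
    allow = torch.zeros(2 * T, 2 * T, dtype=torch.bool, device=device)
    allow[:T, :T] = blk[:, None] == blk[None, :]   # noisy tokens: full attention within their block
    allow[:T, T:] = blk[:, None] > blk[None, :]    # noisy tokens: attend to earlier clean blocks
    allow[T:, T:] = pos[:, None] >= pos[None, :]   # clean tokens: standard causal masking

    # 3) Reconstruct every block in one shot; keep predictions for the noisy half only.
    logits = model(x, attn_mask=allow)[:, :T]                             # (B, T, vocab)

    # 4) Cross-entropy on masked tokens only, each block weighted by 1 / noise level.
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         tokens.reshape(-1), reduction="none").view(B, T)
    return (ce * mask * (1.0 / p)).sum() / mask.sum().clamp_min(1)
```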
Commit dcda272 implements all core modifications required to transform nanoGPT into a Block Diffusion model.
Clone & install dependencies
git clone https://github.com/lapp0/nano-block-diffusion.git && cd nano-block-diffusion
pip install -r requirements.txt
pip install --pre torch==2.8.0.dev20250523+cu126 torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu126 --upgrade
Download training data
python data/cached_finewebedu10B.py 10 # download input data
Launch training on all GPUs
torchrun --standalone --nproc_per_node=$(nvidia-smi -L | wc -l) train.py
| Implementation | Train Tokens | Date | Note |
|---|---|---|---|
| Paper[1] | 65,000M | 05/17/2025 | Table 3, 30.60 PPL on LM1B |
| Nano Block Diffusion v0 | 1,311M | 07/09/2025 | Original release |
| New Record | | | |
PRs that improve the model's training performance are encouraged; Block Diffusion models are new and underexplored.
New Record Rules
- Parameter limit: Use ≤ 162M parameters (including embeddings); a quick check is sketched below these rules.
- Target: Achieve ≤ 3.44 cross-entropy loss on FineWebEdu validation set.
- Data: Must use the FineWeb-Edu dataset. Sample order is fixed. Samples cannot be repeated. Sample size per batch may vary.
- Objective: Must retain the same objective function and the 16-token denoising block.
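A quick way to sanity-check a candidate model against the parameter limit (a sketch; assumes the model is an ordinary `torch.nn.Module` with its embeddings registered as parameters):

```python
import torch

def count_params(model: torch.nn.Module) -> int:
    # Embeddings are ordinary parameters, so they are included in this count.
    return sum(p.numel() for p in model.parameters())

# Example usage (model construction omitted):
# assert count_params(model) <= 162_000_000, "over the 162M parameter limit"
```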
Inspired by Andrej Karpathy's nanoGPT. Optimizer and architectural enhancements borrowed from Muon and modded-nanogpt[2].
- [1]: Arriola, M., Gokaslan, A., Chiu, J.T., Yang, Z., Qi, Z., Han, J., Sahoo, S.S., Kuleshov, V. (2025). Block Diffusion: Interpolating Between Autoregressive and Diffusion Language Models. arXiv preprint arXiv:2503.09573.
- [2]: Keller Jordan et al. modded-nanogpt: Speedrunning the NanoGPT baseline.
- Note: many improvements in `model.py` are from modded-nanogpt[2]; however, comments attributing the discovering author have been removed to keep the codebase clean. To see who discovered each improvement, refer to modded-nanogpt's train_gpt.py.