- [2025.06.30] Fast-dLLM is now supported in LLaDA-V! This integration reduces inference latency from 60s to just 6s. Try it out here!
- [2025.05.29] We open-sourced the LLaDA-V model and its code.
- [2025.05.23] We have uploaded our paper to arXiv.
We introduce LLaDA-V, a competitive diffusion-based vision-language model that outperforms other diffusion-based MLLMs.
The LLaDA-V model is now available on Hugging Face Hub. To quickly test the model with a visual instruction demo, follow these simple steps:
- Clone the repository

  ```bash
  git clone https://github.com/ML-GSAI/LLaDA-V
  cd LLaDA-V/train
  ```

- Initialize the environment

  Run the environment setup script to install the necessary dependencies:

  ```bash
  bash init_env.sh
  ```

- Run the demo script

  Execute the demo script to test LLaDA-V on an example image:

  ```bash
  python generate_demo.py
  ```
This repository includes a complete training framework for LLaDA-V, following the LLaVA approach for visual instruction tuning.
As an example, we outline below the data preparation process for training LLaDA-V with the LLaVA-NeXT dataset. You need to prepare the following datasets:
- Download the LLaVA pretraining dataset from Hugging Face: https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain/tree/main
- Create the directory structure `train/data/llava_pretrain` and extract `images.zip` into the `images` subfolder.
- Ensure your `train/data/llava_pretrain` directory contains both the `images` folder and the `blip_laion_cc_sbu_558k.json` file.
- Download the LLaVA-NeXT dataset from Hugging Face: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data
- Process the LLaVA-NeXT dataset by following these steps:
  - Extract all tar.gz files (from `llava_next_raw_format_images_1.tar.gz` to `llava_next_raw_format_images_11.tar.gz`) from the `llava_next_raw_format` folder into `train/data/llava_next/images` (a scripted alternative is sketched after this list)
  - Move the `llava_next_raw_format_processed.json` file to `train/data/llava_next/`
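If you prefer to script the extraction of the eleven image archives, here is a minimal sketch using only the Python standard library. The location of the downloaded `llava_next_raw_format` folder is an assumption; point `raw_dir` at wherever you placed it.

```python
import tarfile
from pathlib import Path

# Assumed download location of the raw archives; adjust to your setup.
raw_dir = Path("train/data/llava_next_raw_format")
out_dir = Path("train/data/llava_next/images")
out_dir.mkdir(parents=True, exist_ok=True)

# llava_next_raw_format_images_1.tar.gz ... llava_next_raw_format_images_11.tar.gz
for i in range(1, 12):
    archive = raw_dir / f"llava_next_raw_format_images_{i}.tar.gz"
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(out_dir)
```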
In addition, if you want to reproduce the results of LLaDA-V, you need to prepare the following datasets:
- Download the MAmmoTH-VL dataset from Hugging Face: https://huggingface.co/datasets/MAmmoTH-VL/MAmmoTH-VL-Instruct-12M/
- Process the MAmmoTH-VL dataset by following these steps:
  - Extract the contents of the `multi_image_data` and `single_image_data` folders to `train/data/mammoth-vl/images`
  - Extract the contents of the `video_data` folder to `train/data/mammoth-vl/videos`
  - Move the `mammoth_si_10M.json` file to `train/data/mammoth-vl/mammoth_si_10M.json`
  - Move the `mammoth_ov_2M.json` file to `train/data/mammoth-vl/mammoth_ov_2M.json`
- Download TIGER-Lab/VisualWebInstruct from Hugging Face: https://huggingface.co/datasets/TIGER-Lab/VisualWebInstruct
- Process the TIGER-Lab/VisualWebInstruct dataset by following these steps:
  - Extract `images.zip` to `train/data/visualwebinstruct/images`
  - Convert the dataset from JSON Lines format (`mixed_conversation.jsonl`) to standard JSON format (`mixed_conversation.json`); a conversion sketch is given after this list
  - Move the `mixed_conversation.json` file to `train/data/visualwebinstruct/mixed_conversation.json`
- Create the mix dataset by running:

  ```bash
  python create_mix_data.py \
      --normal_data train/data/mammoth-vl/mammoth_ov_2M.json \
      --inference_data train/data/visualwebinstruct/mixed_conversation.json \
      --output_path train/data/mix_ov_2M_vw_reasoning.json
  ```
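For the JSONL-to-JSON conversion step above, the following is a minimal sketch using only the Python standard library. It assumes `mixed_conversation.jsonl` contains one JSON object per line and simply collects them into a single JSON array.

```python
import json

src = "train/data/visualwebinstruct/mixed_conversation.jsonl"
dst = "train/data/visualwebinstruct/mixed_conversation.json"

# Read one JSON object per line, skipping blank lines.
with open(src, "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f if line.strip()]

# Write the collected records as a single JSON array.
with open(dst, "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)
```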
- Download the pretrained LLaDA-8B-Instruct model from Hugging Face to the `train/model/LLaDA-8B-Instruct` directory: https://huggingface.co/GSAI-ML/LLaDA-8B-Instruct
- Convert the model checkpoint to Hugging Face format by running:

  ```bash
  python train/llada_v_prepare/rename_checkpoint.py \
      --source_dir train/model/LLaDA-8B-Instruct \
      --target_dir train/model/LLaDA-8B-Instruct-HF
  cp train/llada_v_prepare/files/* train/model/LLaDA-8B-Instruct-HF/
  ```

- Download the pretrained Siglip2 model from Hugging Face to the `train/model/siglip2-so400m-patch14-384` directory: https://huggingface.co/google/siglip2-so400m-patch14-384 (a download sketch follows this list)
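As an alternative to downloading the two checkpoints through the web UI or git-lfs, here is a short sketch using `huggingface_hub.snapshot_download` (assuming the `huggingface_hub` package is installed) to place them in the directories the training scripts expect.

```python
from huggingface_hub import snapshot_download

# Language backbone used to initialize LLaDA-V training.
snapshot_download(repo_id="GSAI-ML/LLaDA-8B-Instruct",
                  local_dir="train/model/LLaDA-8B-Instruct")

# Siglip2 vision tower.
snapshot_download(repo_id="google/siglip2-so400m-patch14-384",
                  local_dir="train/model/siglip2-so400m-patch14-384")
```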
Pretrain Script:

```bash
cd train && bash scripts/llada_v_pretrain.sh
```

Finetune Script:

```bash
cd train && bash scripts/train_ablation/llada_v_sft.sh
```

Pretrain Script:

```bash
cd train && bash scripts/llada_v_pretrain.sh
```

Stage 2 Script:

```bash
cd train && bash scripts/train_llada_v/llada_v_si_10M.sh
cd train && bash scripts/train_llada_v/llada_v_ov_2M.sh
```

Stage 3 Script:

```bash
cd train && bash scripts/train_llada_v/llada_v_vw.sh
cd train && bash scripts/train_llada_v/llada_v_mix_ov_vw.sh
```

Script:

```bash
cd train && bash scripts/llada_v_finetune.sh
```
Note: you need to set the paths for `data_path`, `image_folder`, and `video_folder` in `llada_v_finetune.sh`.
We provide the evaluation code in this repository, built on the lmms-eval library.
- Clone the repository

  ```bash
  git clone https://github.com/ML-GSAI/LLaDA-V
  cd LLaDA-V
  ```

- Initialize the environment

  Run the environment setup script to install the necessary dependencies:

  ```bash
  bash init_env.sh
  ```

- Run the evaluation script

  Execute the evaluation script to evaluate LLaDA-V:

  ```bash
  cd eval && bash scripts/evaluate.sh
  ```
If you have any questions, please feel free to contact us at [email protected].
The code is largely based on LLaVA-NeXT, MAmmoTH-VL, lmms-eval, and dLLM-cache. We thank the authors for their great work.
We are also very grateful to Chengyue for helping us adapt Fast-dLLM, which significantly accelerates the generation process.
Feel free to scan the WeChat QR code below to participate in the discussion and stay updated with the latest progress.
```bibtex
@article{you2025llada,
  title={LLaDA-V: Large Language Diffusion Models with Visual Instruction Tuning},
  author={You, Zebin and Nie, Shen and Zhang, Xiaolu and Hu, Jun and Zhou, Jun and Lu, Zhiwu and Wen, Ji-Rong and Li, Chongxuan},
  journal={arXiv preprint arXiv:2505.16933},
  year={2025}
}
```