Yunfei Li2, Siliang Tang1, Jun Xiao1, Fei Wu1, Hang Zhao2, Yueting Zhuang1
1Zhejiang University, 2Ant Group
*Equal Contribution, ‡Project Leader, †Corresponding Authors
- [July 23, 2025] We have released the checkpoint and training data of Janus-Pro-R1.
- [June 18, 2025] We have released the training and inference scripts of Janus-Pro-R1.
- [June 2, 2025] Our paper is now available on arXiv: Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation.
- Release the paper
- Release training scripts
- Release inference scripts
- Release training data
- Release Janus-Pro-R1 checkpoint
We propose a two-stage training paradigm to enable introspective text-to-image generation via genuine reasoning chains (CoT), unlocking what we call Aha Moments in visual generation:
- Stage 1 – Supervised Fine-Tuning (SFT): The model learns structured visual reasoning through three subtasks:
  - Text-to-image generation
  - Image-text consistency self-evaluation
  - Image regeneration through reflection
- Stage 2 – Reinforcement Learning (RL): The model is trained using a token-level Markov decision process with bi-level QA-based rewards to encourage spontaneous reasoning and correction, optimizing via GRPO.
With self-reflective capabilities, this approach bridges the gap between text-to-image generation and image editing, enabling a unified and coherent visual reasoning process.
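As a rough illustration of the group-relative idea behind GRPO in Stage 2, the sketch below normalizes rewards within one group of samples drawn for the same prompt. It is a minimal sketch, not the training code in this repo: the function name, the `eps` term, the choice of population standard deviation, and the reward values are all assumptions made for illustration, and the bi-level QA-based reward itself is not modeled here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    that beat their group average get a positive advantage.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std of the group (an assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four images sampled from one prompt.
rewards = [1.0, 0.5, 0.0, 0.5]
advantages = group_relative_advantages(rewards)
print(advantages)
```

When every sample in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason rewards that discriminate between samples matter.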
1. Prepare Environment
We recommend using Python>=3.10 and setting up a virtual environment:
# clone our repo
git clone https://github.com/wendell0218/Janus-Pro-R1.git
cd Janus-Pro-R1
# prepare python environment for sft
conda create -n janus-pro-r1-sft python=3.11
conda activate janus-pro-r1-sft
pip install -r requirements-sft.txt
# prepare python environment for rl
conda create -n janus-pro-r1-rl python=3.11
conda activate janus-pro-r1-rl
pip install -r requirements-rl.txt
2. Prepare Pretrained Model
Janus-Pro-R1 utilizes Janus-Pro-7B as the pretrained model for subsequent supervised fine-tuning. You can download the corresponding model using the following command:
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/deepseek-ai/Janus-Pro-7B
cd Janus-Pro-7B
git lfs pull
The SFT training data for introspective text-to-image generation is released at https://huggingface.co/datasets/midbee/Janus-Pro-R1-Data.
You can perform SFT for Text-to-Image Generation using the following command:
cd janus-sft
python launch.py --args_yml_fn configs/t2i_generation.yml
Additionally, you can use the following command to perform SFT for Image Editing:
cd janus-sft
python launch.py --args_yml_fn configs/image_editing.yml
For a more detailed introduction to the Supervised Fine-Tuning stage, please refer to here.
You can perform RL for Text-to-Image Generation using the following command:
cd janus-rl/src/open_r1
export ACCELERATE_CONFIG=../../recipes/accelerate_configs/zero2.yaml
export GRPO_CONFIG=../../recipes/t2i_generation/grpo.yml
export NUM_PROCESSES=8
accelerate launch \
  --config_file $ACCELERATE_CONFIG \
  --num_processes $NUM_PROCESSES \
  grpo_t2i.py \
  --config $GRPO_CONFIG
Additionally, you can use the following command for RL on Image Editing:
cd janus-rl/src/open_r1
export ACCELERATE_CONFIG=../../recipes/accelerate_configs/zero2.yaml
export GRPO_CONFIG=../../recipes/image_editing/grpo.yml
export NUM_PROCESSES=8
accelerate launch \
  --config_file $ACCELERATE_CONFIG \
  --num_processes $NUM_PROCESSES \
  grpo_editing.py \
  --config $GRPO_CONFIG
For a more detailed introduction to the Reinforcement Learning stage, please refer to here.
We illustrate the inference process of introspective text-to-image generation under the simplest scenario, where the model performs a one-time image self-evaluation and image regeneration after the initial text-to-image generation.
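The control flow of this simplest scenario can be sketched as a single generate, self-evaluate, regenerate pass. This is a minimal sketch under stated assumptions, not the logic of inference/inference.py: `introspective_generate` and the three callables are hypothetical names, the stubs return strings instead of images, and regeneration is assumed to run unconditionally after the self-evaluation.

```python
from typing import Callable, Dict


def introspective_generate(
    caption: str,
    generate: Callable[[str], str],
    evaluate: Callable[[str, str], str],
    regenerate: Callable[[str, str, str], str],
) -> Dict[str, str]:
    """Run one round of generate -> self-evaluate -> regenerate."""
    image = generate(caption)                   # initial text-to-image generation
    reason = evaluate(caption, image)           # image-text consistency self-evaluation
    regen = regenerate(caption, image, reason)  # regeneration through reflection
    return {"image": image, "reason": reason, "regenerated": regen}


# Hypothetical stand-ins for the model calls; strings play the role of images.
def stub_generate(caption: str) -> str:
    return f"image({caption})"


def stub_evaluate(caption: str, image: str) -> str:
    return f"checked {image} against '{caption}'"


def stub_regenerate(caption: str, image: str, reason: str) -> str:
    return f"regen({image})"


result = introspective_generate(
    "a brown giraffe and a white stop sign",
    stub_generate,
    stub_evaluate,
    stub_regenerate,
)
print(result)
```

Passing the three steps in as callables keeps the loop independent of how the model is loaded, so the same skeleton could be extended to several reflection rounds.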
First, prepare the Janus-Pro-R1-7B model, which uses Janus-Pro-7B as its backbone. You can download it from 🤗 https://huggingface.co/midbee/Janus-Pro-R1-7B.
You can conduct the inference process using the following command. model_path refers to the local path where you have downloaded Janus-Pro-R1-7B.
  python inference/inference.py \
      --model_path $CKPT_PATH \
      --caption "a brown giraffe and a white stop sign" \
      --gen_path "results/samples" \
      --reason_path "results/reason.jsonl" \
      --regen_path "results/regen_samples" \
      --cfg 5.0 \
      --parallel_size 4
After completing the inference, the structure of the results directory will be as follows:
results/
├── reason.jsonl
├── samples/
│   ├── 0000.png
│   ├── 0001.png
│   ├── 0002.png
│   └── 0003.png
└── regen_samples/
    ├── 0000.png
    ├── 0001.png
    ├── 0002.png
    └── 0003.png
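To compare initial and regenerated images, the outputs can be collected with a few lines of Python. This is a minimal sketch: `load_results` is a hypothetical helper, it assumes that files with the same name in samples/ and regen_samples/ belong to the same sample, and it returns the reason.jsonl records as parsed JSON without interpreting their fields, since the record schema is not described here.

```python
import json
from pathlib import Path


def load_results(results_dir):
    """Return (reason records, [(sample_path, regen_path or None), ...])."""
    root = Path(results_dir)

    # reason.jsonl holds one JSON object per line; blank lines are skipped.
    with open(root / "reason.jsonl", encoding="utf-8") as f:
        reasons = [json.loads(line) for line in f if line.strip()]

    # Pair each initial sample with the regenerated image of the same name
    # (the name-based pairing is an assumption drawn from the layout above).
    pairs = []
    for sample in sorted((root / "samples").glob("*.png")):
        regen = root / "regen_samples" / sample.name
        pairs.append((sample, regen if regen.exists() else None))
    return reasons, pairs
```

For example, `reasons, pairs = load_results("results")` would give four pairs for the directory shown above, ready to be displayed side by side.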
For a more detailed introduction to inference, please refer to here.
Our project is developed based on the following repositories:
- Janus-Series: Unified Multimodal Understanding and Generation Models
- Open-R1: Fully open reproduction of DeepSeek-R1
If you find this work useful for your research, please cite our paper and star our git repo:
@article{pan2025unlocking,
  title={Unlocking Aha Moments via Reinforcement Learning: Advancing Collaborative Visual Comprehension and Generation},
  author={Pan, Kaihang and Wu, Yang and Bu, Wendong and Shen, Kai and Li, Juncheng and Wang, Yingting and Li, Yunfei and Tang, Siliang and Xiao, Jun and Wu, Fei and others},
  journal={arXiv preprint arXiv:2506.01480},
  year={2025}
}
@article{pan2025focusdiff,
  title={FocusDiff: Advancing Fine-Grained Text-Image Alignment for Autoregressive Visual Generation through RL},
  author={Pan, Kaihang and Bu, Wendong and Wu, Yuruo and Wu, Yang and Shen, Kai and Li, Yunfei and Zhao, Hang and Li, Juncheng and Tang, Siliang and Zhuang, Yueting},
  journal={arXiv preprint arXiv:2506.05501},
  year={2025}
}




