[NIPS2025] VideoChat-R1 & R1.5: Enhancing Spatio-Temporal Perception and Reasoning via Reinforcement Fine-Tuning

OpenGVLab/VideoChat-R1

VideoChat-R1 & VideoChat-R1.5: Spatio-Temporal RL for Video Perception and Reasoning

🔥 Updates

  • 2025/09/26: 🔥🔥🔥 We release the VideoChat-R1.5 model on Hugging Face, along with the paper and eval code.
  • 2025/09/22: 🎉🎉🎉 VideoChat-R1.5 is accepted to NIPS2025.
  • 2025/04/22: 🔥🔥🔥 We release VideoChat-R1-caption on Hugging Face.
  • 2025/04/14: 🔥🔥🔥 We release VideoChat-R1 and VideoChat-R1-thinking on Hugging Face.
  • 2025/04/10: 🔥🔥🔥 We release the VideoChat-R1 paper and code.

🎯 Performances on Video Benchmarks

The model delivers consistently stronger results across short-form and long-form video understanding, temporal grounding, video reasoning, and spatio-temporal perception.

🦜 Introduction

We adopt multi-task joint RL to strengthen the model’s spatio-temporal perception and video reasoning capabilities.

During inference, we use a Region of Interest (ROI) strategy that lets the model progressively narrow down to the video interval of interest. With this multi-step perception, model performance improves as the number of perception steps increases.
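The multi-step narrowing idea can be illustrated with a minimal Python sketch. This is not the repository's inference code: `iterative_perception`, `propose_interval`, and `toy_proposer` are hypothetical names, and the toy proposer simply keeps the middle half of the window in place of a real model call. The sketch only shows the control flow of repeatedly re-perceiving a shrinking interval.

```python
from typing import Callable, List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec)


def iterative_perception(
    duration: float,
    propose_interval: Callable[[Interval], Interval],
    num_steps: int = 3,
) -> List[Interval]:
    """Narrow a video interval over several perception steps.

    `propose_interval` is a hypothetical stand-in for a model call that,
    given the window it currently sees, returns the sub-interval of interest.
    """
    current: Interval = (0.0, duration)
    history = [current]
    for _ in range(num_steps):
        start, end = propose_interval(current)
        # Clamp the proposal into the current window so the search only narrows.
        start = max(current[0], min(start, current[1]))
        end = max(start, min(end, current[1]))
        if (start, end) == current:
            break  # no further narrowing; stop early
        current = (start, end)
        history.append(current)
    return history


def toy_proposer(window: Interval) -> Interval:
    """Toy assumption: the interesting part is the middle half of the window."""
    start, end = window
    quarter = (end - start) / 4.0
    return (start + quarter, end - quarter)


history = iterative_perception(120.0, toy_proposer, num_steps=3)
for step, (start, end) in enumerate(history):
    print(f"step {step}: [{start:.1f}s, {end:.1f}s]")
```

In a real setup, the proposer would re-sample frames from the current window and query the model, so each step sees the region of interest at a higher effective temporal resolution.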

Demo & Inference

Refer to the hf README to run inference with our model.

Evaluation

See eval_scripts and lmms-eval_videochat.

Training

See training_scripts.

📄 Citation

If you find this project useful in your research, please consider citing:

@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}

@article{yan2025videochatr15,
  title={VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception},
  author={Yan, Ziang and Li, Xinhao and He, Yinan and Yue, Zhengrong and Zeng, Xiangyu and Wang, Yali and Qiao, Yu and Wang, Limin and Wang, Yi},
  journal={arXiv preprint arXiv:2509.21100},
  year={2025}
}
