- 2025/09/26: 🔥🔥🔥 We release our VideoChat-R1.5 model on Hugging Face, along with the paper and evaluation code.
- 2025/09/22: 🎉🎉🎉 Our VideoChat-R1.5 is accepted by NeurIPS 2025.
- 2025/04/22: 🔥🔥🔥 We release our VideoChat-R1-caption on Hugging Face.
- 2025/04/14: 🔥🔥🔥 We release our VideoChat-R1 and VideoChat-R1-thinking on Hugging Face.
- 2025/04/10: 🔥🔥🔥 We release our VideoChat-R1 paper and code.
Across short- and long-form video understanding, temporal grounding, video reasoning, and spatio-temporal perception, the model delivers consistently stronger results.
We adopt multi-task joint RL to strengthen the model’s spatio-temporal perception and video reasoning capabilities.
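As a rough illustration of what a verifiable reward for the temporal grounding task could look like in such an RL setup, here is a minimal sketch combining a format bonus with a temporal IoU term. The `<answer>start-end</answer>` output format and the reward weighting are assumptions for illustration, not the exact recipe used in our training code (see `training_scripts` for that):

```python
import re

def temporal_iou(pred, gt):
    """IoU between two time intervals (start, end), in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = max(pred[1], gt[1]) - min(pred[0], gt[0])
    return inter / union if union > 0 else 0.0

def grounding_reward(completion, gt_span):
    """Hypothetical per-sample reward: format bonus + temporal IoU.
    Assumes the model is prompted to answer as '<answer>start-end</answer>'."""
    m = re.search(r"<answer>\s*([\d.]+)\s*-\s*([\d.]+)\s*</answer>", completion)
    if m is None:
        return 0.0  # malformed output: no format bonus, no IoU term
    pred = (float(m.group(1)), float(m.group(2)))
    return 1.0 + temporal_iou(pred, gt_span)
```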
During inference, we use a Region-of-Interest strategy that lets the model progressively narrow down the video interval it should attend to. With this multi-step perception, performance improves as the number of perception steps increases.
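The sketch below shows the general shape of this iterative perception loop, assuming the model is asked for a `start-end` time span at each step; it is a simplified illustration rather than the exact VideoChat-R1.5 inference code, and `ask_model` is a user-supplied wrapper around the actual model call:

```python
import re
from typing import Callable, List, Tuple

def parse_interval(reply: str, default: Tuple[float, float]) -> Tuple[float, float]:
    """Extract a 'start-end' interval (seconds) from the model's reply;
    keep the previous interval if parsing fails. The textual format is an assumption."""
    m = re.search(r"([\d.]+)\s*-\s*([\d.]+)", reply)
    if not m:
        return default
    s, e = float(m.group(1)), float(m.group(2))
    return (s, e) if e > s else default

def multi_step_roi(frames: List, fps: float, question: str,
                   ask_model: Callable[[List, str], str], num_steps: int = 3) -> str:
    """Iteratively narrow the frame window to the model's region of interest,
    then answer the question from the refined window."""
    start, end = 0.0, len(frames) / fps  # current interval in seconds
    for _ in range(num_steps):
        window = frames[int(start * fps): int(end * fps)]
        reply = ask_model(window, f"Which time span (start-end, in seconds) of this clip "
                                  f"is most relevant to: {question}")
        start, end = parse_interval(reply, default=(start, end))
    return ask_model(frames[int(start * fps): int(end * fps)], question)
```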
Refer to the hf README to run inference with our model.
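Since VideoChat-R1 builds on Qwen2.5-VL, inference can follow the standard Qwen2.5-VL recipe in `transformers`. The sketch below is a minimal example; the checkpoint name, video path, and prompt are placeholders, and the hf README is authoritative:

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model_id = "OpenGVLab/VideoChat-R1_7B"  # assumed checkpoint name; check the hf README
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/video.mp4", "max_pixels": 360 * 420},
        {"type": "text", "text": "When does the person open the door? Answer with a time span."},
    ],
}]

# Build the chat prompt and pack the video frames into model inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=512)
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```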
See eval_scripts and lmms-eval_videochat.
See training_scripts.
If you find this project useful in your research, please consider citing:
@article{li2025videochatr1,
  title={VideoChat-R1: Enhancing Spatio-Temporal Perception via Reinforcement Fine-Tuning},
  author={Li, Xinhao and Yan, Ziang and Meng, Desen and Dong, Lu and Zeng, Xiangyu and He, Yinan and Wang, Yali and Qiao, Yu and Wang, Yi and Wang, Limin},
  journal={arXiv preprint arXiv:2504.06958},
  year={2025}
}

@article{yan2025videochatr15,
  title={VideoChat-R1.5: Visual Test-Time Scaling to Reinforce Multimodal Reasoning by Iterative Perception},
  author={Yan, Ziang and Li, Xinhao and He, Yinan and Yue, Zhengrong and Zeng, Xiangyu and Wang, Yali and Qiao, Yu and Wang, Limin and Wang, Yi},
  journal={arXiv preprint arXiv:2509.21100},
  year={2025}
}