Jialv Zou1 *, Bencheng Liao1,2 *, Qian Zhang3, Wenyu Liu1, Xinggang Wang1 📧
1 School of EIC, HUST, 2 Institute of Artificial Intelligence, HUST, 3 Horizon Robotics
(*) equal contribution, (📧) corresponding author.
ArXiv Preprint (arXiv 2403.08760)
- May. 22nd, 2025: The full code is released.
- Apr. 25th, 2025: MIM4D is accepted to IJCV 2025!
- Mar. 14th, 2024: We released our paper on arXiv. Code/Models are coming soon. Please stay tuned! ☕️
Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D annotations, limiting scalability, or focus on single-frame or monocular inputs, neglecting temporal information. We propose MIM4D, a novel pre-training paradigm based on dual masked image modeling (MIM). MIM4D leverages both spatial and temporal relations by training on masked multi-view video inputs. It constructs pseudo-3D features using continuous scene flow and projects them onto the 2D plane for supervision. To address the lack of dense 3D supervision, MIM4D reconstructs pixels by employing 3D volumetric differentiable rendering to learn geometric representations. We demonstrate that MIM4D achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving. It significantly improves existing methods on multiple downstream tasks, including end-to-end planning (9% collision decrease), BEV segmentation (8.7% IoU), 3D object detection (3.5% mAP), and HD map construction (1.4% mAP). Our work offers a new choice for learning representations at scale in autonomous driving.
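To make the masked-reconstruction idea concrete, below is a minimal, illustrative PyTorch sketch: it masks random patches of multi-view video frames and reconstructs the masked pixels. It is not the actual MIM4D model; the ViT-style encoder, patch size, and masked-patch MSE loss are placeholder assumptions standing in for the scene-flow-based voxel features and volumetric-rendering supervision described above.

```python
# Toy sketch of masked multi-view video reconstruction (NOT the actual MIM4D model).
# Assumptions: a plain ViT-style patch encoder replaces the image backbone, and a
# masked-patch MSE loss replaces the rendering-based supervision described above.
import torch
import torch.nn as nn

class ToyMaskedReconstructor(nn.Module):
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decode = nn.Linear(dim, patch * patch * 3)                    # pixel head

    def forward(self, frames, mask_ratio=0.5):
        # frames: (B, T, V, 3, H, W) -- batch, time steps, camera views
        B, T, V, C, H, W = frames.shape
        x = frames.flatten(0, 2)                              # every view/frame as an image
        tokens = self.embed(x).flatten(2).transpose(1, 2)     # (B*T*V, N, dim)
        N = tokens.shape[1]
        mask = torch.rand(tokens.shape[0], N, device=x.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        feats = self.encoder(tokens)
        pred = self.decode(feats)                              # predicted pixels per patch
        # ground-truth pixels per patch, row-major over the patch grid
        tgt = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        tgt = tgt.permute(0, 2, 3, 1, 4, 5).reshape(x.shape[0], N, -1)
        loss = ((pred - tgt) ** 2)[mask].mean()                # supervise masked patches only
        return loss

model = ToyMaskedReconstructor()
loss = model(torch.randn(1, 2, 6, 3, 64, 64))                 # 2 frames x 6 surround cameras
loss.backward()
```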
```shell
conda create -n mim4d python=3.8
conda activate mim4d
conda install -y pytorch==1.13.0 torchvision==0.14.0 cudatoolkit=11.7 -c pytorch
pip install mmcv-full==1.7.1
pip install mmdet==2.28.2 mmsegmentation==0.30.0 tifffile==2023.7.10 numpy==1.19.5 protobuf==4.25.2 scikit-image==0.19.0 pycocotools==2.0.7 nuscenes-devkit==1.1.10 gpustat numba scipy pandas matplotlib Cython shapely loguru tqdm future fire yacs jupyterlab pybind11 tensorboardX tensorboard easydict pyyaml open3d addict pyquaternion awscli timm typing-extensions==4.7.1
```
```shell
git clone [email protected]:hustvl/MIM4D.git
cd MIM4D
python setup.py develop --user
```
Please follow the instructions of UVTR to prepare the dataset.
You can train the model following the instructions below. By modifying the bash files below, you can conduct experiments with different settings. You can also find the pretrained models here (see the checkpoint-loading sketch after the train/test commands).
```shell
# train
bash ./extra_tools/dist_train_ssl.sh
# test
bash ./extra_tools/dist_test_ssl.sh
```
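If you want to warm-start a downstream image backbone from one of the released checkpoints below, a hypothetical loading sketch is shown here. The checkpoint file name, the `state_dict` wrapper, and the `img_backbone.` key prefix are assumptions; inspect the actual checkpoint keys before relying on them.

```python
# Hypothetical sketch: warm-starting a downstream image backbone from a released
# MIM4D checkpoint. The file name, the "state_dict" wrapper, and the
# "img_backbone." key prefix are assumptions -- inspect the checkpoint first.
import torch
import torchvision

ckpt = torch.load("uvtrs_mim4d_vs0.075.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)

# keep only image-backbone weights and strip their (assumed) prefix
backbone_state = {k[len("img_backbone."):]: v
                  for k, v in state.items() if k.startswith("img_backbone.")}

backbone = torchvision.models.resnet50()   # stand-in for your detector's image encoder
missing, unexpected = backbone.load_state_dict(backbone_state, strict=False)
print(f"loaded {len(backbone_state)} tensors, "
      f"{len(missing)} missing, {len(unexpected)} unexpected")
```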
| Config | NDS | mAP | Model |
|---|---|---|---|
| uvtrs_mim4d_vs0.1 | 32.6 | 32.3 | pretrain/ckpt |
| uvtrs_mim4d_vs0.075 | 47.0 | 41.4 | pretrain/ckpt |
| Config | L2 (m) | Col. (%) | Model |
|---|---|---|---|
| VAD_tiny | 0.71 | 0.29 | ckpt |
We build our project based on open-source works such as UVTR and VAD. Thanks for their great work.
If you find MIM4D useful in your research or applications, please consider giving us a star 🌟 and citing it by the following BibTeX entry.
@article{zou2025mim4d,
  title={{MIM4D}: Masked modeling with multi-view video for autonomous driving representation learning},
author={Zou, Jialv and Liao, Bencheng and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
journal={International Journal of Computer Vision},
pages={1--14},
year={2025},
publisher={Springer}
}