Jialv Zou1 *, Bencheng Liao1,2 *, Qian Zhang3, Wenyu Liu1, Xinggang Wang1 📧
1 School of EIC, HUST, 2 Institute of Artificial Intelligence, HUST, 3 Horizon Robotics
(*) equal contribution, (📧) corresponding author.
ArXiv Preprint (arXiv 2403.08760)
- May. 22nd, 2025: The full code is released.
- Apr. 25th, 2025: MIM4D is accepted to IJCV 2025!
- Mar. 14th, 2024: We released our paper on arXiv. Code/Models are coming soon. Please stay tuned! ☕️
Learning robust and scalable visual representations from massive multi-view video data remains a challenge in computer vision and autonomous driving. Existing pre-training methods either rely on expensive supervised learning with 3D annotations, limiting scalability, or focus on single-frame or monocular inputs, neglecting temporal information. We propose MIM4D, a novel pre-training paradigm based on dual masked image modeling (MIM). MIM4D leverages both spatial and temporal relations by training on masked multi-view video inputs. It constructs pseudo-3D features using continuous scene flow and projects them onto the 2D plane for supervision. To address the lack of dense 3D supervision, MIM4D reconstructs pixels by employing 3D volumetric differentiable rendering to learn geometric representations. We demonstrate that MIM4D achieves state-of-the-art performance on the nuScenes dataset for visual representation learning in autonomous driving. It significantly improves existing methods on multiple downstream tasks, including end-to-end planning (9% collision decrease), BEV segmentation (8.7% IoU), 3D object detection (3.5% mAP), and HD map construction (1.4% mAP). Our work offers a new choice for learning representations at scale in autonomous driving.
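To make the masked-reconstruction idea concrete, below is a minimal, illustrative PyTorch sketch: it masks random patches of multi-view video frames and reconstructs the masked pixels. It is not the actual MIM4D model; the ViT-style encoder, patch size, and masked-patch MSE loss are placeholder assumptions standing in for the scene-flow-based voxel features and volumetric-rendering supervision described above.

```python
# Toy sketch of masked multi-view video reconstruction (NOT the actual MIM4D model).
# Assumptions: a plain ViT-style patch encoder replaces the image backbone, and a
# masked-patch MSE loss replaces the rendering-based supervision described above.
import torch
import torch.nn as nn

class ToyMaskedReconstructor(nn.Module):
    def __init__(self, patch=16, dim=256):
        super().__init__()
        self.patch = patch
        self.embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)   # patchify
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.decode = nn.Linear(dim, patch * patch * 3)                    # pixel head

    def forward(self, frames, mask_ratio=0.5):
        # frames: (B, T, V, 3, H, W) -- batch, time steps, camera views
        B, T, V, C, H, W = frames.shape
        x = frames.flatten(0, 2)                              # every view/frame as an image
        tokens = self.embed(x).flatten(2).transpose(1, 2)     # (B*T*V, N, dim)
        N = tokens.shape[1]
        mask = torch.rand(tokens.shape[0], N, device=x.device) < mask_ratio
        tokens = torch.where(mask.unsqueeze(-1), self.mask_token.expand_as(tokens), tokens)
        feats = self.encoder(tokens)
        pred = self.decode(feats)                              # predicted pixels per patch
        # ground-truth pixels per patch, row-major over the patch grid
        tgt = x.unfold(2, self.patch, self.patch).unfold(3, self.patch, self.patch)
        tgt = tgt.permute(0, 2, 3, 1, 4, 5).reshape(x.shape[0], N, -1)
        loss = ((pred - tgt) ** 2)[mask].mean()                # supervise masked patches only
        return loss

model = ToyMaskedReconstructor()
loss = model(torch.randn(1, 2, 6, 3, 64, 64))                 # 2 frames x 6 surround cameras
loss.backward()
```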
```shell
conda create -n mim4d python=3.8
conda activate mim4d
conda install -y pytorch==1.13.0 torchvision==0.14.0 cudatoolkit=11.7 -c pytorch
pip install mmcv-full==1.7.1
pip install mmdet==2.28.2 mmsegmentation==0.30.0 tifffile==2023.7.10 numpy==1.19.5 protobuf==4.25.2 scikit-image==0.19.0 pycocotools==2.0.7 nuscenes-devkit==1.1.10 gpustat numba scipy pandas matplotlib Cython shapely loguru tqdm future fire yacs jupyterlab pybind11 tensorboardX tensorboard easydict pyyaml open3d addict pyquaternion awscli timm typing-extensions==4.7.1
```
```shell
git clone [email protected]:hustvl/MIM4D.git
cd MIM4D
python setup.py develop --user
```
Please follow the instructions of UVTR to prepare the dataset.
You can train the model following the instructions below. By modifying the bash files below, you can conduct experiments with different settings. You can also find the pretrained models here (see the checkpoint-loading sketch after the train/test commands).
```shell
# train
bash ./extra_tools/dist_train_ssl.sh
# test
bash ./extra_tools/dist_test_ssl.sh
```
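If you want to warm-start a downstream image backbone from one of the released checkpoints below, a hypothetical loading sketch is shown here. The checkpoint file name, the `state_dict` wrapper, and the `img_backbone.` key prefix are assumptions; inspect the actual checkpoint keys before relying on them.

```python
# Hypothetical sketch: warm-starting a downstream image backbone from a released
# MIM4D checkpoint. The file name, the "state_dict" wrapper, and the
# "img_backbone." key prefix are assumptions -- inspect the checkpoint first.
import torch
import torchvision

ckpt = torch.load("uvtrs_mim4d_vs0.075.pth", map_location="cpu")
state = ckpt.get("state_dict", ckpt)

# keep only image-backbone weights and strip their (assumed) prefix
backbone_state = {k[len("img_backbone."):]: v
                  for k, v in state.items() if k.startswith("img_backbone.")}

backbone = torchvision.models.resnet50()   # stand-in for your detector's image encoder
missing, unexpected = backbone.load_state_dict(backbone_state, strict=False)
print(f"loaded {len(backbone_state)} tensors, "
      f"{len(missing)} missing, {len(unexpected)} unexpected")
```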
| Config | NDS | mAP | Model |
|---|---|---|---|
| uvtrs_mim4d_vs0.1 | 32.6 | 32.3 | pretrain/ckpt |
| uvtrs_mim4d_vs0.075 | 47.0 | 41.4 | pretrain/ckpt |
| Config | L2 (m) | Col. (%) | Model |
|---|---|---|---|
| VAD_tiny | 0.71 | 0.29 | ckpt |
We build our project based on open-source works such as UVTR and VAD. Thanks for their great work.
If you find MIM4D useful in your research or applications, please consider giving us a star 🌟 and citing it by the following BibTeX entry.
@article{zou2025mim4d,
  title={{MIM4D}: Masked modeling with multi-view video for autonomous driving representation learning},
author={Zou, Jialv and Liao, Bencheng and Zhang, Qian and Liu, Wenyu and Wang, Xinggang},
journal={International Journal of Computer Vision},
pages={1--14},
year={2025},
publisher={Springer}
}