Kaiwen Zhang*,
Zhenyu Tang*,
Xiaotao Hu,
Xingang Pan,
Xiaoyang Guo,
Yuan Liu,
Jingwei Huang,
Li Yuan,
Qian Zhang,
Xiaoxiao Long✝,
Xun Cao,
Wei Yin§
*Equal Contribution ✝Project Adviser §Project Lead, Corresponding Author
Versatile capabilities of Epona: given historical driving context, Epona can generate consistent, minutes-long driving videos at high resolution (A). It can be controlled by diverse trajectories (B) and understands real-world traffic knowledge (C). In addition, our world model can predict future trajectories and serve as an end-to-end real-time motion planner (D).
```bash
conda create -n epona python=3.10
conda activate epona
pip install -r requirements.txt
```
To run the code with CUDA properly, you can comment out `torch` and `torchvision` in `requirements.txt` and instead install versions matching your CUDA toolkit (`torch>=2.1.0+cu121` and `torchvision>=0.16.0+cu121`) following the instructions on the PyTorch website.
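For example, assuming a CUDA 12.1 toolchain, the matching wheels can be installed directly from the PyTorch package index:
```bash
# Install CUDA 12.1 builds of torch/torchvision; adjust versions and the
# index URL to your CUDA toolkit per the official PyTorch instructions.
pip install torch==2.1.0 torchvision==0.16.0 --index-url https://download.pytorch.org/whl/cu121
```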
Please refer to data preparation for details on preparing and preprocessing the data.
After preprocessing, change the `datasets_paths` field in the config files (under the configs folder) to your own data path.
You can first download our pre-trained models (including the world models and the finetuned temporal-aware DCAE) from Huggingface.
In addition to our finetuned temporal-aware DCAE, you may also experiment with the original DCAEs provided by MIT Han Lab as the autoencoder: dc-ae-f32c32-mix-1.0 and dc-ae-f32c32-sana-1.1. After downloading, please change the vae_ckpt in the config files to your own autoencoder checkpoint path.
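For reference, the relevant entries in a config file might look like the following. This is a hedged sketch: only the `datasets_paths` and `vae_ckpt` field names come from this README, and the values are placeholders for your own paths.
```python
# Illustrative excerpt of a config such as configs/dit_config_dcae_nuplan.py
datasets_paths = ["/path/to/your/preprocessed/nuplan"]  # preprocessed data root(s)
vae_ckpt = "/path/to/your/dcae_checkpoint"              # temporal-aware DCAE (or original DCAE) weights
```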
Then, you can run the scripts in the scripts/test folder to test Epona for different use cases:
| Script Name | Dataset | Trajectory Type | Video Length | Use Case Description |
|---|---|---|---|---|
| `test_nuplan.py` | NuPlan | Fixed (from dataset) | Fixed | Evaluation on the NuPlan test set with a fixed setup. |
| `test_free.py` | NuPlan | Self-predicted | Variable (free) | Long-term video generation with autonomous predictions. |
| `test_ctrl.py` | NuPlan | User-provided (poses, yaws) | Variable (free) | Trajectory-controlled video generation; requires manual inputs in the script. |
| `test_traj.py` | NuPlan | Prediction only | N/A | Evaluates the model's trajectory prediction accuracy. |
| `test_nuscenes.py` | NuScenes | Fixed (from dataset) | Fixed | Evaluation on the nuScenes validation set with a fixed setup. |
| `test_demo.py` | Custom input | Self-predicted | Variable (free) | Run Epona on your own input data. |
For example, to test the model on NuPlan test set, you can run:
```bash
python3 scripts/test/test_nuplan.py \
    --exp_name "test-nuplan" \
    --start_id 0 --end_id 100 \
    --resume_path "pretrained/epona_nuplan.pkl" \
    --config configs/dit_config_dcae_nuplan.py
```
where:
- `exp_name` is the name of the experiment;
- `start_id` and `end_id` define the range of test samples;
- `resume_path` is the path to the pre-trained world model;
- `config` is the path to the config file.
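The other test scripts follow the same pattern. For instance, here is a sketch of running on your own data, assuming `test_demo.py` accepts the same common flags as above (check the script for its exact arguments):
```bash
python3 scripts/test/test_demo.py \
    --exp_name "demo-custom" \
    --resume_path "pretrained/epona_nuplan.pkl" \
    --config configs/dit_config_dcae_nuplan.py
```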
All the inference scripts can be run on a single NVIDIA RTX 4090 GPU.
We also provide a simple script, scripts/train_deepspeed.py, for training or finetuning the world model with DeepSpeed.
For example, to train the world model on the NuPlan dataset, you can run:
```bash
export NODES_NUM=4
export GPUS_NUM=8
torchrun --nnodes=$NODES_NUM --nproc_per_node=$GPUS_NUM \
    scripts/train_deepspeed.py \
    --batch_size 2 \
    --lr 2e-5 \
    --exp_name "train-nuplan" \
    --config configs/dit_config_dcae_nuplan.py \
    --resume_path "pretrained/epona_nuplan.pkl" \
    --eval_steps 2000
```
Set `resume_path` to resume training from a previous checkpoint. You can customize the configuration file in the configs folder (e.g., adjust the image resolution, the number of condition frames, the model size, etc.).
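Before launching a multi-node job, a single-GPU sanity run along these lines can be useful (same entry point and flags as above, just one process):
```bash
torchrun --nnodes=1 --nproc_per_node=1 \
    scripts/train_deepspeed.py \
    --batch_size 1 \
    --lr 2e-5 \
    --exp_name "debug-nuplan" \
    --config configs/dit_config_dcae_nuplan.py \
    --eval_steps 2000
```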
Additionally, you can finetune our base world model on your own dataset by implementing a custom dataset class under the dataset folder; a sketch follows below.
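Here is a minimal sketch of such a class, assuming the model consumes clips of frames paired with ego poses. The file layout, dictionary keys, and tensor shapes below are assumptions for illustration, so match them to the existing classes in the dataset folder:
```python
import os
from typing import Dict

import torch
from torch.utils.data import Dataset


class MyDrivingDataset(Dataset):
    """Hypothetical custom dataset: one preprocessed .pt file per clip."""

    def __init__(self, root: str, num_frames: int = 10):
        self.root = root
        self.num_frames = num_frames
        # One preprocessed clip per file, sorted for deterministic indexing.
        self.clips = sorted(f for f in os.listdir(root) if f.endswith(".pt"))

    def __len__(self) -> int:
        return len(self.clips)

    def __getitem__(self, idx: int) -> Dict[str, torch.Tensor]:
        clip = torch.load(os.path.join(self.root, self.clips[idx]))
        # Assumed keys: "frames" (T, C, H, W) and "poses" (T, pose_dim);
        # truncate each clip to the number of frames the model expects.
        return {
            "frames": clip["frames"][: self.num_frames],
            "poses": clip["poses"][: self.num_frames],
        }
```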
Our implementation is based on DrivingWorld, Flux and DCAE. Thanks for these great open-source works!
If any part of our paper or code is helpful to your research, please consider citing our work 📝 and give us a star ⭐. Thanks for your support!
```bibtex
@inproceedings{zhang2025epona,
  author    = {Zhang, Kaiwen and Tang, Zhenyu and Hu, Xiaotao and Pan, Xingang and Guo, Xiaoyang and Liu, Yuan and Huang, Jingwei and Yuan, Li and Zhang, Qian and Long, Xiaoxiao and Cao, Xun and Yin, Wei},
  title     = {Epona: Autoregressive Diffusion World Model for Autonomous Driving},
  booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
  year      = {2025}
}
```