
Scalable and Generalizable Autonomous Driving Scene Synthesis

Zeming Chen, Hang Zhao.

Abstract

TL;DR: We introduce BEV-VAE, a variational autoencoder that unifies multi-view images into a BEV representation for scalable and generalizable autonomous driving scene synthesis.

Generative modeling has shown remarkable success in vision and language, inspiring research on synthesizing autonomous driving scenes. Existing multi-view synthesis approaches commonly operate in image latent spaces with cross-attention to enforce spatial consistency, but they are tightly bound to camera configurations, which limits dataset scalability and model generalization. We propose BEV-VAE, a variational autoencoder that unifies multi-view images into a compact bird’s-eye-view (BEV) representation, enabling encoding from arbitrary camera layouts and decoding to any desired viewpoint. Through multi-view image reconstruction and novel view synthesis, we show that BEV-VAE effectively fuses multi-view information and accurately models spatial structure. This capability allows it to generalize across camera configurations and facilitates scalable training on diverse datasets. Within the latent space of BEV-VAE, a Diffusion Transformer (DiT) generates BEV representations conditioned on 3D object layouts, enabling multi-view image synthesis with enhanced spatial consistency on nuScenes and achieving the first complete seven-view synthesis on AV2. Finally, synthesized imagery significantly improves the perception performance of BEVFormer, highlighting the utility of scalable and generalizable scene synthesis for autonomous driving.

Method

Overall architecture of BEV-VAE with DiT for multi-view image generation.

In Stage 1, BEV-VAE learns to encode multi-view images into a spatially compact latent space in BEV and reconstruct them, ensuring spatial consistency. In Stage 2, DiT is trained with Classifier-Free Guidance (CFG) in this latent space to generate BEV representations from random noise, which are then decoded into multi-view images.
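A minimal PyTorch-style sketch of the two stages is given below; the module names (`encoder`, `decoder`, `dit`), tensor shapes, and the toy noise schedule are illustrative assumptions, not the repository's actual API.

```python
# Minimal sketch of the two training stages; module names, shapes, and the noise
# schedule are illustrative assumptions, not the repository's actual code.
import torch
import torch.nn.functional as F

def stage1_step(encoder, decoder, images, cams, kl_weight=1e-6):
    """Stage 1: fuse multi-view images into one BEV latent and reconstruct all views."""
    mu, logvar = encoder(images, cams)                 # BEV latent, e.g. (B, 16, 32, 32)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    recon = decoder(z, cams)                           # back to (B, N_views, 3, 256, 256)
    rec = F.l1_loss(recon, images)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kl_weight * kl

def add_noise(z0, noise, t, T=1000):
    """Toy linear forward-diffusion schedule, for illustration only."""
    a = (1.0 - t.float() / T).view(-1, 1, 1, 1)
    return a.sqrt() * z0 + (1.0 - a).sqrt() * noise

def stage2_step(dit, z_bev, layout, drop_prob=0.1):
    """Stage 2: train a DiT to denoise BEV latents conditioned on 3D object layouts,
    randomly dropping the condition so Classifier-Free Guidance works at sampling time."""
    if torch.rand(()) < drop_prob:
        layout = None                                  # unconditional branch for CFG
    t = torch.randint(0, 1000, (z_bev.shape[0],), device=z_bev.device)
    noise = torch.randn_like(z_bev)
    pred = dit(add_noise(z_bev, noise, t), t, layout)
    return F.mse_loss(pred, noise)
```

The detail that matters for Stage 2 is the random dropping of the layout condition during training, which is what makes Classifier-Free Guidance available at sampling time.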

Experiments

Datasets

This study uses four multi-camera autonomous driving datasets that differ substantially in scale, camera configuration, annotated categories, and recording locations. Despite these differences, all datasets provide full 360° coverage of the surrounding scene.

| Dataset | #Frames | #Cameras | #Classes | Recording Locations |
| --- | --- | --- | --- | --- |
| WS101 | 17k | 5 | 0 | London, San Francisco Bay Area |
| nuScenes | 155k | 6 | 23 | Boston, Singapore |
| AV2 | 224k | 7 | 30 | Austin, Detroit, Miami, Pittsburgh, Palo Alto, Washington DC |
| nuPlan | 3.11M | 8 | 7 | Boston, Pittsburgh, Las Vegas, Singapore |

We introduce a new hybrid autonomous driving dataset configuration, PAS, which combines nuPlan, AV2, and nuScenes.
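Mixing these datasets means batching rigs with different camera counts (6, 7, and 8 views). One plausible way to handle this, sketched below, is to pad every sample to a fixed number of views and carry a mask of real views; the field names and padding scheme are assumptions for illustration, not the repository's data pipeline.

```python
# Sketch of mixing rigs with different camera counts into one training set.
# Field names ("images", "view_mask") and the padding scheme are assumptions.
import torch
from torch.utils.data import ConcatDataset, DataLoader

MAX_VIEWS = 8  # nuPlan has the largest rig among nuPlan, AV2, and nuScenes

def pad_views(sample, max_views=MAX_VIEWS):
    """Pad a sample to a fixed number of views and mark which views are real,
    so 6-, 7-, and 8-camera frames can share a batch."""
    imgs = sample["images"]                            # (N, 3, H, W); N varies per dataset
    n = imgs.shape[0]
    if n < max_views:
        pad = imgs.new_zeros(max_views - n, *imgs.shape[1:])
        sample["images"] = torch.cat([imgs, pad], dim=0)
    sample["view_mask"] = torch.tensor([1] * n + [0] * (max_views - n))
    return sample

# pas = ConcatDataset([nuplan_ds, av2_ds, nuscenes_ds])  # each yields dicts as above
# loader = DataLoader(pas, batch_size=8, shuffle=True)
```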

Multi-view Image Reconstruction

BEV-VAE learns unified BEV representations by reconstructing multi-view images, integrating semantics from all camera views while modeling 3D spatial structure. Reconstruction metrics provide an indirect evaluation of the quality of the learned BEV representations. For reference, we compare with SD-VAE, a foundational model trained on LAION-5B, which encodes a single $256\times256$ image into a $32\times32\times4$ latent. In contrast, BEV-VAE encodes multiple $256\times256$ views into a $32\times32\times16$ BEV latent, facing the more challenging task of modeling underlying 3D structure.
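As a quick back-of-the-envelope check on the shapes quoted above (assuming the six-camera nuScenes rig), the single BEV latent for a whole scene is smaller than the six per-view SD-VAE latents combined:

```python
# Latent sizes implied by the shapes quoted above (values per 256x256 view / per scene).
sd_vae_per_view = 32 * 32 * 4            # 4,096 values for one reconstructed view
sd_vae_scene    = 6 * sd_vae_per_view    # 24,576 values for six nuScenes views
bev_vae_scene   = 32 * 32 * 16           # 16,384 values for all views together
print(sd_vae_scene, bev_vae_scene)       # -> 24576 16384
```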

Reconstruction metrics on nuScenes compared with SD-VAE.

| Model | Training | Validation | PSNR $\uparrow$ | SSIM $\uparrow$ | MVSC $\uparrow$ | rFID $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| SD-VAE | LAION-5B | nuScenes | 29.63 | 0.8283 | 0.9292 | 2.18 |
| BEV-VAE | nuScenes | nuScenes | 26.13 | 0.7231 | 0.9250 | 6.66 |
| BEV-VAE | PAS | nuScenes | 28.88 | 0.8028 | 0.9756 | 4.74 |

Reconstruction metrics on AV2 compared with SD-VAE.

| Model | Training | Validation | PSNR $\uparrow$ | SSIM $\uparrow$ | MVSC $\uparrow$ | rFID $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| SD-VAE | LAION-5B | AV2 | 27.81 | 0.8229 | 0.8962 | 1.87 |
| BEV-VAE | AV2 | AV2 | 26.02 | 0.7651 | 0.9197 | 4.15 |
| BEV-VAE | PAS | AV2 | 27.29 | 0.8028 | 0.9461 | 2.82 |

SD-VAE focuses on per-view image fidelity, whereas PAS-trained BEV-VAE achieves superior multi-view spatial consistency (MVSC).

Multi-view image reconstruction on nuScenes

Click the image below to watch the ego view rotate 360° horizontally.

Multi-view image reconstruction on AV2

Click the image below to watch the ego view rotate 360° horizontally.

Multi-view image reconstruction on nuPlan

Click the image below to watch the ego view rotate 360° horizontally.

Novel View Synthesis

Novel view synthesis via camera pose modifications on nuScenes. Row 1 shows real images from the nuScenes validation set, and Rows 2-3 show reconstructions with all cameras rotated 30° left and right, where the cement truck and tower crane truck remain consistent across views without deformation.

Novel view synthesis across camera configurations. Row 1 presents real images from the nuPlan validation set. Rows 2 and 3 show reconstructions using camera parameters from AV2 and nuScenes, respectively. The model captures dataset-specific vehicle priors: AV2 reconstructions include both the front and rear of the ego vehicle, while nuScenes reconstructions mainly show the rear (with the rightmost image corresponding to the rear-view camera for alignment).
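Both results come down to editing camera parameters before decoding a fixed BEV latent: rotating every camera-to-ego pose, or swapping in another dataset's rig entirely. The sketch below illustrates the rotation case; the decoder signature and pose convention are assumptions, not the repository's API.

```python
# Sketch of novel view synthesis by editing extrinsics before decoding.
# The decoder signature and the camera-to-ego pose convention are assumptions.
import math
import torch

def yaw_rotation(deg):
    """Rotation about the vertical (z) axis, as a 3x3 matrix."""
    a = math.radians(deg)
    return torch.tensor([[math.cos(a), -math.sin(a), 0.0],
                         [math.sin(a),  math.cos(a), 0.0],
                         [0.0,          0.0,         1.0]])

def decode_rotated_views(decoder, z_bev, extrinsics, intrinsics, deg=30.0):
    """Rotate every camera-to-ego pose about z and decode the unchanged BEV latent:
    the scene content stays fixed while every viewpoint turns by `deg` degrees."""
    R = yaw_rotation(deg).to(extrinsics)
    rotated = extrinsics.clone()                       # (N, 4, 4) camera-to-ego poses
    rotated[:, :3, :3] = R @ extrinsics[:, :3, :3]     # rotate orientations
    rotated[:, :3, 3] = extrinsics[:, :3, 3] @ R.T     # rotate positions about the ego origin
    return decoder(z_bev, rotated, intrinsics)

# Cross-rig synthesis is the same idea: decode the same latent with another dataset's
# cameras, e.g. decoder(z_bev, av2_extrinsics, av2_intrinsics).
```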

Zero-shot BEV Representation Construction

Zero-shot BEV representation construction on WS101. Row 1 shows real images from the WS101 validation set. Rows 2 and 3 show zero-shot and fine-tuned reconstructions, respectively, with object shapes preserved in the zero-shot results and further sharpened after fine-tuning.

Zero-shot and fine-tuned reconstruction metrics on WS101 compared with SD-VAE.

| Model | Training | Validation | PSNR $\uparrow$ | SSIM $\uparrow$ | MVSC $\uparrow$ | rFID $\downarrow$ |
| --- | --- | --- | --- | --- | --- | --- |
| SD-VAE | LAION-5B | WS101 | 23.38 | 0.7050 | 0.8580 | 4.59 |
| BEV-VAE | PAS | WS101 | 16.6 | 0.3998 | 0.8309 | 56.7 |
| BEV-VAE | PAS + WS101 | WS101 | 23.46 | 0.6844 | 0.9505 | 13.78 |

Autonomous Driving Scene Synthesis

Autonomous driving scene synthesis from AV2 to nuScenes.

BEV-VAE with DiT generates a BEV representation from 3D bounding boxes of AV2, which can then be decoded into multi-view images according to the camera configurations of nuScenes.
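At sampling time the layout condition and the camera configuration are independent: Classifier-Free Guidance steers denoising toward the AV2 box layout, and the resulting BEV latent can then be decoded with nuScenes camera parameters. A minimal sketch of one guided step follows; the `dit`/`decoder` signatures and the guidance scale are illustrative assumptions, not the repository's API.

```python
# Sketch of one Classifier-Free Guidance step; signatures and the guidance scale
# are illustrative assumptions.
import torch

@torch.no_grad()
def cfg_noise_pred(dit, z_t, t, layout, guidance_scale=4.0):
    """Blend conditional and unconditional noise predictions for the current step."""
    eps_cond = dit(z_t, t, layout)                     # conditioned on AV2 3D boxes
    eps_uncond = dit(z_t, t, None)                     # unconditional branch
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

# Once the sampling loop yields a clean BEV latent z0, it can be decoded with a
# different rig than the one the layout came from, e.g.:
# images = decoder(z0, nuscenes_extrinsics, nuscenes_intrinsics)
```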

Multi-view image generation on AV2 with 3D object layout editing.

Click the image below to watch the ego view rotate 360° horizontally.

Multi-view image generation on nuScenes with 3D object layout editing.

Click the image below to watch the ego view rotate 360° horizontally.

Data Augmentation for Perception

BEV-VAE w/ DiT with the Historical Frame Replacement strategy (randomly replacing real frames with generated ones) improves BEVFormer’s perception: the detector sees the same annotated object locations rendered with varying appearance, encouraging it to learn appearance-invariant localization.
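A minimal sketch of that replacement rule is shown below, assuming synthesized frames are indexed by the same frame token as their real counterparts; the probability and field names are illustrative assumptions.

```python
# Sketch of Historical Frame Replacement: sometimes swap a real frame's images for
# images synthesized from the same 3D layout. Field names and p are assumptions.
import random

def maybe_replace_frame(sample, generated_images_by_token, p=0.5):
    """Annotations stay fixed; only appearance changes when a synthetic frame is used,
    so the detector sees varied textures for identical geometry."""
    token = sample["frame_token"]
    if token in generated_images_by_token and random.random() < p:
        sample["images"] = generated_images_by_token[token]   # synthesized multi-view images
    return sample
```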

| Perception Model | Generative Model | Augmentation Strategy | mAP $\uparrow$ | NDS $\uparrow$ |
| --- | --- | --- | --- | --- |
| BEVFormer Tiny | - | - | 25.2 | 35.4 |
| BEVFormer Tiny | BEVGen | Training Set + 6k Synthetic Data | 27.3 | 37.2 |
| BEVFormer Tiny | BEV-VAE w/ DiT | Historical Frame Replacement | 27.1 | 37.4 |

TODO

  • Release the paper
  • Tutorial
  • Pretrained weights for Stage 1 & 2
  • Inference code
  • Training code
