Gaussian Splatting Multi-GPU Guide

A benchmarking guide for gsplat multi-GPU training

Motivation

  • Gaussian splatting is a powerful tool for creating realistic 3D assets, but training times can be prohibitive

    • A typical dataset (300 images at 4K) requires ~1 hour of training on an AWS g5 instance (NVIDIA A10G) at the maximum step count (30k)

    Note: The time above does not include any preprocessing or structure-from-motion (SfM) time, which could range from 2-15 min depending on the input provided (video, images, poses, features, etc.)

  • This repository aims to provide comprehensive insights into:

    • Training duration
    • Output quality
    • Dataset characteristics
    • Scaling efficiency
    • Performance metrics using a cloud desktop (see Summary below for exact hardware used for experiments)

Technical Approach

  • Data Parallelism Benefits

    • Gaussian splatting models are compact enough to fit on a single GPU (e.g., NVIDIA A10G)
      • The caveat: GS training is very sensitive to image resolution and image count
      • Generally speaking, increasing the resolution or the number of images increases the GPU VRAM required
      • Our experiments confirm that 300 4K images can be trained within 24 GB of VRAM; higher resolutions and/or more images will require additional user testing to ensure the entire trained model still fits on the GPU
    • The dataset can be efficiently distributed across multiple GPUs (a minimal sketch follows this list)
    • The model's small size enables full GPU utilization
  • Multi-node Limitations with GS Models

    • Benefits from multi-node/multi-machine setups are limited or unknown at this time
    • It is possible that multi-node training would allow increased resolution, but that is out of scope for the current version of this project
    • Interconnect latency impacts performance
    • Likely cost-prohibitive compared to single-node multi-GPU solutions
    • Further tests need to be conducted on all of the above
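
As a concrete illustration of the data-parallel setup above, here is a minimal, hypothetical sketch (not gsplat's actual implementation) of sharding training images across GPUs with PyTorch's DistributedSampler; ImageDataset is a placeholder name.

    import torch.distributed as dist
    from torch.utils.data import Dataset, DataLoader
    from torch.utils.data.distributed import DistributedSampler

    class ImageDataset(Dataset):
        """Hypothetical placeholder: one item per training image."""
        def __init__(self, num_images=125):
            self.num_images = num_images
        def __len__(self):
            return self.num_images
        def __getitem__(self, idx):
            return idx  # real code would load the image and its camera pose

    dist.init_process_group("nccl")        # one process per GPU (launch with torchrun)
    dataset = ImageDataset()
    sampler = DistributedSampler(dataset)  # each rank sees a disjoint shard of images
    loader = DataLoader(dataset, batch_size=1, sampler=sampler)
    # With 4 GPUs and 125 images, each rank iterates over ~32 images per epoch.
    # Gradients are averaged across ranks, so the effective batch size is
    # num_gpus * batch_size -- which is why the step count can be scaled down.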

Repository Features

  • Comprehensive multi-GPU distributed training setup documentation
    • All data in this repo was produced on a single node, i.e., multiple GPUs in one machine.
  • Benchmark results for various models, focusing on gsplat (Apache-2.0 licensed)
  • Analysis of quality/speed trade-offs for sparse reconstruction scenarios:
    • Small datasets (<200 images)
    • 4K resolution images
    • Based on Mip-NeRF-360 datasets

I. Summary

Hardware used

  • g5.12xlarge (4x NVIDIA A10G, 24 GB VRAM each)
  • If you have an AWS account, use the AWS CloudFormation template here to spin up a full-featured Ubuntu GPU desktop with the NICE DCV remote display protocol for streaming the desktop

Datasets used

  • Mip-NeRF 360 Dataset Download

    Note: The image counts and resolutions below are the baseline, meaning the entire dataset was used at full resolution.

    • Stump
      • Total number of images in dataset: 125
      • Image resolution: 4978x3300
    • Flowers
      • Total number of images in dataset: 173
      • Image resolution: 5025x3312

Models Used in Experiment

  • gsplat (v1.4.0, MCMC strategy)
  • Inria gaussian-splatting
  • Grendel-GS

Metrics Compared

  • Elapsed training time (min)
  • PSNR (all models); SSIM and LPIPS (gsplat only); L1 (Inria and Grendel-GS only)
  • Avg. GPU utilization and VRAM usage per active GPU

Graph Variations

Note: the results show both the baseline steps_scaler and a tuned steps_scaler based on experiment results

  • Training Time vs. Number of GPUs
  • PSNR vs. Number of GPUs
  • SSIM vs. Number of GPUs (gsplat only)
  • LPIPS vs. Number of GPUs (gsplat only)

II. Results

TL;DR

  • Decreases in Training Time with gsplat

    NOTE: the data reflects tuning the steps_scaler to match baseline quality; see the data, graphs, and calculated trend line below

    • 2 GPUs => 55% decrease in training time (2.4x speed-up)
    • 3 GPUs => 70% decrease in training time (3.5x speed-up)
    • 4 GPUs => 75% decrease in training time (3.9x speed-up)
  • gsplat has shown quality equal to the others (Inria and Grendel-GS) at the baseline number of steps, and is slightly faster when using the modified steps_scaler equation below
  • The steps_scaler parameter in gsplat does not accurately scale to maintain constant quality as the number of GPUs increases (it requires tuning, or use the estimated equation below)
    • Current official gsplat guidance: normal steps_scaler = 1/(num_gpus*batch_size)
    • If the above equation is used, a steady increase in quality is observed as the number of GPUs increases
      • Therefore, the equation below should be used instead in order to maintain constant quality (and realize the time savings)
      • Our experiments suggest: modified steps_scaler = 0.9576*(num_gpus*batch_size)^(-1.689), based on a trend line fit to the data we tested (see the Graphs section). The fit was found empirically from steps_scaler values that yielded approximately the same quality as single-GPU training; the exact value is dataset-sensitive. A worked calculation and code sample appear in the Sample modified steps_scaler calculation section below.
  • The model does not benefit greatly from larger batch sizes at 4K resolution (and sometimes OOMs); keep the batch size at one unless training at 2K resolution or lower
  • The model would likely not benefit from multi-node/multi-machine training due to its small size and accumulated interconnect latency

Data



Stump dataset, gsplat (v1.4.0 mcmc), G5.12xlarge (4x, A10G, 24GB)

| Dataset | Number of Images | Image Resolution | Data Scale Factor | Num GPUs | Batch Size | Steps Scaler | Avg. GPU Utilization | Avg. VRAM Usage per Active GPU | Elapsed Time (min) | PSNR | SSIM | LPIPS |
|---------|------------------|------------------|-------------------|----------|------------|--------------|----------------------|--------------------------------|--------------------|------|------|-------|
| stump | 125 | 4978x3300 | 1 | 1 | 1 | 1.000 | 100% | 65% | 110 | 26.868 | 0.8178 | 0.371 |
| stump | 125 | 4978x3300 | 1 | 2 | 1 | 0.500 | 100% | 80% | 80 | 27.115 | 0.823 | 0.338 |
| stump | 125 | 4978x3300 | 1 | 2 | 1 | 0.300 | 100% | 80% | 51 | 26.866 | 0.8216 | 0.352 |
| stump | 125 | 4978x3300 | 1 | 3 | 1 | 0.333 | 100% | 80% | 67 | 26.967 | 0.8226 | 0.323 |
| stump | 125 | 4978x3300 | 1 | 3 | 1 | 0.160 | 100% | 80% | 35 | 26.861 | 0.821 | 0.345 |
| stump | 125 | 4978x3300 | 1 | 4 | 1 | 0.250 | 100% | 80% | 62 | 27.136 | 0.823 | 0.314 |
| stump | 125 | 4978x3300 | 1 | 4 | 1 | 0.125 | 100% | 75% | 34 | 26.919 | 0.8228 | 0.331 |


Flowers dataset, gsplat (v1.4.0 mcmc), G5.12xlarge (4x, A10G, 24GB)

| Dataset | Number of Images | Image Resolution | Data Scale Factor | Num GPUs | Batch Size | Steps Scaler | Avg. GPU Utilization | Avg. VRAM Usage per Active GPU | Elapsed Time (min) | PSNR | SSIM | LPIPS |
|---------|------------------|------------------|-------------------|----------|------------|--------------|----------------------|--------------------------------|--------------------|------|------|-------|
| flowers | 173 | 5025x3312 | 1 | 1 | 1 | 1.000 | 100% | 60% | 126 | 21.232 | 0.6062 | 0.547 |
| flowers | 173 | 5025x3312 | 1 | 2 | 1 | 0.500 | 100% | 65% | 87 | 21.433 | 0.6168 | 0.495 |
| flowers | 173 | 5025x3312 | 1 | 2 | 1 | 0.250 | 100% | 60% | 49 | 21.272 | 0.6102 | 0.525 |
| flowers | 173 | 5025x3312 | 1 | 3 | 1 | 0.333 | 100% | 55% | 70 | 21.525 | 0.6209 | 0.468 |
| flowers | 173 | 5025x3312 | 1 | 3 | 1 | 0.125 | 100% | 65% | 32 | 21.254 | 0.6118 | 0.514 |
| flowers | 173 | 5025x3312 | 1 | 4 | 1 | 0.250 | 100% | 60% | 65 | 23.001 | 0.6767 | 0.426 |
| flowers | 173 | 5025x3312 | 1 | 4 | 1 | 0.078 | 100% | 65% | 27 | 21.304 | 0.613 | 0.509 |
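
As a sanity check, the short script below (ours, not from the repo's tooling) averages the per-dataset speed-ups from the tuned (lower steps_scaler) rows of the two tables above; it closely matches the 2.4x / 3.5x / 3.9x figures reported in the TL;DR.

    # Average the tuned-run speed-ups from the stump and flowers tables above
    baseline = {"stump": 110, "flowers": 126}  # 1-GPU elapsed time (min)
    tuned = {                                  # tuned multi-GPU elapsed times (min)
        "stump":   {2: 51, 3: 35, 4: 34},
        "flowers": {2: 49, 3: 32, 4: 27},
    }
    for n in (2, 3, 4):
        speedups = [baseline[d] / tuned[d][n] for d in baseline]
        avg = sum(speedups) / len(speedups)
        print(f"{n} GPUs: {avg:.2f}x average speed-up vs. 1 GPU")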


Sample modified steps_scaler calculation

normal steps_scaler = 1/(num_gpus*batch_size)

modified steps_scaler = 0.9576*(num_gpus*batch_size)^(-1.689)

| Number of GPUs | Normal steps_scaler | Modified steps_scaler |
|----------------|---------------------|-----------------------|
| 1 | 1.00 | 1.00 |
| 2 | 0.50 | 0.30 |
| 3 | 0.33 | 0.15 |
| 4 | 0.25 | 0.09 |
| 5 | 0.20 | 0.06 |
| 6 | 0.17 | 0.05 |
| 7 | 0.14 | 0.04 |
| 8 | 0.13 | 0.03 |
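
The snippet below (ours) evaluates both formulas, assuming batch_size = 1, and reproduces the table above:

    def normal_steps_scaler(num_gpus: int, batch_size: int = 1) -> float:
        return 1.0 / (num_gpus * batch_size)

    def modified_steps_scaler(num_gpus: int, batch_size: int = 1) -> float:
        # Empirical trend-line fit from this repo's experiments (dataset-sensitive)
        return 0.9576 * (num_gpus * batch_size) ** (-1.689)

    for n in range(1, 9):
        # Note: the fit evaluates to ~0.96 at n=1; in practice use 1.0 for single-GPU training
        print(f"{n} GPUs: normal={normal_steps_scaler(n):.2f}, modified={modified_steps_scaler(n):.2f}")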


Inria (commit=v54c035f), G5.12xlarge (4x, A10G, 24GB)

| Dataset | Number of Images | Image Resolution | Data Scale Factor | Num GPUs | Batch Size | Avg. GPU Utilization | Avg. VRAM Usage per Active GPU | Elapsed Time (min) | PSNR | L1 |
|---------|------------------|------------------|-------------------|----------|------------|----------------------|--------------------------------|--------------------|------|----|
| stump | 125 | 4978x3300 | 1 | 1 | 1 | 100% | 30% | 157 | 27.5 | 0.027 |
| flowers | 173 | 5025x3312 | 1 | 1 | 1 | 100% | 60% | 183 | 21.328 | 0.051 |


Grendel-GS (commit=e5fea1e9), G5.12xlarge (4x, A10G, 24GB)

| Dataset | Number of Images | Image Resolution | Data Scale Factor | Num GPUs | Batch Size | Avg. GPU Utilization | Avg. VRAM Usage per Active GPU | Elapsed Time (min) | PSNR | L1 |
|---------|------------------|------------------|-------------------|----------|------------|----------------------|--------------------------------|--------------------|------|----|
| stump | 125 | 3840x2545 | 1 | 1 | 1 | 100% | 30% | 107 | 27.625 | 0.026 |
| stump | 125 | 3840x2545 | 1 | 2 | 1 | 100% | 50% | 66 | 27.58 | 0.026 |
| stump | 125 | 3840x2545 | 1 | 3 | 1 | 100% | 40% | 46 | 27.45 | 0.026 |
| stump | 125 | 3840x2545 | 1 | 4 | 1 | 100% | 35% | 37 | 27.46 | 0.026 |
| flowers | 173 | 5025x3312 | 1 | 1 | 1 | 100% | 85% | 225 | 19.7 | 0.065 |
| flowers | 173 | 5025x3312 | 1 | 2 | 1 | 100% | 80% | 103 | 19.54 | 0.067 |
| flowers | 173 | 5025x3312 | 1 | 3 | 1 | 100% | 85% | 76 | 20.445 | 0.0591 |
| flowers | 173 | 5025x3312 | 1 | 4 | 1 | 100% | 85% | 60 | 20.455 | 0.0591 |


Graphs

gsplat with modified steps_scaler parameter

[Graphs: gsplat training time, PSNR, SSIM, and LPIPS vs. number of GPUs]



Summary: stump

[Graphs: stump training time and PSNR vs. number of GPUs]



Summary: flowers

[Graphs: flowers training time and PSNR vs. number of GPUs]



Summary: steps_scaler tuning

[Graph: tuning the gsplat steps_scaler]

Note: By adjusting the steps_scaler parameter so that quality matches the single-GPU baseline, multi-GPU training reduces training time while maintaining quality. Use the following formula to set the gsplat steps_scaler parameter: steps_scaler = 0.9576*(num_gpus*batch_size)^(-1.689)



III. Reference

Dataset

  • Mip-NeRF 360 (see Datasets used in the Summary above)

Installations

  • Base Install using Ubuntu EC2 below

    • Setup Ubuntu EC2 Workstation with base libraries for CUDA and Pytorch

    • Install Conda

      wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
      chmod +x Anaconda3-2022.05-Linux-x86_64.sh
      ./Anaconda3-2022.05-Linux-x86_64.sh
      
      # Follow on-screen instructions
      
      echo 'export PATH="/home/ubuntu/anaconda3/bin:$PATH"' >> ~/.bashrc
      source ~/.bashrc
      conda --version
    • Update Conda

      sudo apt update -y &&
      conda update --all -y &&
      conda install -n base conda-libmamba-solver -y &&
      conda config --set solver libmamba
    • Install Dependencies (torch, cuda 11.8 toolkit, tiny-cuda-nn)

      pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu118 &&
      conda install -c "nvidia/label/cuda-11.8.0" cuda-toolkit -y &&
      sudo apt-get install linux-headers-$(uname -r) &&
      wget https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run &&
      chmod +x cuda_11.8.0_520.61.05_linux.run &&
      sudo bash cuda_11.8.0_520.61.05_linux.run
      
      # Add CUDA to the shell environment (single quotes defer $PATH/$LD_LIBRARY_PATH expansion to shell startup)
      echo 'export PATH=/usr/local/cuda-11.8/bin:$PATH
      export LD_LIBRARY_PATH=/usr/local/cuda-11.8/lib64:/usr/local/lib:/usr/lib:/usr/local/lib64:/usr/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
      
      # Reboot so the new kernel headers and driver take effect, then continue below
      sudo reboot now
      
      # After reboot:
      pip install git+https://github.com/NVlabs/tiny-cuda-nn/#subdirectory=bindings/torch
      
      
    • Create and Activate Conda Environment

      conda create --name nerfstudio -y python=3.10 &&
      conda activate nerfstudio &&
      pip install --upgrade pip
  • Nerfstudio

    • How to install nerfstudio on Ubuntu 22.04 EC2 from above:
      # Install Nerfstudio v1.1.5 from source (optional)
      wget https://github.com/nerfstudio-project/nerfstudio/archive/refs/tags/v1.1.5.tar.gz &&
      tar -xvzf v1.1.5.tar.gz &&
      cd nerfstudio-1.1.5/ &&
      pip install -e .
    • How to run nerfstudio (multi-GPU splatfacto does not currently work)
      ns-train splatfacto --machine.num-devices 1 --machine.num-machines 1 --machine.machine-rank 0 --machine.dist-url 'tcp://127.0.0.1:23456' nerfstudio-data --data "/mnt/efs/data/stump" --downscale-factor 1
  • gsplat

    • How to install gsplat on Ubuntu 22.04 EC2 from above:
      git clone https://github.com/nerfstudio-project/gsplat.git --recursive &&
      cd gsplat &&
      pip install -e . &&
      pip install -r examples/requirements.txt
    • How to run gsplat
      cd gsplat
      CUDA_VISIBLE_DEVICES=0 python examples/simple_trainer.py mcmc --steps_scaler 1.0 --data_factor 1 --disable_viewer --packed --batch-size 1 --data-dir /mnt/efs/data/stump --result-dir /mnt/efs/data/stump/results/batch1_gpu1

      Note: Adjust CUDA_VISIBLE_DEVICES to the number of GPUs (e.g., CUDA_VISIBLE_DEVICES=0,1,2,3)

      Note: Adjust steps_scaler using the formulas above (e.g., for two GPUs with batch size 1, modified steps_scaler ≈ 0.30)
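
      Optionally, a small launch helper (ours, not part of gsplat; trainer flags as shown above) that computes the modified steps_scaler for a chosen GPU count and starts training:

        import os
        import subprocess

        def modified_steps_scaler(num_gpus: int, batch_size: int = 1) -> float:
            # Empirical fit from this repo's experiments (dataset-sensitive)
            return 0.9576 * (num_gpus * batch_size) ** (-1.689)

        num_gpus = 4
        env = dict(os.environ, CUDA_VISIBLE_DEVICES=",".join(str(i) for i in range(num_gpus)))
        subprocess.run(
            [
                "python", "examples/simple_trainer.py", "mcmc",
                "--steps_scaler", f"{modified_steps_scaler(num_gpus):.3f}",
                "--data_factor", "1", "--disable_viewer", "--packed",
                "--batch-size", "1",
                "--data-dir", "/mnt/efs/data/stump",
                "--result-dir", f"/mnt/efs/data/stump/results/batch1_gpu{num_gpus}",
            ],
            env=env,
            check=True,
        )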

  • Inria-GS

    • How to install inria on Ubuntu EC2:
      git clone https://github.com/graphdeco-inria/gaussian-splatting --recursive
      sudo apt update -y &&
      conda update --all -y &&
      conda install -n base conda-libmamba-solver -y &&
      conda config --set solver libmamba &&
      conda install plyfile tqdm "numpy<2.0.0" &&
      pip install joblib &&
      cd gaussian-splatting &&
      pip install submodules/diff-gaussian-rasterization &&
      pip install submodules/simple-knn &&
      pip install submodules/fused-ssim &&
      pip install opencv-python
    • How to run Inria-GS (only supports one GPU currently)
      python train.py -s /mnt/efs/data/stump -m /mnt/efs/data/stump/results/gpu1_scale4k_125 --eval -r 1
  • Grendel-GS

    • How to install Grendel-GS on Ubuntu EC2:
      git clone https://github.com/nyu-systems/Grendel-GS.git --recursive
      cd Grendel-GS
      conda env create --file environment.yml
      conda activate gaussian_splatting
    • How to run Grendel-GS
      torchrun --standalone --nnodes=1 --nproc-per-node=1 train.py --bsz 1 -s /mnt/efs/data/stump --eval

      Note: Adjust nnodes to the number of machines with GPUs

      Note: Adjust nproc-per-node to the number of GPUs per machine
