This repository contains the VASA implementation separated from EMOPortraits, with all components properly configured for standalone training.
- Clean separation of VASA motion generation from EMOPortraits volumetric rendering
- Bridge interface for easy swapping of volumetric avatar backends
- XY/UV warping system for expression transfer and canonical view generation
- Efficient caching with single-bucket preprocessing
- Multi-mode training support (overfitting, full dataset)
- MCP Server Setup (for Claude integration):
# Add Weights & Biases MCP server for Claude
claude mcp add wandb -- uvx --from git+https://github.com/wandb/wandb-mcp-server wandb_mcp_server && uvx wandb login
- Clone the repository with submodules:
# Clone with submodules included
git clone --recurse-submodules https://github.com/johndpope/VASA-1-hack.git
cd VASA-1-hack
# Or if you already cloned without submodules:
git submodule update --init --recursive
# Install system dependencies
sudo apt-get update
sudo apt-get install -y ffmpeg git-lfs
mkdir -p ~/miniconda3
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -O ~/miniconda.sh
chmod +x ~/miniconda.sh
~/miniconda.sh
# follow the installer prompts and type yes to accept the license
# Create conda environment
conda create -n vasa python=3.12
conda activate vasa
# Install PyTorch (adjust for your CUDA version)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu129
# Install required packages
pip install omegaconf wandb opencv-python-headless pillow scipy matplotlib tqdm
pip install transformers diffusers accelerate einops
pip install facenet-pytorch insightface hsemotion-onnx
pip install mediapipe
pip install memory-profiler rich
pip install h5py scikit-learn seaborn python_speech_features
pip install onnxruntime-gpu lpips pytorch_msssim
# EMOPortraits (nemo submodule) setup
cd nemo
chmod +x ./bootstrap.sh
./bootstrap.sh
- Create necessary symlinks:
# Create symlink for repos (required for relative paths)
ln -s nemo/repos repos
# Create symlink for data directory (required for aligned keypoints)
ln -s nemo/data data
# Create symlink for losses directory (required for loss model weights)
ln -s nemo/losses losses
- Download pre-trained volumetric avatar model:
The pre-trained model should be placed in:
nemo/logs/Retrain_with_17_V1_New_rand_MM_SEC_4_drop_02_stm_10_CV_05_1_1/checkpoints/328_model.pth
- Prepare your training data:
# Create directories
mkdir -p junk cache checkpoints
# Place your training videos in the junk directory
# Videos should be .mp4 format
cp your_training_videos/*.mp4 junk/
VASA-1-hack/
├── nemo/ # Git submodule: nemo repository (base EMOPortraits code)
│ ├── models/ # Model implementations
│ ├── networks/ # Network architectures
│ ├── losses/ # Loss functions
│ ├── datasets/ # Dataset loaders
│ ├── repos/ # External repositories (face_par_off, etc.)
│ └── logs/ # Pre-trained model checkpoints
│
├── vasa_*.py # VASA-specific implementations
│ ├── vasa_trainer.py # Main training script
│ ├── vasa_model.py # VASA model architecture
│ ├── vasa_dataset.py # VASA dataset handler
│ ├── vasa_scheduler.py # Diffusion scheduler
│ └── vasa_lip_normalizer.py # Lip normalization utilities
│
├── vasa_config.yaml # Main configuration file
├── video_tracker.py # Video tracking utilities
├── syncnet.py # Sync network implementation
│
├── data/ # Data files
│ └── aligned_keypoints_3d.npy
├── losses/ # Loss model weights
│ └── loss_model_weights/
├── junk/ # Training videos directory
├── cache/ # Cache for processed data
├── checkpoints/ # Model checkpoints
└── repos/ # Symlink to nemo/repos
Edit vasa_config.yaml to configure paths and training parameters:
paths:
  volumetric_model: "nemo/logs/[...]/328_model.pth"  # Pre-trained model
  volumetric_config: "nemo/models/stage_1/volumetric_avatar/va.yaml"
  data_dir: "data"
  video_folder: "junk"          # Your training videos directory
  cache_dir: "cache"
  checkpoint_dir: "checkpoints"
train:
  batch_size: 1
  num_epochs: 4000
  lr: 1e-3
  # ... other training parameters
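The trainer presumably parses this file with OmegaConf (installed above). A minimal sketch for sanity-checking values before a run, assuming the key layout shown in the excerpt:
# Quick config sanity check - assumes the key layout shown above
from omegaconf import OmegaConf

cfg = OmegaConf.load("vasa_config.yaml")
print(cfg.paths.video_folder, cfg.train.batch_size, cfg.train.lr)
# Override a value without editing the file (dotlist syntax)
cfg = OmegaConf.merge(cfg, OmegaConf.from_dotlist(["train.batch_size=2"]))
print(cfg.train.batch_size)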
python test_vasa_setup.py
Expected output:
✓ Config loaded successfully
✓ All paths exist
✓ All modules import correctly
✓ Setup looks good! You can now run vasa_trainer.py
Test your setup and verify that the model can train properly:
# Run overfitting test with optimized settings
python train_overfit.py
This uses overfit_config.yaml with:
- Single-bucket caching for fast data loading
- Face attribute caching (gaze, emotion, head_distance)
- Optimized batch sizes and learning rates
- WandB integration for monitoring
- Automatic checkpoint resumption
Use the standard configuration for training on your complete dataset:
# Uses vasa_config.yaml by default
python vasa_trainer.py
# Or explicitly specify the config
python vasa_trainer.py --config vasa_config.yaml
Key parameters in vasa_config.yaml:
- window_size: 50 - full 50-frame windows
- n_layers: 8 - full 8 transformer layers
- num_steps: 1000 - full 1000 diffusion steps
- batch_size: 1 - adjust based on GPU memory
- num_epochs: 4000 - full training schedule
Use the overfitting configuration via vasa_trainer:
# Use the overfitting configuration with vasa_trainer
python vasa_trainer.py --config overfit_config.yaml
Key differences in overfit_config.yaml:
- window_size: 20 - smaller windows for faster processing
- n_layers: 2 - reduced transformer depth (2x-4x faster)
- num_steps: 100 - reduced diffusion steps (10x faster)
- batch_size: 4 - larger batch for better GPU utilization
- num_epochs: 100 - shorter training for quick iteration
- max_videos: 100 - limited dataset size
- num_workers: 8 - multi-threaded data loading
- No augmentation - pure overfitting test
When to use overfitting mode:
- Testing new model architectures
- Debugging training pipeline
- Verifying data loading and caching
- Quick convergence tests
- Checking if model can overfit to small dataset (sanity check)
For faster training, preprocess all windows into a single cache file:
# Preprocess data for overfitting test (small dataset)
python preprocess_single_bucket.py --max_videos 100 --cache_dir cache_overfit
# Preprocess full dataset
python preprocess_single_bucket.py --max_videos 1000 --cache_dir cache_full
Benefits of single-bucket caching:
- 10x faster data loading - Direct index access to any window
- Face attributes cached - Gaze, emotion, head_distance pre-computed
- Better shuffling - Perfect for random sampling
- Memory efficient - One H5 file instead of many
- Self-contained windows - Context is cached, no video dependencies
The cache will be used automatically if:
- use_single_bucket: true is set in your config file
- The cache file exists in the specified cache_dir (see the read sketch below)
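For orientation, a minimal sketch of pulling one window straight out of a single-bucket H5 cache with h5py; the file name and dataset keys are illustrative, not the exact layout written by preprocess_single_bucket.py:
# Illustrative single-bucket read - file name and keys are assumptions, not the real schema
import h5py

with h5py.File("cache_overfit/single_bucket.h5", "r") as f:   # hypothetical cache file name
    print(list(f.keys()))                                      # inspect what was actually cached
    xy = f["xy_warps"][0]                                      # hypothetical key: direct index access, no video decode
    print(xy.shape)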
Both training modes support WandB logging:
# View training progress
# Visit the URL printed at training start, e.g.:
# wandb: 🚀 View run at https://wandb.ai/your-username/vasa/runs/run-id
For overfitting mode, runs are grouped as "overfit-experiments" in WandB for easy comparison.
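For reference, a minimal sketch of the kind of wandb.init call behind this behaviour; the project name and config values are placeholders:
# Illustrative W&B setup - project name and config values are placeholders
import wandb

run = wandb.init(
    project="vasa",                  # run URL is printed at start, e.g. wandb.ai/<user>/vasa/runs/<id>
    group="overfit-experiments",     # groups overfitting runs together for comparison
    config={"window_size": 20, "n_layers": 2, "num_steps": 100},
)
wandb.log({"train/loss": 0.42, "epoch": 1})
run.finish()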
To use a different dataset (e.g., CelebV-HQ):
# Edit the config file or create a custom one
# Update video_folder path in the config:
# video_folder: "/path/to/your/dataset"
# For example, using CelebV-HQ:
# video_folder: "/media/12TB/Downloads/CelebV-HQ/celebvhq/35666"
The trainer will:
- Load the pre-trained volumetric avatar model
- Process videos from the configured directory
- Cache processed windows for faster subsequent epochs
- Save checkpoints every save_freq epochs to checkpoints/ (or checkpoints_overfit/ in overfitting mode), as sketched below
- Log to Weights & Biases (if enabled)
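A minimal sketch of the save_freq cadence described above, assuming a standard PyTorch loop; the stand-in model and the exact checkpoint contents are illustrative, not what vasa_trainer.py literally does:
# Illustrative checkpoint cadence - the real trainer saves additional state
import os
import torch
import torch.nn as nn

model = nn.Linear(8, 8)                                            # stand-in for the VASA model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
save_freq, num_epochs, checkpoint_dir = 50, 4000, "checkpoints"    # values come from the config
os.makedirs(checkpoint_dir, exist_ok=True)

for epoch in range(num_epochs):
    # ... one training epoch over the cached windows ...
    if (epoch + 1) % save_freq == 0:
        torch.save(
            {"epoch": epoch, "model": model.state_dict(), "optimizer": optimizer.state_dict()},
            os.path.join(checkpoint_dir, f"vasa_epoch_{epoch + 1:04d}.pth"),
        )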
Parameter | Vanilla Training | Overfitting Mode | Speedup |
---|---|---|---|
Window Size | 50 frames | 20 frames | 2.5x |
Transformer Layers | 8 | 2 | 4x |
Diffusion Steps | 1000 | 100 | 10x |
Batch Size | 1 | 4 | 4x |
Workers | 0 | 8 | Parallel loading |
Epoch Time (RTX 5090) | ~5 min | ~1.5 min | 3.3x |
Convergence | 1000+ epochs | 10-20 epochs | 50x+ |
The project includes several debugging pipelines for analyzing face swap and identity preservation issues:
# Test with video (uses joint extraction to prevent identity drift)
python nemo/pipeline3.py --target nemo/data/VID_1.mp4 --max-frames 10
# Test with single image
python nemo/pipeline3.py --target nemo/data/IMG_2.png
# Use custom source identity
python nemo/pipeline3.py --source path/to/source.png --target path/to/target.mp4
# Swap identity mode (use driver's identity with source's expression)
python nemo/pipeline3.py --default-video --swap-identity
# This is useful when the model is extracting the wrong identity
Features:
- Joint extraction: Processes source+first_driver_frame together to calibrate embeddings
- Identity swapping: --swap-identity flag to use the driver's identity with the source's expression
- Comprehensive tracing: Every step logged with images and tensors
- Comparison grids: Side-by-side visualization of results
- Warp visualization: XY/UV warp magnitude heatmaps
- Debug output: All intermediates saved to debug_pipeline3/
# The reference pipeline that produces correct results
python nemo/pipeline2.py
This is the baseline implementation that pipeline3.py was designed to match.
Various analysis scripts for specific debugging:
- check_identity_confusion.py - Analyze identity preservation
- debug_identity_extraction.py - Test identity feature extraction
- test_polished_face_swap.py - Test face swap quality
- extract_and_apply_warps_properly.py - Analyze warp field application
The volumetric avatar system uses two types of warps (a warp-application sketch follows this list):
- XY Warps (rigid + non-rigid 3D warping)
  - Transform from the posed face → canonical (neutral) space
  - Remove head pose and expression from the source
  - Create an identity-preserving canonical volume
- UV Warps (expression transfer)
  - Transform from canonical → target expression
  - Apply the target's expression and pose
  - Preserve source identity while adopting the target motion
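Both warp types are dense [D, H, W, 3] sampling grids over the 16×64×64 volume (see the shapes in the dataset snippet further down). A minimal sketch of applying one such field to a feature volume with PyTorch's grid_sample; the tensors here are random stand-ins, including the channel count:
# Illustrative warp application - one [D, H, W, 3] field resampling a feature volume
import torch
import torch.nn.functional as F

volume = torch.randn(1, 96, 16, 64, 64)        # [N, C, D, H, W] feature volume (channel count illustrative)
warp = torch.rand(1, 16, 64, 64, 3) * 2 - 1    # [N, D, H, W, 3] grid in normalized [-1, 1] coordinates
warped = F.grid_sample(volume, warp, mode="bilinear", align_corners=True)
print(warped.shape)                             # torch.Size([1, 96, 16, 64, 64])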
Problem: Generated face morphs away from the source identity
Cause: Solo extraction (processing the source alone without driver context)
Solution: Joint extraction - process source+first_driver_frame together

Problem: Male faces (e.g., IMG_1.png) appear feminine in results
Cause: Identity embeddings not properly calibrated to the driver motion space
Solution: Joint extraction ensures embeddings are aligned with driver poses
debug_pipeline3/
├── trace_YYYYMMDD_HHMMSS.json # Complete execution trace
├── step_NNNN_*.png # Intermediate images at each step
├── step_NNNN_*.pt # Tensor checkpoints
├── frame_NNN_result.png # Final output frames
└── video_comparison.png # Grid comparison of all frames
The trace files contain detailed information about each processing step:
- Entry/exit points for all major functions
- Tensor shapes and statistics
- Mask generation and compositing steps
- Warp field generation and application
Use the trace to identify where identity drift or other issues occur in the pipeline; a quick inspection sketch follows.
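For a quick look inside a trace, a hedged sketch using the standard json module; the internal schema isn't documented here, so treat the field handling as an assumption:
# Illustrative trace inspection - the real schema may differ
import json

with open("debug_pipeline3/trace_YYYYMMDD_HHMMSS.json") as f:   # substitute a real timestamp
    trace = json.load(f)
# Print a few entries (keys if it is a dict, steps if it is a list) to locate the failing stage
for entry in list(trace)[:5]:
    print(entry)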
The VASA model uses a sophisticated two-stage warping system to separate identity from expression, enabling clean expression transfer between faces.
- Coordinate System: XY refers to spatial coordinates (X=width, Y=height) in the 3D volume space (16×64×64 grid)
- Direction: FROM current expression → TO canonical (neutral)
- Purpose: Expression normalization - removes the current expression to get back to a neutral state
- Effect: "Undoes" expressions (e.g., moves smiling mouth corners back to neutral positions)
- Applied to: The source volume before any target expression is added
- Coordinate System: UV uses texture/surface coordinates (0-1 normalized range)
- Direction: FROM canonical → TO target expression
- Purpose: Expression application - adds the desired expression to the neutral volume
- Effect: Deforms canonical volume to create new expressions (smile, frown, surprise, etc.)
- Applied to: The volume after XY warping (canonical state)
Source Face (😊) → [XY Warp] → Canonical (😐) → [UV Warp] → Target Face (😮)
- Stage 1 (XY Warping): Normalizes any expression to canonical
- Stage 2 (UV Warping): Applies target expression to canonical
This separation enables:
- Clean expression transfer between any source and target
- Identity preservation while changing expressions
- Consistent canonical representation for all faces
The warps are extracted during dataset preprocessing:
# In vasa_dataset.py - extract warps for training
motion_data = {
'xy_warps': xy_warps, # [T, 16, 64, 64, 3] - normalizes to canonical
'rigid_warps': rigid_warps, # [T, 16, 64, 64, 3] - head pose alignment
'uv_warps': uv_warps, # [T, 16, 64, 64, 3] - applies target expression
'source_theta': thetas # [T, 3, 4] - pose matrices
}
To cleanly separate VASA from the volumetric avatar implementation, we've developed a bridge interface that abstracts all EMOPortraits-specific details.
Abstract interface that any volumetric avatar backend must implement:
class VolumetricAvatarBridgeInterface:
    def extract_warps_for_window(self, frames, identity_frame_idx) -> "WindowWarpData": ...
    def extract_warps_for_frame(self, identity_frame, target_frame) -> "FrameWarpData": ...
    def generate_canonical_view(self, identity_frame): ...   # returns canonical_image
    def get_identity_embedding(self, identity_frame): ...    # returns identity_embed
Concrete implementation for EMOPortraits/MegaPortraits models:
- Handles all model-specific details internally
- Provides clean warp extraction API
- Manages caching for efficiency
- Supports batch processing for entire windows
from vasa_emo_bridge_interface import create_bridge
# Create bridge (abstracts all EMO details)
bridge = create_bridge("emoportraits", emo_model)
# Extract warps for entire window at once
window_warps = bridge.extract_warps_for_window(
frames=frames, # [T, C, H, W]
identity_frame_idx=0 # Use first frame as identity
)
# Access extracted warps
xy_warps = window_warps.xy_warps # [T, D, H, W, 3]
rigid_warps = window_warps.rigid_warps # [T, D, H, W, 3]
uv_warps = window_warps.uv_warps # [T, D, H, W, 3]
# Generate canonical view
canonical = bridge.generate_canonical_view(identity_frame)
- Clean Separation: VASA code doesn't need to know EMOPortraits internals
- Easy Swapping: Can replace the volumetric backend without changing VASA (see the sketch after this list)
- Batch Efficiency: Process entire windows at once
- Automatic Caching: Identity embeddings cached automatically
- Type Safety: Clear data structures with type hints
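For illustration, a hypothetical backend written against the interface; the import location, class name, and encode_identity call are assumptions, not code that exists in this repository:
# Hypothetical backend - shows the shape of the interface, not a real implementation
from vasa_emo_bridge_interface import VolumetricAvatarBridgeInterface   # assumed import location

class MyVolumetricBackendBridge(VolumetricAvatarBridgeInterface):
    def __init__(self, model):
        self.model = model
        self._identity_cache = {}                     # mirrors the automatic caching described above

    def get_identity_embedding(self, identity_frame):
        key = id(identity_frame)
        if key not in self._identity_cache:
            self._identity_cache[key] = self.model.encode_identity(identity_frame)   # hypothetical call
        return self._identity_cache[key]

# VASA-side code keeps calling the same API regardless of backend:
# bridge = MyVolumetricBackendBridge(my_model)
# window_warps = bridge.extract_warps_for_window(frames, identity_frame_idx=0)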
The system can generate canonical (neutral, front-facing) views from any input expression:
A canonical view represents a person in a standardized state:
- Neutral expression (no smile, closed mouth)
- Front-facing pose (no head rotation)
- Consistent lighting and appearance
The generation process (sketched after this list):
- Extract the identity embedding from the source frame
- Create canonical pose (identity matrix = no rotation)
- Process through volumetric model to get canonical volume
- Decode with minimal warping to get neutral view
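A minimal sketch of these steps using the bridge API from the earlier example (reusing its bridge and identity_frame variables); the canonical pose construction is illustrative, since steps 2-4 run inside the backend:
# Illustrative canonical-view generation via the bridge
import torch

canonical_pose = torch.eye(3, 4).unsqueeze(0)                     # [1, 3, 4]: identity rotation, zero translation
identity_embed = bridge.get_identity_embedding(identity_frame)    # step 1: identity embedding
canonical = bridge.generate_canonical_view(identity_frame)        # steps 2-4 happen inside the backend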
Canonical views are used for:
- Reference frame generation for consistent motion synthesis
- Expression normalization for training
- Identity preservation during expression transfer
- Quality evaluation of the volumetric model
When given different expressions as input, the canonical generation produces nearly identical neutral views:
- Average difference between canonical views: < 0.1 (excellent consistency)
- Identity fully preserved
- All expressions normalized to neutral
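The consistency figure above can be checked with a simple per-pixel comparison; a hedged sketch, assuming canonical views come back as tensors in the same value range (the input frame names are placeholders):
# Illustrative consistency check between canonical views of two different expressions
canon_a = bridge.generate_canonical_view(smiling_frame)    # placeholder input frames
canon_b = bridge.generate_canonical_view(neutral_frame)
diff = (canon_a - canon_b).abs().mean().item()
print(f"mean abs difference: {diff:.3f}")                  # the repo reports < 0.1 across expressions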
The project uses Python's logging module with three configurable levels, defined in nemo/logger.py (lines 28-30):
# log_level = logging.WARNING # Minimal output - only warnings and errors
log_level = logging.INFO # Standard output - informational messages (default)
# log_level = logging.DEBUG # Verbose output - detailed debugging information
Logging Levels Explained:
- WARNING (logging.WARNING)
  - Shows only warnings, errors, and critical messages
  - Use when you want minimal console output during training
  - Best for production runs where you only need to know about issues
- INFO (logging.INFO) - currently active
  - Shows informational messages, warnings, and errors
  - Provides training progress, epoch updates, and key metrics
  - Default and recommended level for normal training runs
  - Balances visibility with readability
- DEBUG (logging.DEBUG)
  - Shows all messages, including detailed debugging information
  - Includes tensor shapes, gradient information, and internal state
  - Use when troubleshooting model issues or understanding data flow
  - Can be verbose - recommended only for debugging sessions
To change the logging level:
- Edit nemo/logger.py line 29
- Uncomment the desired level and comment out the others
- The change takes effect on the next run
Additional Features:
- Logs are saved to a project.log file for later review
- Rich formatting with color-coded output and timestamps
- Third-party library logging is suppressed to reduce noise
- TorchDebugger class available for advanced PyTorch debugging
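For orientation, a minimal sketch approximating the behaviour described above (Rich console output plus a project.log file, with a noisy third-party logger silenced); the real nemo/logger.py may differ in its details:
# Illustrative logging setup - approximates the behaviour described above
import logging
from rich.logging import RichHandler

log_level = logging.INFO                                   # swap for WARNING or DEBUG as needed
logging.basicConfig(
    level=log_level,
    format="%(asctime)s %(name)s %(message)s",
    handlers=[RichHandler(rich_tracebacks=True), logging.FileHandler("project.log")],
)
logging.getLogger("matplotlib").setLevel(logging.WARNING)  # suppress noisy third-party libraries
log = logging.getLogger("vasa")
log.info("logger ready")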
- ModuleNotFoundError: No module named 'logger'
  # The logger module is in nemo and paths are already configured
  # If you still see this, check that the nemo submodule was cloned properly
- FileNotFoundError: './repos/face_par_off/res/cp/79999_iter.pth'
  # Ensure the symlink exists: ln -s nemo/repos repos
- ValueError: num_samples should be a positive integer value, but got num_samples=0
  # No videos found. Add videos to the junk/ directory: cp your_video.mp4 junk/
- FileNotFoundError: Config file not found at channel_config.yaml
  # Copy from EMOPortraits or create a basic one
- CUDA out of memory
  - Reduce batch_size in vasa_config.yaml
  - Enable gradient checkpointing
  - Reduce sequence_length in the dataset config
- FFmpeg warnings
  - These can be safely ignored if not processing audio
  - To fix: pip install ffmpeg-python
If you're missing files, you'll need these from EMOPortraits:
- channel_config.yaml - Channel configuration
- syncnet.py - Sync network implementation
- data/aligned_keypoints_3d.npy - 3D keypoint alignments
- losses/loss_model_weights/*.pth - Pre-trained loss models
- Pre-trained volumetric avatar checkpoint
Training progress is logged to:
- Console: Real-time training metrics
- Weights & Biases: Detailed metrics and visualizations (if enabled)
- Checkpoints: Saved every N epochs to checkpoints/
Monitor training:
# Watch training logs
tail -f project.log
# Check W&B dashboard
# https://wandb.ai/YOUR_USERNAME/vasa/
- VASA-specific code: Root directory (vasa_*.py)
- Base EMOPortraits code: nemo/ directory
- Configuration: vasa_config.yaml
- Training data: junk/ directory
- Model outputs: checkpoints/ directory
- Separated VASA components from EMOPortraits codebase
- Fixed all hardcoded paths to be relative or configurable
- Proper module imports with sys.path management
- Configurable paths via vasa_config.yaml
- Auto-detection of project directories in nemo code
- Clean separation between VASA-specific and base code
Update nemo to latest version:
cd nemo
git pull origin main
cd ..
git add nemo
git commit -m "Update nemo submodule to latest"
Lock to specific nemo version:
cd nemo
git checkout <commit-hash>
cd ..
git add nemo
git commit -m "Lock nemo to specific version"
- The volumetric model must be pre-trained (from EMOPortraits)
- Training requires at least one video in the junk/ directory
- All paths in configs are relative to the project root
- The repos symlink is required for backward compatibility
- Training requires significant GPU memory (recommended: 24GB+)
- Some imports show FFmpeg warnings (can be ignored)
- Initial dataset processing can be slow (cached afterward)
This project is licensed under the MIT License - see the LICENSE file for details.
Note: The nemo submodule and other dependencies may have their own licenses.
- EMOPortraits team for the base implementation
- VASA paper authors for the architecture design
- Contributors to the nemo repository