GitHub - FishWoWater/CAST: An unofficial and simplified implementation of the SIGGRAPH 2025 Best Paper nominee CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image (work in progress)

Unofficial and Simplified Implementation of CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image

This repo demonstrates an unofficial, basic implementation of the SIGGRAPH 2025 Best Paper nominee CAST: Component-Aligned 3D Scene Reconstruction from an RGB Image.

To simplify the pipeline, almost all modules of the system are built on existing serverless APIs (e.g. Replicate/Qwen/Kontext/Tripo3D), which makes it easy to deploy on Windows/Linux/macOS.

Disclaimer

This repo is a proof of concept and a work in progress. Since we do not want to train the occlusion-aware generative model and the iterative registration approach (nor do we have the computational resources to do so), its performance is NOT expected to approach that of the original paper.

Algorithmic Workflow

Demo (Qwen as the image inpainter and TRELLIS as the 3D Generator)

You can try Tripo3D for higher-quality 3D models.

Differences with the Paper

Instead of using Florence2 to first detect objects in the image and then Grounded-SAM to generate masks, here we use IDEA's RAM model to predict boxes/tags/masks all at once.
The original paper puts a lot of effort into training an occlusion-aware 3D generative model to support mesh generation for heavily occluded objects in the image. Here we use a simplified, training-free scheme: use Flux-Kontext / Qwen-Image to inpaint the occluded image, then use off-the-shelf 3D generators like Tripo3D and TRELLIS to generate 3D models.
The original paper implements a point-cloud-conditioned model and an iterative approach to register the 6D pose of an object. Since we don't have the computational power, we again resort to ICP (Iterative Closest Point) or differentiable rendering to optimize the pose/scale of generated objects.

Detailed differences are summarized in the following table:

Modules	Original Paper	This Repo
Detection and Caption	Florence2	RAM-Grounded-SAM
Segmentation	Grounded-SAM	RAM-Grounded-SAM
Detection Filtering	GPT-4	Qwen-VL
Depth Estimation & PointCloud	MoGev1	MoGev2
Mesh Generation	Occlusion-Aware self-trained 3D Generative model	Kontext/Qwen + Tripo3D/TRELLIS
Pose Registration	Occlusion-Aware self-trained 3D Generative model	ICP / DR (differentiable rendering)
Physical Post-Processing	Scene Graph Guided SDF	TO BE DONE
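
To make the "ICP" entry in the Pose Registration row more concrete, here is a minimal NumPy sketch of point-to-point ICP that also estimates a uniform scale, using the closed-form Umeyama similarity fit at each step. It is an illustration only and not the code used in this repo: the function names `umeyama` and `icp_similarity` are hypothetical, the nearest-neighbor search is brute force, and it assumes the generated mesh has been sampled into a point set `src` that is already roughly aligned with the target point cloud `dst`.

```python
import numpy as np


def umeyama(src, dst):
    """Closed-form similarity fit for paired (N, 3) arrays:
    find s, R, t minimizing ||dst - (s * R @ src + t)||^2."""
    mu_src, mu_dst = src.mean(axis=0), dst.mean(axis=0)
    src_c, dst_c = src - mu_src, dst - mu_dst
    cov = dst_c.T @ src_c / len(src)           # 3x3 cross-covariance
    U, S, Vt = np.linalg.svd(cov)
    D = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        D[2, 2] = -1.0                          # keep R a proper rotation
    R = U @ D @ Vt
    var_src = (src_c ** 2).sum() / len(src)     # mean squared distance to centroid
    s = (S * np.diag(D)).sum() / var_src
    t = mu_dst - s * R @ mu_src
    return s, R, t


def icp_similarity(src, dst, iters=30):
    """Point-to-point ICP with uniform scale (hypothetical helper).
    src: (N, 3) points sampled from the generated object.
    dst: (M, 3) target points, e.g. the object's region of the scene point cloud.
    Returns s, R, t such that s * R @ x + t maps src points onto dst."""
    s, R, t = 1.0, np.eye(3), np.zeros(3)
    for _ in range(iters):
        moved = s * src @ R.T + t
        # Brute-force nearest neighbors; fine for small clouds only.
        d2 = ((moved[:, None, :] - dst[None, :, :]) ** 2).sum(axis=-1)
        matched = dst[d2.argmin(axis=1)]
        # Re-fit the full transform from the original src to its current matches.
        s, R, t = umeyama(src, matched)
    return s, R, t
```

Like any ICP variant, this only converges from a reasonable initial guess, and partial or noisy target clouds (common for occluded objects) can pull the scale estimate off. For real point clouds, a KD-tree or a library implementation would replace the brute-force matching.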

Quick Start

1. Setup

# some asyncio support in replicate requires Python 3.11 or higher
conda create -n cast python=3.11 -y
conda activate cast
python install_setup.py

# setup env vars for replicate / qwen / tripo
cp .env.example .env

2. Inference

# check available parameters 
python -m cast.cli -h 
# usage: cli.py [-h] (--image IMAGE | --batch BATCH) [--output OUTPUT] [--no-intermediates] [--run-id RUN_ID] [--no-resume] [--num-max-objects NUM_MAX_OBJECTS] [--validate-only] [--visualize] [--enable-generation] [--generation-threshold {no_occlusion,some_occlusion,severe_occlusion}] [--discard-threshold {no_occlusion,some_occlusion,severe_occlusion}] [--generation-provider {replicate,qwen}] [--mesh-provider {tripo3d,trellis}] [--mesh-base-url MESH_BASE_URL] [--pose-estimation-backend {icp,pytorch}] [--debug]

# inference example
python -m cast.cli -i assets/indoor.png --output outputs --enable-generation --pose-estimation-backend icp --generation-provider qwen --mesh-provider trellis
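
If you want to launch the inference command above from a script (for example, to loop over several images), the following sketch builds the argument list and validates the provider/backend choices before anything is launched. The helper name `build_cast_command` and its defaults are hypothetical; the flags and allowed values are copied from the usage string above, and nothing here is part of the repo's API.

```python
import sys

# Allowed values, copied from the usage string of `python -m cast.cli -h`.
CHOICES = {
    "--pose-estimation-backend": {"icp", "pytorch"},
    "--generation-provider": {"replicate", "qwen"},
    "--mesh-provider": {"tripo3d", "trellis"},
}


def build_cast_command(image, output="outputs", enable_generation=True,
                       pose_backend="icp", generation_provider="qwen",
                       mesh_provider="trellis"):
    """Return the argv list for one `cast.cli` run (hypothetical helper)."""
    options = {
        "--pose-estimation-backend": pose_backend,
        "--generation-provider": generation_provider,
        "--mesh-provider": mesh_provider,
    }
    # Fail early on a typo instead of after the environment has spun up.
    for flag, value in options.items():
        if value not in CHOICES[flag]:
            raise ValueError(f"{flag} must be one of {sorted(CHOICES[flag])}, got {value!r}")
    cmd = [sys.executable, "-m", "cast.cli",
           "--image", str(image), "--output", str(output)]
    if enable_generation:
        cmd.append("--enable-generation")
    for flag, value in options.items():
        cmd += [flag, value]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repo root, assuming the environment from the Setup step is active and the API keys in `.env` are filled in.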

Experiments in Progress (what works and what doesn't)

1. Inpainting-Anything / LaMa: DOES NOT WORK

Possibly because of severe occlusion or the unusual silhouette of the occlusion mask, inpainting methods like LaMa and Inpainting-Anything fail to generate plausible predictions (as shown below). Instead, we use Qwen-Image / Flux-Kontext, which are more powerful and general generation frameworks.

2. Differentiable Rendering

Work in progress; see this thread for reference.

TODO

Experiment further with differentiable rendering.
Implement the scene-graph- and SDF-guided object pose estimation.
Test more use cases.

Comparison with Existing Works

Basically, there are two routes to a component-level reconstruction (generation) from an RGB image: end-to-end models or long procedural pipelines.

MIDI-3D is a promising approach, but it is trained on 3D-Front, so generalization to outdoor scenes/objects is NOT guaranteed. In addition, the 6D poses of generated objects often do not align well with the image.
PartCrafter / PartPacker are basically object-centric, although PartCrafter has a model trained on 3D-Front, an indoor scene dataset.
ArtiScene is a promising procedural approach.
ReconViaGen is object-centric and requires multi-view images as input, which is infeasible for wide-angle or large-scale scenes.

Discussion and contributions are welcome; we need your help :)

LICENSE

See LICENSE; this work is licensed under the MIT License.
