Gengshan Yang, Chaoyang Wang, N Dinesh Reddy, Deva Ramanan
Carnegie Mellon University
Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging, which are difficult to scale to arbitrary categories. Recently, differentiable rendering has provided a pathway to obtain high-quality 3D models from monocular videos, but these methods are limited to rigid categories or single instances. We present RAC, which builds category-level 3D models from monocular videos while disentangling variation across instances from motion over time. Three key ideas are introduced to solve this problem: (1) specializing a skeleton to instances via optimization, (2) a latent-space regularization that encourages shared structure across a category while maintaining instance details, and (3) using 3D background models to disentangle objects from the background. We show that 3D models of humans, cats, and dogs can be learned from 50-100 internet videos.
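To make the latent-space regularization (idea 2) concrete, here is a minimal sketch, assuming a category-level shape field conditioned on per-instance latent codes with a simple Gaussian prior that pulls codes toward a shared category mean. This is not the authors' implementation; names such as CategoryField, code_dim, and code_prior_loss are hypothetical.

    import torch
    import torch.nn as nn

    class CategoryField(nn.Module):
        def __init__(self, num_instances, code_dim=32):
            super().__init__()
            # One learnable latent code per video/instance.
            self.codes = nn.Embedding(num_instances, code_dim)
            self.mlp = nn.Sequential(
                nn.Linear(3 + code_dim, 128), nn.ReLU(),
                nn.Linear(128, 128), nn.ReLU(),
                nn.Linear(128, 1),  # e.g., signed distance at each query point
            )

        def forward(self, pts, instance_id):
            # pts: (B, N, 3) 3D query points; instance_id: (B,) video indices.
            code = self.codes(instance_id)[:, None, :].expand(-1, pts.shape[1], -1)
            return self.mlp(torch.cat([pts, code], dim=-1))

        def code_prior_loss(self):
            # Gaussian prior: keep instance codes near zero (the category mean),
            # encouraging shared structure while leaving room for instance detail.
            return self.codes.weight.pow(2).mean()

    field = CategoryField(num_instances=60)
    sdf = field(torch.rand(4, 1024, 3), torch.randint(0, 60, (4,)))
    reg = 1e-3 * field.code_prior_loss()  # added to the reconstruction losses

In a setup like this, the prior weight trades off category-level sharing against instance-specific detail; the paper's actual regularizer may differ.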
We show reconstructions from the camera viewpoint (top rows) and two alternative viewpoints (bottom rows).
RAC represents dynamic scenes as a composition of an object field and a background field. We show videos rendered from the reference viewpoint (top rows, 2nd and 3rd columns) and two alternative viewpoints (middle and bottom rows).
Columns: RGB | Normal | Mesh/Skeleton.
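As a rough illustration of the object/background composition described above, the sketch below composites two fields along one ray by summing their densities, density-weighting their colors, and applying the standard volume-rendering quadrature. This is a generic compositing recipe under our own assumptions, not RAC's implementation; composite and its arguments are hypothetical names.

    import torch

    def composite(sigma_obj, rgb_obj, sigma_bg, rgb_bg, deltas):
        """sigma_*: (N,) densities; rgb_*: (N, 3) colors; deltas: (N,) step sizes."""
        sigma = sigma_obj + sigma_bg
        # Per-sample color is the density-weighted mix of the two fields.
        rgb = (sigma_obj[:, None] * rgb_obj + sigma_bg[:, None] * rgb_bg) / (
            sigma[:, None] + 1e-8
        )
        alpha = 1.0 - torch.exp(-sigma * deltas)  # per-sample opacity
        # Transmittance: probability the ray reaches each sample unoccluded.
        trans = torch.cumprod(
            torch.cat([torch.ones(1), 1.0 - alpha + 1e-8])[:-1], dim=0
        )
        weights = alpha * trans
        return (weights[:, None] * rgb).sum(dim=0)  # rendered pixel color

    N = 64
    pixel = composite(
        torch.rand(N), torch.rand(N, 3), torch.rand(N), torch.rand(N, 3),
        torch.full((N,), 0.01),
    )

Because the densities add, either field can occlude the other, which is the mechanism that allows the object to be separated from the background during optimization.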
@inproceedings{yang2023rac,
  title     = {Reconstructing Animatable Categories from Videos},
  author    = {Yang, Gengshan and Wang, Chaoyang and Reddy, N. Dinesh and Ramanan, Deva},
  booktitle = {CVPR},
  year      = {2023}
}
Deformable shape reconstruction from video(s):
BANMo: Building Animatable 3D Neural Models from Many Casual Videos. CVPR 2022.
ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction. NeurIPS 2021.
LASR: Learning Articulated Shape Reconstruction from a Monocular Video. CVPR 2021.
DOVE: Learning Deformable 3D Objects by Watching Videos. arXiv preprint.
Deformable shape reconstruction from images:
To The Point: Correspondence-driven monocular 3D category reconstruction. NeurIPS 2021.
Self-supervised Single-view 3D Reconstruction via Semantic Consistency. ECCV 2020.
Shape and Viewpoints without Keypoints. ECCV 2020.
Articulation Aware Canonical Surface Mapping. CVPR 2020.
Learning Category-Specific Mesh Reconstruction from Image Collections. ECCV 2018.
Gengshan Yang and N Dinesh Reddy are supported by the Qualcomm Innovation Fellowship. We thank Dashan Gao and Michel Sarkis for suggestions on the project direction. We thank Fernando De la Torre for suggestions on evaluating on human data. We thank Donglai Xiang for providing data and evaluation scripts for MonoClothCap.