
RAC: Reconstructing Animatable Categories from Videos

CVPR 2023

Gengshan Yang, Chaoyang Wang, N. Dinesh Reddy, Deva Ramanan
Carnegie Mellon University


Given monocular videos of a deformable object category with a known skeleton topology, we reconstruct a category-level animatable 3D model. Our model factorizes variation across instances (e.g., shape morphology, skeleton dimensions, and texture) from time-varying variation within an instance (e.g., skeleton articulation and elastic shape deformation). This factorization enables motion and morphology transfer across a category.

Abstract

Building animatable 3D models is challenging due to the need for 3D scans, laborious registration, and manual rigging, which are difficult to scale to arbitrary categories. Recently, differentiable rendering has provided a pathway to obtain high-quality 3D models from monocular videos, but these are limited to rigid categories or single instances. We present RAC, which builds category-level 3D models from monocular videos while disentangling variation across instances from motion over time. Three key ideas are introduced to solve this problem: (1) specializing a skeleton to each instance via optimization, (2) a latent-space regularization method that encourages shared structure across a category while preserving instance details, and (3) using 3D background models to disentangle objects from the background. We show that 3D models of humans, cats, and dogs can be learned from 50-100 internet videos.
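The factorization between instance morphology and time-varying articulation can be illustrated with a toy sketch (not the authors' code; all names are hypothetical). An instance-specific morphology code beta stretches bone lengths, while per-frame articulation angles pose the skeleton, so the same pose can be transferred across morphologies:

```python
import numpy as np

def limb_joints(beta, angles):
    """Joint positions of a 2-bone planar chain.

    beta:   per-bone length multipliers (instance morphology), shape (2,)
    angles: per-bone rotations in radians (time-varying articulation), shape (2,)

    Returns the root and the two joint endpoints. Angles accumulate along
    the chain, as in a forward-kinematics skeleton.
    """
    base_len = 1.0
    p0 = np.zeros(2)
    th1 = angles[0]
    p1 = p0 + base_len * beta[0] * np.array([np.cos(th1), np.sin(th1)])
    th2 = th1 + angles[1]
    p2 = p1 + base_len * beta[1] * np.array([np.cos(th2), np.sin(th2)])
    return p0, p1, p2

# Same articulation applied to two morphology codes: the motion transfers,
# while limb proportions change with the instance.
pose = np.array([0.3, -0.6])
cat_limb = limb_joints(np.array([1.0, 1.0]), pose)
dog_limb = limb_joints(np.array([1.6, 1.2]), pose)
```

Because pose and morphology enter through separate codes, swapping beta while holding the angles fixed is exactly the kind of motion/morphology transfer the abstract describes, here reduced to a two-bone chain.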

[Paper] [Code] [Poster] [Slides]

Video

[Download]

Results: Category Reconstruction

Cat →[More] (76 videos)

We show reconstructions from the camera viewpoint (top rows) and two alternative viewpoints (bottom rows).

Dog →[More] (87 videos)

We show reconstructions from the camera viewpoint (top rows) and two alternative viewpoints (bottom rows).

Human →[More] (47 videos)

We show reconstructions from the camera viewpoint (top rows) and two alternative viewpoints (bottom rows).

Quadruped →[More] (8 videos)

We show reconstructions from the camera viewpoint (top rows) and two alternative viewpoints (bottom rows).

Results: Dynamic Scene Reconstruction

RAC represents dynamic scenes as a composition of an object field and a background field. We show videos rendered from the reference viewpoint (top rows, 2nd and 3rd columns) and two alternative viewpoints (middle and bottom rows).
Columns (per example): RGB, Normal, Mesh/Skeleton.
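One common way to render such a two-field composition is front-to-back volume rendering over the summed densities, with colors weighted by each field's density at every sample. This is an illustrative sketch under that assumption, not RAC's implementation:

```python
import numpy as np

def composite(sigma_obj, rgb_obj, sigma_bg, rgb_bg, dt):
    """Alpha-composite an object field and a background field along one ray.

    sigma_*: per-sample densities, shape (N,)
    rgb_*:   per-sample colors, shape (N, 3)
    dt:      spacing between samples along the ray
    """
    # Densities add; color at each sample is the density-weighted mix.
    sigma = sigma_obj + sigma_bg
    rgb = (sigma_obj[:, None] * rgb_obj + sigma_bg[:, None] * rgb_bg) \
        / np.maximum(sigma[:, None], 1e-8)
    # Standard front-to-back accumulation (as in NeRF-style rendering).
    alpha = 1.0 - np.exp(-sigma * dt)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha
    return (weights[:, None] * rgb).sum(axis=0)
```

An opaque object sample in front of an opaque background sample yields the object color, so the object can be rendered, edited, or removed independently of the background, which is what makes the disentanglement useful.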

Comparisons

[Comparisons on AMA (human)]

Ablations

[Skeleton vs Control Points] [Morphology code β] [Morphology code regularization] [Soft deformation field]

More Results

[Vehicle Reconstruction]

Bibtex

@inproceedings{yang2023rac,
  title     = {Reconstructing Animatable Categories from Videos},
  author    = {Yang, Gengshan and Wang, Chaoyang and Reddy, N. Dinesh and Ramanan, Deva},
  booktitle = {CVPR},
  year      = {2023}
}

Related projects

Deformable shape reconstruction from video(s):
BANMo: Building Animatable 3D Neural Models from Many Casual Videos. CVPR 2022.
ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction. NeurIPS 2021.
LASR: Learning Articulated Shape Reconstruction from a Monocular Video. CVPR 2021.
DOVE: Learning Deformable 3D Objects by Watching Videos. arXiv preprint.
Deformable shape reconstruction from images:
To The Point: Correspondence-driven monocular 3D category reconstruction. NeurIPS 2021.
Self-supervised Single-view 3D Reconstruction via Semantic Consistency. ECCV 2020.
Shape and Viewpoints without Keypoints. ECCV 2020.
Articulation Aware Canonical Surface Mapping. CVPR 2020.
Learning Category-Specific Mesh Reconstruction from Image Collections. ECCV 2018.

Acknowledgments

Gengshan Yang and N Dinesh Reddy are supported by the Qualcomm Innovation Fellowship. We thank Dashan Gao and Michel Sarkis for suggestions on the project direction. We thank Fernando De la Torre for suggestions on evaluating on human data. We thank Donglai Xiang for providing data and evaluation scripts for MonoClothCap.

Webpage design borrowed from Peiyun Hu.