
PPR: Physically Plausible Reconstruction from Monocular Videos

ICCV 2023 - Oral Presentation

Gengshan Yang Shuo Yang John Z. Zhang Zachary Manchester Deva Ramanan
Carnegie Mellon University


Given monocular videos, PPR builds 4D models of the object and the environment whose physical configurations satisfy dynamics and contact constraints.

Abstract

Given monocular videos, we build 3D models of articulated objects and environments whose 3D configurations satisfy dynamics and contact constraints. At its core, our method leverages differentiable physics simulation to aid visual reconstruction. We couple differentiable physics simulation with differentiable rendering via coordinate descent, which enables end-to-end optimization of not only 3D reconstructions but also physical system parameters from videos. We demonstrate the effectiveness of physics-informed reconstruction on monocular videos of quadruped animals and humans. It reduces reconstruction artifacts (e.g., scale ambiguity, unbalanced poses, and foot swapping) that are challenging to address with visual cues alone, and produces better foot contact estimation.

[Paper] [Code] [Poster (2M)]

Video

[Download]

Motivation

In monocular 3D reconstruction, there is a fundamental projective ambiguity that admits multiple plausible interpretations of the scales between scene elements. We motivate this with an example illustrating the scale ambiguity between a cat and the background. Given the video on the left, we use existing methods to reconstruct the cat and the background independently, and then compose them with relative scale factors ranging from 0.4 to 3.
*We use variants of NeRF for background reconstruction and BANMo for foreground reconstruction.
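
Why does every scale factor explain the same images? Under a pinhole camera, scaling the scene about the camera center leaves all 2D projections unchanged, so image evidence alone cannot pin down the scale. Below is a minimal sketch of this invariance (the point coordinates and unit focal length are illustrative assumptions, not values from the paper):

import numpy as np

# Scaling a 3D point about the camera center (the origin) leaves its
# pinhole projection unchanged.
def project(p):
    return p[:2] / p[2]  # unit focal length

p = np.array([0.3, -0.2, 2.0])   # hypothetical point on the cat
for s in [0.4, 1.0, 2.0, 3.0]:
    print(s, project(s * p))     # identical 2D coordinates for every s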

[Videos: Input Video | 0.4x cat scale | 1x cat scale | 2x cat scale | 3x cat scale]

Although all the 2D projections align perfectly with the input frame, the 3D trajectories appear quite different. Applying a relative scale of 2x produces the most reasonable 3D trajectory. A bigger scale factor (e.g., 3x) makes the cat appear to be sinking; a smaller scale factor (e.g., 0.4x) makes the cat appear to be floating.
Below, we visualize the 3D trajectories of the cat with the background.
* We fix the background scale and vary the object scale. The camera trajectory is visualized as colored axes.

[Videos: 0.4x cat scale | 1x cat scale | 2x cat scale | 3x cat scale]
Besides scale, there are further ambiguities in the scene, for instance in the body poses and the motion in the world frame. As shown in "Reconstruction (2x)", even with a roughly correct scale, the cat's feet touch the ground in some frames (green) but float with a slanted pose in many others (red). These errors are confounded with the scale ambiguity, making them hard to correct even with body and contact priors.
As a fundamental prior governing the dynamics and contact, can physics help us do better?

Method Preview

We aim to find a physically plausible 3D interpretation of a video by coupling visual reconstruction with physics simulation. On one hand, visual reconstruction provides the starting point and a target trajectory for physics simulation; on the other hand, physics simulation provides a prior for visual reconstruction. By alternating between differentiable rendering optimization and differentiable physics simulation, we reach a solution where the reconstruction is physically plausible (a code sketch of this loop follows the example below).



Consider the input videos on the left; we show the simulated motion over successive optimization cycles:

These are the input videos we want to track.
In the first cycle, the physical systems lose balance because the trajectories provided by the visual reconstruction are infeasible.
After 25 cycles, the visual reconstruction (e.g., the scale) has improved, allowing the physical system to follow it.
As the reconstruction of body pose and root body motion becomes better, the physical system successfully follows the reference video.
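
To make the alternation concrete, below is a minimal sketch of the coordinate-descent loop, assuming toy stand-ins: DiffRenderer, DiffSimulator, the losses, and all hyperparameters are hypothetical illustrations, not the paper's API. Each cycle first fits the visual reconstruction to the video while staying close to the latest physics rollout, then fits the physical parameters so the rollout tracks the reconstruction.

import torch

class DiffRenderer(torch.nn.Module):
    """Stand-in for the visual model; its parameter is a 3D root trajectory."""
    def __init__(self, n_frames):
        super().__init__()
        self.trajectory = torch.nn.Parameter(torch.zeros(n_frames, 3))

    def rendering_loss(self, video_targets):
        # Placeholder for the photometric/silhouette losses against the video.
        return ((self.trajectory - video_targets) ** 2).mean()

class DiffSimulator(torch.nn.Module):
    """Stand-in for the physics engine with learnable physical parameters."""
    def __init__(self):
        super().__init__()
        self.log_mass = torch.nn.Parameter(torch.zeros(1))

    def rollout(self, init_state, n_frames, dt=0.01):
        # Placeholder dynamics: integrate a constant force scaled by mass.
        gravity = torch.tensor([0.0, 0.0, -9.8])
        states = [init_state]
        for _ in range(n_frames - 1):
            states.append(states[-1] + dt * gravity * self.log_mass.exp())
        return torch.stack(states)

def optimize(video_targets, n_cycles=25, steps=50):
    n_frames = video_targets.shape[0]
    renderer, simulator = DiffRenderer(n_frames), DiffSimulator()
    for _ in range(n_cycles):
        # (1) Differentiable rendering: fit the video, regularized toward
        #     the most recent physics rollout (the physics prior).
        sim_traj = simulator.rollout(renderer.trajectory[0].detach(),
                                     n_frames).detach()
        opt_r = torch.optim.Adam(renderer.parameters(), lr=1e-2)
        for _ in range(steps):
            loss = (renderer.rendering_loss(video_targets)
                    + 0.1 * ((renderer.trajectory - sim_traj) ** 2).mean())
            opt_r.zero_grad(); loss.backward(); opt_r.step()
        # (2) Differentiable simulation: fit physical parameters so the
        #     rollout tracks the current visual reconstruction.
        target = renderer.trajectory.detach()
        opt_s = torch.optim.Adam(simulator.parameters(), lr=1e-2)
        for _ in range(steps):
            loss = ((simulator.rollout(target[0], n_frames) - target) ** 2).mean()
            opt_s.zero_grad(); loss.backward(); opt_s.step()
    return renderer, simulator

In the actual system, step (1) updates the neural shape, deformation, and camera parameters, and step (2) updates physical quantities such as PD gains and body masses; the toy trajectory losses above only illustrate the alternation.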

Comparisons

Scale and Foot Contact (Fig. 2)

[Videos: Reference Video | BANMo | BANMo + contact prior | PPR]

To augment the BANMo results, we apply a simple contact prior (following NeuMan) to find an optimal relative scale between the cat and the background: the feet should not penetrate the ground, and should touch the ground in at least one frame. Although the contact prior finds a rough scale that makes the feet touch the ground in some frames (green), the cat's body still appears to float with a slanted pose in many others (red). Because PPR jointly solves for the scale and pose under physics constraints, it finds a configuration where the contacting feet touch the ground.
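
For reference, the baseline's scale search can be sketched as a one-dimensional fit: scaling the object about the camera center changes each point's signed height above the ground plane linearly in the scale, so one can scan scales for the one where no foot penetrates the ground and the lowest foot just touches it. This is our reading of the NeuMan-style prior, not its released implementation, and all names below are hypothetical:

import numpy as np

# foot_heights_unit: (n_frames, n_feet) signed foot heights above the ground
# plane at unit object scale (the n.p term); ground_offset is the plane
# offset d, so the height at scale s is s * (n.p) + d.
def fit_relative_scale(foot_heights_unit, ground_offset,
                       scales=np.linspace(0.1, 5.0, 500)):
    best_s, best_err = None, np.inf
    for s in scales:
        h = s * foot_heights_unit + ground_offset   # heights at scale s
        penetration = np.clip(-h, 0.0, None).sum()  # feet below the ground
        contact_gap = h.min()                       # lowest foot vs. ground
        err = penetration + abs(contact_gap)        # touch in >= 1 frame
        if err < best_err:
            best_s, best_err = s, err
    return best_s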




Body Pose Estimation (Fig. 6)

[Videos: Reference Video | HuMoR | BANMo | PPR | Ground-truth]

We compare PPR with state-of-the-art methods for human motion reconstruction (Fig. 6). HuMoR accurately reconstructs the body pose in the samba sequence, with the feet touching the ground. However, it fails to reconstruct the foot contact in the bouncing sequence. For both sequences, BANMo reconstructs slanted body poses and inaccurate foot contact. With the help of differentiable physics simulation, PPR reconstructs upright body poses and accurate foot contact.




3D Tracking (Fig. 5)

[Videos: Reference Video | BANMo | PPR | PPR physics simulation]

We compare BANMo and PPR on 3D tracking. BANMo fails to track the left rear foot (colored tracks) due to heavy occlusion during the walking motion; note that the track on the left foot occasionally swaps with the right foot. Furthermore, BANMo creates an artificial leg to explain the missing tracks on the front leg.
In contrast, PPR tracks the feet well and does not create an artificial leg. Because PPR produces a physically plausible reconstruction, the captured motion can be simulated with a physics engine (right).




Comparison with Animal Body Model (Supp. Fig. 2)

BARC fails to reconstruct the sharp ears of the dog and places the legs in the wrong positions, while PPR faithfully reconstructs both.

[Videos: Reference Video | BARC | PPR]

Video Results (Fig. 4)

Casual-Cat →[More]



Casual-Dog →[More]



Casual-Human →[More]



AMA-samba

AMA-bouncing

Novel View Synthesis

Reference View

Fixed Camera View

Top View

[Videos: Input Video | Surface Normal | Skeleton/Mesh]

Ablations (Tab. 3)

Reference Video
Ground-truth
[Videos: Full method | Physics → ground-fitting | Multi-cycle → one-cycle | PD control → open-loop control | Freeze PD gain and mass]
In the first row, we render the reconstructed mesh sequences; PPR estimates foot contact better than ground-fitting.
In the bottom two rows, we visualize surface reconstruction errors (top: ground truth; bottom: predicted meshes). The color indicates the Chamfer distance between the ground-truth and predicted meshes (yellow: large error).
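
For context, the "PD control → open-loop control" ablation swaps a feedback controller for pure replay. A generic proportional-derivative tracking law is sketched below; the gains are illustrative, and in the full method the PD gains and body masses are optimized rather than frozen:

import numpy as np

# Generic PD tracking controller: joint torques pull the simulated pose q
# toward the reference pose q_ref from the visual reconstruction.
def pd_torque(q, qdot, q_ref, qdot_ref, kp=60.0, kd=5.0):
    return kp * (q_ref - q) + kd * (qdot_ref - qdot)

# The open-loop ablation replays the reference controls without this
# feedback, so tracking errors accumulate instead of being corrected.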

3D Printed Objects

Bibtex

@inproceedings{yang2023ppr,
	title={PPR: Physically Plausible Reconstruction from Monocular Videos},
	author={Yang, Gengshan
	and Yang, Shuo
	and Zhang, John Z.
	and Manchester, Zachary
	and Ramanan, Deva},
	booktitle = {ICCV},
	year={2023},
}

Related Papers

Deformable shape reconstruction from video(s):
BANMo: Building Animatable 3D Neural Models from Many Casual Videos. CVPR 2022.
ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction. NeurIPS 2021.
LASR: Learning Articulated Shape Reconstruction from a Monocular Video. CVPR 2021.
DOVE: Learning Deformable 3D Objects by Watching Videos. arXiv preprint.
Physics-based video human reconstruction:
Differentiable Dynamics for Articulated 3D Human Motion Reconstruction. CVPR 2022.
Trajectory Optimization for Physics-Based Reconstruction of 3D Human Pose from Monocular Video. CVPR 2022.
Contact and Human Dynamics from Monocular Video. ECCV 2020.
PhysCap: Physically Plausible Monocular 3D Motion Capture in Real Time. SIGGRAPH Asia 2020.
Physics-based video scene reconstruction:
NeuPhysics: Editable Neural Geometry and Physics from Monocular Videos. NeurIPS 2022.
RISP: Rendering-Invariant State Predictor with Differentiable Simulation and Rendering for Cross-Domain Parameter Estimation. ICLR 2022.

Acknowledgments

Gengshan Yang is supported by the Qualcomm Innovation Fellowship and CMU Argo AI Center for Autonomous Vehicle Research. The RGBD-pet dataset is adapted from Total-Recon, collected with Chonghyuk Song and Kangle Deng. We thank Tao Chen and Xianyi Cheng for suggestions on simulation tools. We thank Swaminathan Gurumurthy for feedback on control and Sha Yi for help with 3D printing. We thank Gautam Gare, Jia Shi, and the anonymous reviewers for helpful comments.

Webpage design borrowed from Peiyun Hu