PPR: Physically Plausible Reconstruction from Monocular Videos

ICCV 2023 - Oral Presentation

Gengshan Yang	Shuo Yang	John Z. Zhang	Zachary Manchester	Deva Ramanan

Carnegie Mellon University

Given monocular videos, PPR builds 4D models of the object and the environment whose physical configurations satisfy dynamics and contact constraints.

Abstract

Given monocular videos, we build 3D models of articulated objects and environments whose 3D configurations satisfy dynamics and contact constraints. At its core, our method leverages differentiable physics simulation to aid visual reconstructions. We couple differentiable physics simulation with differentiable rendering via coordinate descent, which enables end-to-end optimization of, not only 3D reconstructions, but also physical system parameters from videos. We demonstrate the effectiveness of physics-informed reconstruction on monocular videos of quadruped animals and humans. It reduces reconstruction artifacts (e.g., scale ambiguity, unbalanced poses, and foot swapping) that are challenging to address by visual cues alone, and produces better foot contact estimation.

[Paper] [Code] [Poster (2M)]

Video

[Download]

Motivation

In monocular 3D reconstruction, there is a fundamental projective ambiguity, which allows multiple plausible interpretations of the relative scales of scene elements. We motivate this with an example illustrating the scale ambiguity between a cat and the background. Given the video on the left, we use existing methods to reconstruct the cat and the background independently, and then compose them with relative scale factors ranging from 0.4 to 3.
*We use variants of NeRF for background reconstruction, and BANMo for foreground reconstruction.

Input Video

0.4x cat scale

1x cat scale

2x cat scale

3x cat scale

Although all the 2D projections align perfectly with the input frame, their 3D trajectories appear quite different. Applying a relative scale of 2x produces the most reasonable 3D trajectory. A bigger scale factor (e.g., 3x) makes the cat appear to be sinking; a smaller scale factor (e.g., 0.4x) makes the cat appear to be floating.
Below, we visualize the 3D trajectories of the cat with the background.
* We fix the background scale and vary the object scale. The camera trajectory is visualized as colored axes.

0.4x cat scale

1x cat scale

2x cat scale

3x cat scale
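The scale ambiguity described above can be reproduced with a few lines of pinhole geometry. The sketch below is only an illustration, not code from PPR: the focal length, image center, and the random "cat" points are made-up assumptions. It scales a point set about the camera center and checks that the 2D projections do not change, even though the 3D depths do.

```python
import numpy as np

def project(points_cam, focal=500.0, center=(320.0, 240.0)):
    """Pinhole projection of (N, 3) camera-frame points to (N, 2) pixels."""
    x, y, z = points_cam[:, 0], points_cam[:, 1], points_cam[:, 2]
    u = focal * x / z + center[0]
    v = focal * y / z + center[1]
    return np.stack([u, v], axis=1)

# Hypothetical object points in the camera frame (z > 0 is in front of the camera).
rng = np.random.default_rng(0)
cat_points = rng.uniform([-0.3, 0.0, 1.5], [0.3, 0.4, 2.5], size=(100, 3))

reference_pixels = project(cat_points)
for scale in (0.4, 1.0, 2.0, 3.0):
    scaled = scale * cat_points  # scale the object about the camera center
    max_pixel_diff = np.abs(project(scaled) - reference_pixels).max()
    print(f"{scale}x: mean depth {scaled[:, 2].mean():.2f}, "
          f"max pixel difference {max_pixel_diff:.2e}")
```

Every scale yields the same pixels up to floating-point error, so the image alone cannot pick one; only the relation to the fixed background (floating versus sinking) differs.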

Besides scale, there are further ambiguities in the scene, for instance in the body poses and the motion in the world frame. As shown in "Reconstruction (2x)", even with a roughly correct scale, where the cat's feet touch the ground at some frames (green), the cat still floats with a slanted pose at many other frames (red). These errors are confounded with the scale ambiguity, making them hard to correct even with body and contact priors.
As a fundamental prior governing the dynamics and contact, can physics help us do better?

Method Preview

We aim to find a physically plausible 3D interpretation of a video by coupling visual reconstruction with physics simulation. On one hand, visual reconstruction provides the starting point and a target trajectory for physics simulation; on the other hand, physics simulation provides a prior for visual reconstruction. By alternating between differentiable rendering optimization and differentiable physics simulation, we can reach a solution where the reconstruction is physically plausible.
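The alternation between the two optimizers can be illustrated with a deliberately tiny toy problem. This is not the PPR implementation: the 1D "foot heights", the quadratic losses, the coupling weight, and the ground clamp standing in for a physics simulator are all assumptions made for the sketch. It only shows the coordinate-descent pattern of updating one block of variables while the other is held fixed.

```python
import numpy as np

def visual_step(observed, simulated, weight):
    """Hold the simulated heights fixed and minimize
    ||recon - observed||^2 + weight * ||recon - simulated||^2 over recon."""
    return (observed + weight * simulated) / (1.0 + weight)

def physics_step(recon, ground=0.0):
    """Hold the reconstruction fixed and return the closest heights that do
    not penetrate the ground (a toy stand-in for a physics simulator)."""
    return np.maximum(recon, ground)

def alternate(observed, weight=1.0, cycles=25):
    """Coordinate descent: alternate the two steps for a number of cycles."""
    recon = observed.copy()
    simulated = physics_step(recon)
    for _ in range(cycles):
        recon = visual_step(observed, simulated, weight)
        simulated = physics_step(recon)
    return recon, simulated

# Hypothetical per-frame foot heights from a purely visual fit;
# negative values penetrate the ground at height 0.
observed = np.array([0.3, 0.1, -0.2, -0.1, 0.2])
recon, simulated = alternate(observed)
print("reconstruction:", recon)
print("simulation:    ", simulated)
```

In this toy, frames that were already feasible are left unchanged, while infeasible frames are pulled toward the feasible set by an amount controlled by `weight`.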

For the input videos on the left, we show the simulated motion over optimization cycles:

These are the input videos we want to track.

At the first cycle, the physical systems lose balance, because the trajectories provided by visual reconstruction are infeasible.

After 25 cycles, the visual reconstruction (e.g., the scale) is improved, allowing the physical system to follow it.

As the reconstruction of body pose and root body motion becomes better, the physical system successfully follows the reference video.
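The ablations on this page refer to PD control with adjustable gains and mass. As a rough illustration of how a simulated body can follow a reference trajectory under PD control, the sketch below tracks a 1D sinusoid with a point mass. Everything here is an assumption for illustration (the 1D point mass, the gains `kp` and `kd`, the time step, and the reference signal); it is not the controller or the simulator used in PPR.

```python
import numpy as np

def track_with_pd(reference, dt=0.01, mass=1.0, kp=100.0, kd=20.0):
    """Simulate a 1D point mass driven by a PD controller toward `reference`."""
    ref_vel = np.gradient(reference, dt)  # finite-difference target velocity
    q, v = reference[0], 0.0              # start on the reference, at rest
    positions = np.empty_like(reference)
    for t in range(len(reference)):
        positions[t] = q
        # PD law: force proportional to position error plus velocity error.
        force = kp * (reference[t] - q) + kd * (ref_vel[t] - v)
        v += dt * force / mass            # semi-implicit Euler integration
        q += dt * v
    return positions

times = np.arange(0.0, 10.0, 0.01)
reference = 0.5 * np.sin(times)           # hypothetical target trajectory
simulated = track_with_pd(reference)
tracking_error = np.abs(simulated - reference)
print(f"max tracking error after 5 s: {tracking_error[500:].max():.4f}")
```

Because the controller reacts to the current error at every step, small disturbances are corrected as they occur, which is the usual motivation for closed-loop PD control over replaying a fixed, open-loop force sequence.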

Comparisons

Scale and Foot Contact (Fig. 2)

Reference Video

BANMo

BANMo+contact prior

PPR

To augment BANMo results, we apply a simple contact prior (following NeuMan) to find an optimal relative scale between the cat and the background: the feet should not penetrate the ground, and should touch the ground in at least one frame. Although the contact prior finds a rough scale that makes the feet touch the ground for some frames (green), the cat's body still appears to float with a slanted pose at many other frames (red). Because PPR jointly solves for the scale and pose under physics constraints, it finds a configuration where the contact feet touch the ground.
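One way to turn the contact prior above into a computation is sketched below. It is a hypothetical formulation, not the NeuMan or PPR code: it assumes the object is scaled about the camera center, that each foot sample comes with a ground plane n . x + d = 0 expressed in the same camera frame, and that the camera is above the ground (d > 0). Under those assumptions, the signed height of a scaled foot point is s * (n . p) + d, and the largest scale with no penetration makes at least one foot touch the ground.

```python
import numpy as np

def contact_scale(foot_points, plane_normals, plane_offsets):
    """Largest scale s such that every scaled foot point s * p stays on or
    above its ground plane n . x + d = 0 (signed height = s * n . p + d).

    foot_points:   (N, 3) foot positions in the camera frame, one per sample
    plane_normals: (N, 3) unit ground normals in the same camera frames
    plane_offsets: (N,)   plane offsets d > 0 (camera height above the ground)
    """
    slopes = np.einsum("ij,ij->i", plane_normals, foot_points)  # n . p
    below = slopes < 0  # only points that approach the ground as s grows
    if not below.any():
        raise ValueError("no foot point moves toward the ground as scale grows")
    return np.min(plane_offsets[below] / (-slopes[below]))

# Hypothetical data: camera 1 unit above the ground, image y-axis pointing down.
foot_points = np.array([[0.1, 0.5, 2.0], [0.0, 0.8, 2.2], [-0.1, 0.625, 2.1]])
plane_normals = np.tile([0.0, -1.0, 0.0], (3, 1))
plane_offsets = np.ones(3)

scale = contact_scale(foot_points, plane_normals, plane_offsets)
heights = scale * np.einsum("ij,ij->i", plane_normals, foot_points) + plane_offsets
print("scale:", scale, "foot heights:", heights)
```

A single scale chosen this way fixes only one degree of freedom, which is consistent with the observation above that the pose can remain slanted and floating in other frames.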

Body Pose Estimation (Fig. 6)

Reference Video

HuMoR

BANMo

PPR

Ground-truth

We compare PPR with state-of-the-art methods for human motion reconstruction (Fig. 6). HuMoR accurately reconstructs the body pose of the samba sequence, with feet touching the ground. However, it fails to reconstruct the foot contact for the bouncing sequence. For both sequences, BANMo reconstructs slanted body poses and inaccurate foot contact. With the help of differentiable physics simulation, PPR reconstructs upright body poses and accurate foot contact.

3D Tracking (Fig. 5)

Reference Video

BANMo

PPR

PPR physics simulation

We compare BANMo and PPR in terms of 3D tracking. BANMo fails to track the left rear foot (colored tracks) due to heavy occlusion during the walking motion. Note that the track on the left foot is occasionally swapped with the right foot. Furthermore, BANMo creates an artificial leg to explain the missing tracks on the front leg.
In contrast, PPR tracks the feet well, and does not create an artificial leg. Because PPR produces physically plausible reconstruction, the captured motion can be simulated with a physics engine (right).

Comparison with Animal Body Model (Supp. Fig. 2)

BARC fails to reconstruct the sharp ears of the dog, and puts the legs into the wrong positions, while PPR faithfully reconstructs them.

Reference Video

BARC

PPR

Video Results (Fig. 4)

Casual-Cat →[More]

Casual-Dog →[More]

Casual-Human →[More]

AMA-samba

AMA-bouncing

Novel View Synthesis

Reference View

Fixed Camera View

Top View

Input Video

Surface Normal

Skeleton/Mesh

Ablations (Tab. 3)

Reference Video

Ground-truth

Full method

Physics → ground-fitting

Multi cycle → one-cycle

PD control → open-loop control

Freeze PD gain and mass

In the first row, we render reconstructed mesh sequences. PPR works better than ground-fitting on foot contact estimation.
In the bottom two rows, we visualize the surface reconstruction errors. Top: Ground-truth. Bottom: Predicted meshes. The color indicates the Chamfer distance between the ground-truth and the predicted mesh (yellow: large error).
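For readers unfamiliar with the Chamfer distance used to color the meshes, here is a minimal sketch on point sets sampled from two surfaces. The exact variant used for the results on this page (squared or unsquared distances, sampling density, units) is not stated here, so the symmetric mean of nearest-neighbor Euclidean distances below is an assumption.

```python
import numpy as np

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between (N, 3) and (M, 3) point sets.

    Returns the scalar distance and the per-point nearest-neighbor distance
    from each point in `points_a` to `points_b` (usable as an error color).
    """
    diff = points_a[:, None, :] - points_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)   # (N, M) pairwise distances
    a_to_b = dist.min(axis=1)              # nearest neighbor in b for each a
    b_to_a = dist.min(axis=0)              # nearest neighbor in a for each b
    return 0.5 * (a_to_b.mean() + b_to_a.mean()), a_to_b

# Hypothetical point samples: a "predicted" surface offset from the "ground truth".
rng = np.random.default_rng(0)
ground_truth = rng.normal(size=(200, 3))
predicted = ground_truth + np.array([0.0, 0.0, 0.05])

distance, per_point_error = chamfer_distance(predicted, ground_truth)
print(f"Chamfer distance: {distance:.4f}")
```

The brute-force pairwise matrix is fine for a few thousand points; larger meshes would typically use a spatial index such as a k-d tree instead.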

3D Printed Objects

Bibtex

@inproceedings{yang2023ppr,
	title={Physically Plausible Reconstruction from Monocular Videos},
	author={Yang, Gengshan
	and Yang, Shuo
	and Zhang, John Z.
	and Manchester, Zachary
	and Ramanan, Deva},
	booktitle = {ICCV},
	year={2023},
}

Acknowledgments

Gengshan Yang is supported by the Qualcomm Innovation Fellowship and CMU Argo AI Center for Autonomous Vehicle Research. The RGBD-pet dataset is adapted from Total-Recon, collected with Chonghyuk Song and Kangle Deng. We thank Tao Chen and Xianyi Cheng for suggestions on simulation tools. We thank Swaminathan Gurumurthy for feedback on control and Sha Yi for help with 3D printing. We thank Gautam Gare, Jia Shi, and the anonymous reviewers for helpful comments.
