Given monocular videos, we build 3D models of articulated objects and environments whose 3D configurations
satisfy dynamics and contact constraints. At its core, our method leverages differentiable physics simulation to
aid visual reconstructions. We couple differentiable physics simulation with differentiable rendering via
coordinate descent, which enables end-to-end optimization of not only 3D reconstructions but also physical
system parameters from videos. We demonstrate the effectiveness of physics-informed reconstruction on monocular
videos of quadruped animals and humans. It reduces reconstruction artifacts (e.g., scale ambiguity, unbalanced
poses, and foot swapping) that are challenging to address by visual cues alone, and produces better foot contact
estimation.
In monocular 3D reconstruction, there is a fundamental projective ambiguity, causing multiple plausible
interpretations of the scales between scene elements.
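The ambiguity can be illustrated with a minimal sketch, assuming an ideal pinhole camera. The focal length, principal point, and point coordinates below are hypothetical values chosen for illustration. Scaling an object about the camera center changes its 3D size and depth together, so its 2D projection stays the same:

```python
import numpy as np

def project(points_cam, focal=500.0, center=(320.0, 240.0)):
    """Pinhole projection of Nx3 camera-frame points (z > 0) to Nx2 pixels."""
    points_cam = np.asarray(points_cam, dtype=float)
    uv = focal * points_cam[:, :2] / points_cam[:, 2:3]
    return uv + np.asarray(center)

# Hypothetical object points in the camera frame (meters).
cat = np.array([[0.1, 0.2, 2.0], [-0.3, 0.25, 2.5], [0.0, -0.1, 3.0]])

for s in (0.4, 1.0, 2.0, 3.0):
    # Scaling about the camera center moves and resizes the object in 3D,
    # yet every scaled copy lands on the same pixels.
    assert np.allclose(project(s * cat), project(cat))
```

Because the image alone cannot distinguish these copies, the relative scale has to come from another cue, such as contact with the background.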
We motivate this with an example illustrating the scale ambiguity between
a cat and the background.
Given the video on the left, we use existing methods to reconstruct
the cat and the background independently, and then compose them with
relative scale factors ranging from 0.4 to 3.
*We use variants of NeRF for background reconstruction, and BANMo for foreground
reconstruction.
Video panels: Input Video, 0.4x cat scale, 1x cat scale, 2x cat scale, 3x cat scale
Although all the 2D projections align perfectly with the input frame,
their 3D trajectories appear quite different.
Applying a relative scale of 2x produces the most reasonable 3D trajectory.
A bigger scale factor (e.g., 3x) makes the cat appear to be sinking;
a smaller scale factor (e.g., 0.4x) makes the cat appear to be floating.
Below, we visualize the 3D trajectories of the cat with the background.
* We fix the background scale and vary the object scale. The camera trajectory is visualized as colored axes.
3D trajectory panels: 0.4x cat scale, 1x cat scale, 2x cat scale, 3x cat scale
Besides scale, there are further ambiguities in the scene,
for instance in the body poses and the motion in the world frame.
As shown in "Reconstruction (2x)", even with a roughly correct scale,
the cat's feet touch the ground in some frames (green), but the cat still floats with a slanted pose in
many other frames (red).
Those errors are confounded with the scale ambiguity, making it hard to correct
even with body and contact priors.
As a fundamental prior governing the dynamics and contact, can physics help us do better?
Method Preview
We aim to find a physically plausible 3D interpretation of a video
by coupling visual reconstruction with physics simulation.
On one hand, visual reconstruction provides the starting point and a target trajectory
for physics simulation; on the other hand, physics simulation provides a prior for visual reconstruction.
By alternating between differentiable rendering optimization
and differentiable physics simulation, we can
reach a solution where the reconstruction is physically plausible.
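The alternating scheme can be sketched on a toy problem. This is not PPR's implementation: it assumes a single point mass in free fall with known gravity, and it uses closed-form least-squares updates in place of differentiable rendering and differentiable physics simulation. All function names are hypothetical. The "reconstruction" is a height trajectory known only up to scale, and the "physics" is a ballistic model with parameters (h0, v0):

```python
import numpy as np

G = 9.81  # gravitational acceleration in m/s^2, assumed known

def simulate(theta, t):
    """Toy physics: ballistic height for theta = (initial height, initial velocity)."""
    h0, v0 = theta
    return h0 + v0 * t - 0.5 * G * t**2

def physics_step(scale, shape, t):
    """Hold the reconstruction fixed; fit (h0, v0) to it by least squares."""
    A = np.stack([np.ones_like(t), t], axis=1)
    target = scale * shape + 0.5 * G * t**2
    theta, *_ = np.linalg.lstsq(A, target, rcond=None)
    return theta

def reconstruction_step(theta, shape, t):
    """Hold the physics fixed; pick the scale that best matches the simulation."""
    sim = simulate(theta, t)
    return float(shape @ sim / (shape @ shape))

def coordinate_descent(shape, t, scale=1.0, cycles=2000):
    """Alternate between the two blocks of variables for a fixed number of cycles."""
    for _ in range(cycles):
        theta = physics_step(scale, shape, t)
        scale = reconstruction_step(theta, shape, t)
    return scale, theta

# Synthetic up-to-scale reconstruction of a trajectory whose true scale is 2.
t = np.linspace(0.0, 1.0, 50)
true_scale, true_theta = 2.0, (1.0, 0.5)
shape = simulate(true_theta, t) / true_scale
scale, theta = coordinate_descent(shape, t)
```

In this toy setting, the known gravity term is what makes the scale identifiable: each block update reduces the same squared residual, and the iterates approach the true scale and physical parameters.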
For the input videos on the left, we show the simulated motion over optimization cycles:
These are the input videos we want to track.
At the first cycle, the physical systems lose balance, because the trajectories provided by visual
reconstruction are infeasible.
After 25 cycles, the visual reconstruction (e.g., the scale) is improved,
allowing the physical system to follow it.
As the reconstruction of body pose and root body motion becomes better,
the physical system successfully follows the reference video.
Comparisons
Scale and Foot Contact (Fig. 2)
Video panels: Reference Video, BANMo, BANMo+contact prior, PPR
To augment BANMo results, we apply a simple contact prior (following NeuMan) to find an optimal
relative scale between the cat and the background: the feet should not penetrate the ground,
and should touch the ground in at least one frame.
Although the contact prior finds a rough scale that makes the feet
touch the ground in some frames (green), the cat's body still appears
to float with a slanted pose in many other frames (red).
Because PPR jointly solves the scale and pose under physics constraints,
it finds a configuration where the contact feet touch the ground.
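The contact prior used for the BANMo+contact prior baseline can be sketched as follows. This is a minimal illustration, not the NeuMan or PPR code. It assumes the object is scaled about the camera center, that the ground is a known plane expressed in each frame's camera coordinates, and that foot positions are available per frame; the function name and argument layout are hypothetical. Under these assumptions, the two conditions (no penetration, contact in at least one frame) pick out the largest scale that keeps every foot on or above the ground:

```python
import numpy as np

def contact_scale(feet_cam, normals, offsets):
    """Largest object scale for which no foot penetrates the ground.

    feet_cam: (T, F, 3) foot positions per frame in the camera frame, at scale 1.
    normals:  (T, 3) unit ground normals in each camera frame, pointing up.
    offsets:  (T,) plane offsets d, with the ground at {x : n.x + d = 0};
              d > 0 is the camera's height above the ground.
    """
    feet_cam = np.asarray(feet_cam, dtype=float)
    normals = np.asarray(normals, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    # How far each foot lies below the camera along the ground normal, at scale 1.
    drop = -np.einsum("tfk,tk->tf", feet_cam, normals)
    below = drop > 0  # only feet below the camera can reach the ground
    if not below.any():
        raise ValueError("no foot lies below the camera")
    # After scaling by s, a foot sits at height d - s * drop above the ground,
    # so it stays on or above the ground as long as s <= d / drop.
    height = np.broadcast_to(offsets[:, None], drop.shape)
    return float((height[below] / drop[below]).min())

# Hypothetical example: camera 1.5 m above the ground, camera y-axis pointing down.
feet = np.array([[[0.0, 0.5, 2.0], [0.0, 0.75, 2.0]],
                 [[0.0, 0.6, 2.2], [0.0, 0.7, 2.2]]])
normals = np.array([[0.0, -1.0, 0.0], [0.0, -1.0, 0.0]])
offsets = np.array([1.5, 1.5])
scale = contact_scale(feet, normals, offsets)
```

At the returned scale, the lowest foot over all frames just touches the ground. The remaining feet and frames are left unconstrained, which is consistent with the floating and slanted poses seen in the baseline.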
Body Pose Estimation (Fig. 6)
Video panels: Reference Video, HuMoR, BANMo, PPR, Ground-truth
We compare PPR with state-of-the-art methods for human motion reconstruction (Fig. 6).
HuMoR accurately reconstructs the body pose of the samba sequence, with feet touching the ground.
However, it fails to reconstruct the foot contact for the bouncing sequence.
For both sequences, BANMo reconstructs slanted body poses and inaccurate foot contact.
With the help of differentiable physics simulation, PPR reconstructs upright body poses and accurate foot
contact.
3D Tracking (Fig. 5)
Video panels: Reference Video, BANMo, PPR, PPR physics simulation
We compare BANMo and PPR in terms of 3D tracking. BANMo fails to track
the left rear foot (colored tracks) due to heavy occlusion during the walking motion. Note
that the track on the left foot is occasionally swapped with the right foot.
Furthermore, it creates an artificial leg to explain the missing tracks on the front leg.
In contrast, PPR tracks the feet well, and does not create an artificial leg.
Because PPR produces physically plausible reconstruction, the captured motion can be
simulated with a physics engine (right).
Comparison with Animal Body Model (Supp. Fig. 2)
BARC fails to reconstruct the sharp ears of the dog, and puts the legs into the wrong positions, while
PPR faithfully reconstructs them.
In the first row, we render reconstructed mesh sequences. PPR works better than
ground-fitting on foot contact estimation.
In the bottom two rows, we visualize the surface reconstruction errors.
Top: Ground-truth. Bottom: Predicted meshes.
The color indicates the Chamfer distance between the ground-truth and the predicted mesh (yellow: large error).
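For reference, a Chamfer-style error of this kind can be computed as in the sketch below. It assumes both meshes are represented by sampled surface points and uses one common definition (mean nearest-neighbor Euclidean distance, averaged over both directions). The exact variant behind the visualization is not stated here, so treat the function names and the definition as assumptions:

```python
import numpy as np

def nearest_distances(src, dst):
    """For each point in src (N, 3), the distance to its nearest neighbor in dst (M, 3)."""
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    diff = src[:, None, :] - dst[None, :, :]  # (N, M, 3) pairwise offsets
    return np.sqrt((diff ** 2).sum(axis=-1)).min(axis=1)

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance: mean nearest-neighbor distance in both directions."""
    return 0.5 * (nearest_distances(pred, gt).mean()
                  + nearest_distances(gt, pred).mean())

# Hypothetical point samples from a ground-truth and a predicted surface.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.5]])
per_point_error = nearest_distances(pred, gt)  # values that could drive a color map
total_error = chamfer_distance(pred, gt)
```

The per-point distances are what a color map such as the one above would display, while the symmetric mean summarizes the error as a single number. The brute-force pairwise computation is only practical for small point sets.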
3D Printed Objects
Bibtex
@inproceedings{yang2023ppr,
title={Physically Plausible Reconstruction from Monocular Videos},
author={Yang, Gengshan
and Yang, Shuo
and Zhang, John Z.
and Manchester, Zachary
and Ramanan, Deva},
booktitle = {ICCV},
year={2023},
}
Gengshan Yang is supported by the Qualcomm Innovation Fellowship and the CMU Argo AI Center for Autonomous Vehicle
Research.
We thank Tao Chen and Xianyi Cheng for suggestions on simulation tools. We thank Swaminathan Gurumurthy for
feedback on control and Sha Yi for help with 3D printing.
We thank Gautam Gare, Jia Shi, and the anonymous reviewers for helpful comments.