Given monocular videos, PPR builds 4D models of the object and the environment whose physical configurations satisfy dynamics and contact constraints.
Gengshan Yang | Shuo Yang | John Z. Zhang | Zachary Manchester | Deva Ramanan
Carnegie Mellon University
Given monocular videos, we build 3D models of articulated objects and environments whose 3D configurations satisfy dynamics and contact constraints. At its core, our method leverages differentiable physics simulation to aid visual reconstruction. We couple differentiable physics simulation with differentiable rendering via coordinate descent, which enables end-to-end optimization not only of 3D reconstructions but also of physical system parameters from videos. We demonstrate the effectiveness of physics-informed reconstruction on monocular videos of quadruped animals and humans: it reduces reconstruction artifacts (e.g., scale ambiguity, unbalanced poses, and foot swapping) that are challenging to address with visual cues alone, and produces better foot contact estimates.
In monocular 3D reconstruction, there is a fundamental projective ambiguity that admits multiple plausible interpretations of the relative scales between scene elements. We motivate this with an example illustrating the scale ambiguity between a cat and the background. Given the video on the left, we use existing methods* to reconstruct the cat and the background independently, and then compose them with relative scale factors ranging from 0.4x to 3x.
*We use variants of NeRF for background reconstruction, and BANMo for foreground reconstruction.
Input Video | 0.4x cat scale | 1x cat scale | 2x cat scale | 3x cat scale
Although all the 2D projections align perfectly with the input frame, the 3D trajectories appear quite different. Applying a relative scale of 2x produces the most reasonable 3D trajectory: a larger scale factor (e.g., 3x) makes the cat appear to sink into the ground, while a smaller one (e.g., 0.4x) makes it appear to float.
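The ambiguity has a one-line explanation under the standard pinhole model. Writing the projection in homogeneous coordinates (K is the camera intrinsics matrix, X a 3D point in the camera frame, and ≃ equality up to scale), rescaling the object about the camera center by any α > 0 leaves every pixel unchanged:

```latex
% Pinhole projection is defined only up to scale, so rescaling a point
% about the camera center does not change where it lands in the image.
\[
  x \;\simeq\; K X
  \qquad\Longrightarrow\qquad
  K(\alpha X) \;=\; \alpha\,(K X) \;\simeq\; K X
  \quad \text{for any } \alpha > 0 .
\]
```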
Below, we visualize the 3D trajectories of the cat with the background.
* We fix the background scale and vary the object scale. The camera trajectory is visualized as colored axes.
0.4x cat scale | 1x cat scale | 2x cat scale | 3x cat scale
We aim to find a physically plausible 3D interpretation of a video by coupling visual reconstruction with physics simulation. On one hand, visual reconstruction provides the starting point and a target trajectory for physics simulation; on the other hand, physics simulation provides a prior for visual reconstruction. By alternating between differentiable rendering optimization and differentiable physics simulation, we reach a solution where the reconstruction is physically plausible.
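To make the alternation concrete, here is a toy, runnable sketch of the coordinate-descent loop in PyTorch. The variables and losses (`scale`, `mass`, `rendering_loss`, `physics_loss`) are illustrative stand-ins, not the PPR codebase: the rendering loss is deliberately flat in the scale (the projective ambiguity), so only the physics phases can pin the scale down.

```python
import torch

# Toy stand-ins (hypothetical): `scale` plays the role of the visual
# reconstruction parameters, `mass` the physical system parameters.
scale = torch.tensor(1.0, requires_grad=True)  # object/background relative scale
mass = torch.tensor(5.0, requires_grad=True)   # a physics parameter

def rendering_loss(s):
    # Pretend reprojection error: constant in the scale (projective
    # ambiguity), so visual cues alone cannot determine it.
    return 0.0 * s

def physics_loss(s, m):
    # Pretend simulation-vs-reconstruction mismatch, minimized when the
    # scaled trajectory is dynamically consistent (toy optimum: s = 2).
    return (s - 2.0) ** 2 + 0.1 * (m - 4.0) ** 2

opt_vis = torch.optim.Adam([scale], lr=0.05)
opt_phys = torch.optim.Adam([mass], lr=0.05)

for cycle in range(10):
    for _ in range(50):  # rendering phase: fit reconstruction, physics as prior
        opt_vis.zero_grad()
        loss = rendering_loss(scale) + physics_loss(scale, mass.detach())
        loss.backward()
        opt_vis.step()
    for _ in range(50):  # physics phase: fit system parameters to reconstruction
        opt_phys.zero_grad()
        loss = physics_loss(scale.detach(), mass)
        loss.backward()
        opt_phys.step()

print(f"scale = {scale.item():.2f}, mass = {mass.item():.2f}")  # -> ~2.00, ~4.00
```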
For the input videos on the left, we show the simulated motion over optimization cycles:
Reference Video | BANMo | BANMo+contact prior | PPR
To augment the BANMo results, we apply a simple contact prior (following NeuMan) to find an optimal relative scale between the cat and the background: the feet should not penetrate the ground, and should touch the ground in at least one frame. Although the contact prior finds a rough scale that makes the feet touch the ground in some frames (green), the cat's body still appears to float with a slanted pose in many other frames (red). Because PPR jointly solves for the scale and pose under physics constraints, it finds a configuration where the feet in contact actually touch the ground.
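For concreteness, here is a toy, runnable sketch of such a contact-prior scale search (hypothetical names and made-up numbers; real foot heights would come from the reconstructed foot joints). When the object is rescaled about the camera center, each foot's height above the ground plane is an affine function of the scale, so we model the per-frame lowest-foot height that way:

```python
import numpy as np

# Affine model of the lowest foot height per frame: h_t(s) = A[t] * s + B[t].
# A and B are made-up numbers standing in for reconstructed geometry.
A = np.array([-0.6, -0.5, -0.7, -0.4])  # slope w.r.t. the relative scale s
B = np.array([1.5, 1.2, 1.6, 1.0])      # height at s = 0

def lowest_foot_height(s):
    """Lowest foot height above the ground over all frames, at scale s."""
    return float(np.min(A * s + B))

def fit_scale(scales=np.linspace(0.2, 4.0, 400), tol=1e-3):
    """Contact prior: feet never penetrate the ground (height >= -tol) and
    touch it in at least one frame (minimum height as close to 0 as possible)."""
    feasible = [(abs(lowest_foot_height(s)), s)
                for s in scales if lowest_foot_height(s) >= -tol]
    return min(feasible)[1]

print(fit_scale())  # ~2.29 for this toy data: the largest non-penetrating scale
```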
Reference Video | HuMoR | BANMo | PPR | Ground-truth
We compare PPR with state-of-the-art methods for human motion reconstruction (Fig. 6). HuMoR accurately reconstructs the body pose in the samba sequence, with feet touching the ground, but it fails to reconstruct the foot contact in the bouncing sequence. For both sequences, BANMo reconstructs slanted body poses and inaccurate foot contact. With the help of differentiable physics simulation, PPR reconstructs upright body poses and accurate foot contact.
Reference Video | BANMo | PPR | PPR physics simulation
We compare BANMo and PPR in terms of 3D tracking. BANMo fails to track the left rear foot (colored tracks) due to heavy occlusion during the walking motion; note that the track on the left foot is occasionally swapped with the right foot. Furthermore, BANMo creates an artificial leg to explain the missing tracks on the front leg. In contrast, PPR tracks the feet well and does not create an artificial leg. Because PPR produces a physically plausible reconstruction, the captured motion can be simulated with a physics engine (right).
Reference Video | BARC | PPR
Input Video | Surface Normal | Skeleton/Mesh
Full method | Physics → ground-fitting | Multi-cycle → one-cycle | PD control → open-loop control | Freeze PD gain and mass
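One ablation above replaces PD control with open-loop control. As a reference point, here is a minimal sketch of PD tracking of reference joint angles (illustrative gains and names, not the PPR controller):

```python
import numpy as np

def pd_torque(q, qd, q_ref, kp=60.0, kd=2.0):
    """Proportional-derivative control: joint torques that drive the current
    joint angles q (with velocities qd) toward a reference pose q_ref.
    Open-loop control would instead replay torques without this feedback."""
    return kp * (q_ref - q) - kd * qd

# Example: three joints, slightly off the reference and still moving.
print(pd_torque(q=np.zeros(3), qd=np.array([0.1, 0.0, -0.1]),
                q_ref=np.array([0.2, 0.1, 0.0])))
```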
@inproceedings{yang2023ppr,
  title     = {Physically Plausible Reconstruction from Monocular Videos},
  author    = {Yang, Gengshan and Yang, Shuo and Zhang, John Z. and Manchester, Zachary and Ramanan, Deva},
  booktitle = {ICCV},
  year      = {2023},
}
Deformable shape reconstruction from video(s):
BANMo: Building Animatable 3D Neural Models from Many Casual Videos. CVPR 2022.
ViSER: Video-Specific Surface Embeddings for Articulated 3D Shape Reconstruction. NeurIPS 2021.
LASR: Learning Articulated Shape Reconstruction from a Monocular Video. CVPR 2021.
DOVE: Learning Deformable 3D Objects by Watching Videos. arXiv preprint.
Physics-based video human reconstruction:
Differentiable Dynamics for Articulated 3D Human Motion Reconstruction. CVPR 2022.
Trajectory Optimization for Physics-Based Reconstruction of 3D Human Pose from Monocular Video. CVPR 2022.
Contact and Human Dynamics from Monocular Video. ECCV 2020.
PhysCap: Physically Plausible Monocular 3D Motion Capture in Real Time. SIGGRAPH Asia 2020.
Physics-based video scene reconstruction:
NeuPhysics: Editable Neural Geometry and Physics from Monocular Videos. NeurIPS 2022.
RISP: Rendering-Invariant State Predictor with Differentiable Simulation and Rendering for Cross-Domain Parameter Estimation. ICLR 2022.
Gengshan Yang is supported by the Qualcomm Innovation Fellowship and CMU Argo AI Center for Autonomous Vehicle Research. The RGBD-pet dataset is adapted from Total-Recon, collected with Chonghyuk Song and Kangle Deng. We thank Tao Chen and Xianyi Cheng for suggestions on simulation tools. We thank Swaminathan Gurumurthy for feedback on control and Sha Yi for help with 3D printing. We thank Gautam Gare, Jia Shi, and the anonymous reviewers for helpful comments.