I'm a researcher at World Labs. I completed my PhD at CMU Robotics under the guidance of Prof. Deva Ramanan. Here is my PhD thesis.
Google Scholar  /  GitHub  /  Twitter  /  Email
I'm interested in 3D computer vision and its intersections with rendering, simulation, robotics, and animal behavior. My research topics include 3D/4D reconstruction, inverse problems (e.g., inverse rendering, physics and control), and motion generation.
We learn interactive behavior models of an agent grounded in 3D from casual videos.
Dynamic 3D human with cloth & object interactions from a single video, enabled by a two-layer motion field that fuses a 3D human prior with generic pixel priors (e.g., normal, flow).
3D Gaussian Splatting for SLAM enables precise camera tracking and high-fidelity reconstruction using an RGBD camera.
An end-to-end motion transfer framework from monocular videos to legged robots.
Given monocular videos, PPR builds 4D models of the object and the environment whose physical configurations satisfy dynamics and contact constraints.
Total-Recon explains an RGBD video with compositional 4D neural fields, which enables extreme view synthesis including embodied views, 3rd-person views, and bird's-eye views.
RAC learns category-level deformable 3D models from monocular videos. It disentangles morphology and motion and allows for motion retargeting.
We distill offline-optimized dynamic NeRFs into efficient video shape, pose, and appearance predictors.
A 3D-aware conditional generative model for controllable image synthesis. Given a 2D label map, such as a segmentation or edge map, our model learns to synthesize images that are consistent across viewpoints.
Given casual videos capturing a deformable object, BANMo reconstructs an animatable 3D model in a differentiable volume rendering framework.
Given a long video or multiple short videos, ViSER jointly optimizes articulated 3D shapes and a pixel-surface embedding to establish dense correspondences over video frames.
Given several (8-16) unposed images of the same instance, NeRS optimizes for a textured 3D reconstruction along with the illumination parameters at test-time.
A template-free approach for articulated shape reconstruction from a single video by combining differentiable rendering and data-driven correspondence and segmentation priors.
We analyze how to decompose two frames into a rigid background and multiple moving rigid bodies and propose a neural architecture to segment rigid motion groups given two frames.
We describe a neural architecture to upgrade 2D optical flow to 3D scene flow using optical expansion, which reveals changes in depth of scene elements over frames, e.g., things moving closer will get bigger.
We introduce several simple modifications to the optical flow volumetric layers that 1) significantly reduce computation and parameters, 2) enable test-time adaptation of cost volume size, and 3) converge much faster.
To address the problem of real-time stereo matching on high-res imagery, we propose an end-to-end framework that searches for correspondences incrementally over a coarse-to-fine hierarchy.
We cast the continuous problem of depth regression as discrete binary classification, whose outputs are occupancy probabilities on a 3D voxel grid. Such outputs reliably and efficiently capture multi-modal depth distributions in ambiguous cases.
The 4th CV4Animals workshop at CVPR 2024.
Carnegie Mellon Jazz Choir Fall 2022 performance.