From left to right: Actor (Target 3D Poses), Input Photo, Texture Mapped, and Imitator (Output).

We propose 3DHM, a two-stage framework that synthesizes 3D human motion by completing a texture map from a single image and then rendering the textured 3D human to imitate the actor's actions.

Abstract

We present a diffusion model-based framework for animating people from a single image for a given target 3D motion sequence.

Our approach has two core components: (a) learning priors about invisible parts of the human body and clothing, and (b) rendering novel body poses with proper clothing and texture. For the first component, we learn an in-filling diffusion model to hallucinate the unseen parts of a person given a single image. We train this model in texture-map space, which makes it more sample-efficient, since the representation is invariant to pose and viewpoint. For the second, we develop a diffusion-based rendering pipeline controlled by 3D human poses. This produces realistic renderings of the person in novel poses, including clothing, hair, and plausible in-filling of unseen regions.

This disentangled approach allows our method to generate a sequence of images that are faithful to the target motion in 3D pose and to the input image in visual appearance. In addition, the 3D control allows rendering the person along various synthetic camera trajectories. Our experiments show that, compared to prior methods, our approach is more resilient at generating prolonged motions and varied, challenging, and complex poses.
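To make the two-stage design concrete, here is a minimal, runnable sketch of the inference flow. Both "models" below are trivial stand-ins for the paper's diffusion models, and every name (stage1_texture_infill, stage2_render, the UV resolution) is our own assumption rather than the authors' API.

import numpy as np

UV = 256  # texture-map (UV) resolution, assumed for illustration


def stage1_texture_infill(partial_tex: np.ndarray, visible: np.ndarray) -> np.ndarray:
    """Stage 1 stand-in: hallucinate texels not seen in the input photo.
    The paper uses an in-filling diffusion model in texture-map space; here
    we just fill unseen texels with the mean visible color so the sketch
    runs end to end."""
    out = partial_tex.copy()
    if visible.any():
        out[~visible] = partial_tex[visible].mean(axis=0)
    return out


def stage2_render(full_tex: np.ndarray, pose: np.ndarray) -> np.ndarray:
    """Stage 2 stand-in: render the textured 3D human in a target pose.
    The paper conditions a diffusion-based renderer on an intermediate
    rendering of the posed, textured mesh; we return a dummy frame."""
    frame = np.zeros((512, 512, 3), dtype=np.float32)
    frame[:UV, :UV] = full_tex  # placeholder for rasterizing the posed mesh
    return frame


# One input photo -> partial texture map + visibility mask (from 3D fitting).
partial_tex = np.random.rand(UV, UV, 3).astype(np.float32)
visible = np.random.rand(UV, UV) > 0.5               # texels seen in the photo

full_tex = stage1_texture_infill(partial_tex, visible)
poses = [np.random.randn(24, 3) for _ in range(8)]   # e.g. SMPL-style joint angles
video = [stage2_render(full_tex, p) for p in poses]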

Video

3DHM Training Features

💡 The 3DHM training pipeline (for both stages) is self-supervised.
💡 3DHM does not use any additional annotations. It is trained with pseudo-ground-truth obtained from cutting-edge software that can detect, segment, track, and 3Dfy humans (4DHumans); see the sketch after this list.
💡 3DHM is scalable: it can readily be scaled up in the future given additional videos of humans in motion and more computing resources.
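To make the self-supervision concrete, here is a minimal, runnable sketch of how pseudo-ground-truth pairs could be mined from raw video. The helper names (track_and_fit_3d, mine_training_pairs) and the dummy tensors are our illustration, not 3DHM's actual code; the real pipeline relies on off-the-shelf detection, segmentation, tracking, and 3D fitting.

import numpy as np


def track_and_fit_3d(frame: np.ndarray):
    """Placeholder for detect + segment + track + 3D fitting on one frame."""
    pose = np.random.randn(24, 3)              # dummy 3D body pose
    partial_tex = np.random.rand(256, 256, 3)  # texels visible in this frame
    visible = np.random.rand(256, 256) > 0.5   # visibility mask
    return pose, partial_tex, visible


def mine_training_pairs(video: list) -> list:
    """Each frame supervises itself: the visible texture and pose are the
    inputs, and the real frame is the target. No manual annotation needed."""
    pairs = []
    for frame in video:
        pose, partial_tex, visible = track_and_fit_3d(frame)
        pairs.append({
            "input_texture": partial_tex,  # Stage-1 input (partial UV map)
            "visibility": visible,
            "pose": pose,                  # Stage-2 conditioning
            "target_frame": frame,         # pseudo-ground-truth supervision
        })
    return pairs


video = [np.random.rand(512, 512, 3) for _ in range(4)]  # dummy clip
dataset = mine_training_pairs(video)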

3DHM Key Features

💡 Various Camera Viewpoints.
💡 Motions from Text.
💡 Motions from Random Videos (Various 3D poses).
💡 Various Camera Azimuths.
💡 Long-range Motions.
💡 Challenging Motions.
💡 Animations from just the Back View.

3DHM Enables Long-range Human Video Generation

3DHM Enables Random Camera Azimuth

With 3D control, you can synthesize moving humans from varied camera azimuths.

From left to right: Actor, Input Photo, Imitator (from the original camera azimuth), and Imitator (from varied camera azimuths).
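The mechanism behind this is simple once the person is represented in 3D: rotate the recovered body (or, equivalently, the camera) about the vertical axis before rendering. The sketch below assumes SMPL-style vertices and a y-up yaw convention, both our own assumptions; the same idea yields the full 360-degree camera viewpoints shown later.

import numpy as np


def yaw_rotation(azimuth_deg: float) -> np.ndarray:
    """3x3 rotation about the vertical (y) axis."""
    a = np.deg2rad(azimuth_deg)
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0.0, s],
                     [0.0, 1.0, 0.0],
                     [-s, 0.0, c]])


# Posed mesh vertices from the 3D pipeline (dummy values here).
vertices = np.random.randn(6890, 3)  # 6890 = SMPL vertex count

for azimuth in (0, 90, 180, 270):
    rotated = vertices @ yaw_rotation(azimuth).T
    # the rotated, textured mesh would then be passed to the Stage-2 renderer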

3DHM Enables Animations from just the Back View

Using 3D control, you can synthesize human videos from a photo showing only the person's back.

3DHM Enables Human Motion from Text

With 3D control, you can synthesize moving people from text input.

Text Input: A person turns to his right and paces back and forth.
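Any off-the-shelf text-to-motion model can drive 3DHM, since the renderer only needs a sequence of 3D poses. In the hedged glue sketch below, text_to_motion is a hypothetical stand-in, not a model shipped with 3DHM.

import numpy as np


def text_to_motion(prompt: str, n_frames: int = 60) -> list:
    """Placeholder for a text-to-motion generator returning 3D poses."""
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return [rng.standard_normal((24, 3)) for _ in range(n_frames)]


poses = text_to_motion("A person turns to his right and paces back and forth.")
# `poses` would then condition 3DHM's Stage-2 renderer, as in the earlier sketch.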

3DHM Enables 360-Degree Camera Viewpoints

With 3D control, you can synthesize moving humans from different camera viewpoints.

3DHM Enables Challenging Motions

With 3D control, you can synthesize moving humans performing challenging motions.
Note how the body moves, paying special attention to the legs.

More Animations

Make the man dance like the performer

Make the woman walk like the performer

Skating

Ballet

Online Relaxation Course

Dance like Michael Jackson!

Dance like Gene Kelly!

Skate like Michelle Kwan!

BibTeX

@article{li20243dhm,
    author = {Li, Boyi and Rajasegaran, Jathushan and Gandelsman, Yossi and Efros, Alexei A. and Malik, Jitendra},
    title = {Synthesizing Moving People with 3D Control},
    journal = {arXiv},
    year = {2024},
}