CamTrol

Training-free Camera Control for Video Generation

Chen Hou1, Guoqiang Wei2, Yan Zeng2, Zhibo Chen1

1University of Science and Technology of China, 2ByteDance

A robust, plug-and-play solution that offers camera control for video diffusion models.

Basic Camera Trajectories

Zoom Out Pan Left Tilt Up Truck Right Roll CW
Zoom In Pan Right Tilt Down Pedestal Down Roll ACW
CamTrol produces highly dynamic videos with the designated camera moves. No training on specific data is required.

Hybrid and Complex Trajectories

Hybrid: Zoom In first, then Pedestal Up.
Hybrid: Zoom Out + Pedestal Up + Truck Left + Tilt Down + Pan Right
Complex Trajectory I
Complex Trajectory II
Complex Trajectory III
By combining basic moves, CamTrol can handle more complicated camera motions and generate videos with cinematic charm. Benefiting from explicit camera motion modeling, CamTrol can also load pre-defined complex trajectories specified in precise coordinates, as in the sketch below.
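As an illustration of what a trajectory in precise coordinates can look like, the sketch below composes a hybrid move (zoom in, then pedestal up) as a sequence of camera-to-world extrinsic matrices. The helper names and frame conventions here are assumptions made for exposition, not CamTrol's released interface.

    import numpy as np

    def axis_aligned_pose(position):
        # Assumed convention: camera axes stay world-aligned; only the
        # camera position changes between frames.
        pose = np.eye(4)
        pose[:3, 3] = position
        return pose

    def hybrid_zoom_then_pedestal(num_frames=16, zoom_depth=0.5, pedestal_height=0.3):
        # First half: zoom in (move the camera forward along -z),
        # second half: pedestal up (move the camera along +y).
        half = num_frames // 2
        poses = []
        for i in range(num_frames):
            if i < half:
                t = np.array([0.0, 0.0, -zoom_depth * i / max(half - 1, 1)])
            else:
                j = i - half
                t = np.array([0.0,
                              pedestal_height * j / max(num_frames - half - 1, 1),
                              -zoom_depth])
            poses.append(axis_aligned_pose(t))
        return np.stack(poses)  # (num_frames, 4, 4) extrinsics, one per rendered view

    trajectory = hybrid_zoom_then_pedestal()
    print(trajectory.shape)  # (16, 4, 4)

Any pre-defined trajectory, for example one exported from a camera path tool, can be expressed in the same (num_frames, 4, 4) form.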

3D Rotation-like Generation

Rotate Anticlockwise
Rotate Clockwise
CamTrol produces impressive 3D rotation-like videos. Compared to 3D generation models, these results show greater stylistic diversity and contain dynamic content.

CamTrol can also handle 3D object generation. From this perspective, CamTrol can be seen as an infinite source of 3D data. The examples shown are from OmniObject3D.

Motions at Different Scales

Scale I
Scale II
Scale III
CamTrol supports camera movements at controllable scales.
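One simple way to realize a controllable scale, shown here as an assumption for illustration rather than CamTrol's actual parameterization, is to multiply the translation component of every pose in the trajectory by a scale factor.

    import numpy as np

    def scale_trajectory(poses, scale):
        # poses: (num_frames, 4, 4) camera-to-world matrices.
        # scale < 1 yields a subtler camera move, scale > 1 a larger one.
        scaled = poses.copy()
        scaled[:, :3, 3] *= scale
        return scaled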

vs. Prompt Engineering

"A dragon..." "People linger..." "A ceramic cat..." "Roses and birds..." "Volcano explodes..."
+"camera zooms in" +"rotates clockwise" +"rotates anticlockwise" +"pans right" +"zooms in, pedestal up"
+CamTrol +CamTrol +CamTrol +CamTrol +CamTrol
Compared with prompt engineering, CamTrol achieves more accurate camera motion control.

Method


CamTrol consists of a two-stage process. In Stage I, camera movement is modeled through an explicit 3D point cloud, producing renderings that reflect the specified camera motion. In Stage II, the layout prior of noisy latents is utilized to guide video generation.
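A minimal sketch of this pipeline is given below. The callables lift_to_point_cloud, render_views, encode, and sample_from are hypothetical placeholders for the single-image-to-point-cloud step, the renderer, the latent encoder, and the frozen video diffusion sampler; the partial noising shown is only one way to read "layout prior of noisy latents", not the released implementation.

    import torch

    def noisy_layout_prior(render_latents, alphas_cumprod, t_start):
        # Partially noise the latents of the Stage-I renderings with the standard
        # diffusion forward process, so their spatial layout survives as a prior
        # while later denoising can add dynamics and fill disoccluded regions.
        alpha_bar = alphas_cumprod[t_start]
        noise = torch.randn_like(render_latents)
        return alpha_bar.sqrt() * render_latents + (1.0 - alpha_bar).sqrt() * noise

    def camtrol_pipeline(image, trajectory,
                         lift_to_point_cloud, render_views, encode, sample_from):
        # Stage I: explicit camera-motion modeling.
        cloud = lift_to_point_cloud(image)         # single image -> 3D point cloud
        renders = render_views(cloud, trajectory)  # one rendering per camera pose
        # Stage II: layout prior of noisy latents guides the frozen video model.
        latents = encode(renders)                  # (frames, channels, h, w)
        betas = torch.linspace(1e-4, 2e-2, 1000)   # assumed linear noise schedule
        alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
        noisy = noisy_layout_prior(latents, alphas_cumprod, t_start=600)
        return sample_from(noisy, t_start=600)     # resume denoising from t_start

Because the video model itself is untouched, this structure is what makes the approach training-free and plug-and-play.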

Paper: https://arxiv.org/abs/2406.10126