Generating a dynamic three-dimensional (3D) object from a single-view video is challenging due to the lack of 4D labeled data. Existing methods extend text-to-3D pipelines by leveraging off-the-shelf image generation models through techniques such as score distillation sampling, but they are slow and expensive to scale (e.g., 150 minutes per object) because the information-limited supervision signals must be back-propagated through a large pretrained model. To address this limitation, we propose an efficient video-to-4D object generation framework called Efficient4D. It generates high-quality spacetime-consistent images under different camera views, and then uses them as labeled data to directly train a novel 4D Gaussian splatting model with explicit point cloud geometry, enabling real-time rendering under continuous camera trajectories. Extensive experiments on synthetic and real videos show that Efficient4D offers a remarkable 10-fold speedup over prior-art alternatives while preserving the same level of novel view synthesis quality. For example, Efficient4D takes only 14 minutes to model a dynamic object.
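To make the explicit point cloud geometry concrete, below is a minimal, hypothetical PyTorch sketch of a 4D (space-time) Gaussian point container. The class and attribute names (`Gaussians4D`, `means`, `log_scales`) and the simplified per-axis scale parameterisation are our own assumptions for illustration; the paper's actual model may use full covariances and richer appearance terms.

```python
import torch
import torch.nn as nn

class Gaussians4D(nn.Module):
    """Illustrative explicit 4D Gaussian point cloud (simplified parameterisation)."""

    def __init__(self, num_points: int):
        super().__init__()
        self.means = nn.Parameter(torch.randn(num_points, 4))       # space-time mean (x, y, z, t)
        self.log_scales = nn.Parameter(torch.zeros(num_points, 4))  # per-axis extent (log-space)
        self.colors = nn.Parameter(torch.rand(num_points, 3))       # RGB in [0, 1]
        self.opacity_logits = nn.Parameter(torch.zeros(num_points, 1))

    def opacities(self) -> torch.Tensor:
        # Map unconstrained logits to [0, 1] opacities.
        return torch.sigmoid(self.opacity_logits)
```

Because every point is an explicit, differentiable parameter rather than a weight inside a large pretrained network, the representation can be optimised directly against rendered-versus-generated image losses and rendered in real time.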
Given a brief single-view video of a dynamic object as input, our model aims to generate this object with geometric and temporal consistency under any specified view and time. Efficient4D comprises two components: (A) image sequence synthesis across views and timestamps, resulting in (c) an image matrix where each row consists of multi-view geometrically consistent images and each column consists of view-specific temporally consistent images; and (B) a novel 4D Gaussian representation model (d) that represents the scene with a set of Gaussian points and can be trained efficiently and robustly under confidence-aware (e) supervision on the generated image matrix.
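One plausible form of the confidence-aware supervision is to weight a per-pixel reconstruction loss by a confidence map reflecting how reliable the generated images are at each pixel. The sketch below is an illustrative assumption, not the exact loss from the paper; the function name and tensor shapes are ours.

```python
import torch

def confidence_weighted_l1(rendered: torch.Tensor,
                           generated: torch.Tensor,
                           confidence: torch.Tensor) -> torch.Tensor:
    """Illustrative confidence-aware reconstruction loss.

    rendered:   (B, 3, H, W) images rendered from the 4D Gaussian model
    generated:  (B, 3, H, W) images from the synthesized image matrix
    confidence: (B, 1, H, W) per-pixel weights in [0, 1]; low values mark
                regions where the generated supervision is less reliable
    """
    residual = (rendered - generated).abs()   # standard per-pixel L1 residual
    return (confidence * residual).mean()     # down-weight unreliable pixels
```

Weighting the residual this way lets the optimisation tolerate occasional inconsistencies in the generated image matrix instead of forcing the Gaussians to fit unreliable pixels.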
@article{pan2024fast,
title={Fast Dynamic 3D Object Generation from a Single-view Video},
author={Pan, Zijie and Yang, Zeyu and Zhu, Xiatian and Zhang, Li},
journal={arXiv preprint arXiv:2401.08742},
year={2024}
}