Generating a dynamic 3D object from a single-view video is challenging due to the lack of 4D labeled data. An intuitive approach is to extend previous image-to-3D pipelines by transferring off-the-shelf image generation models through techniques such as score distillation sampling (SDS). However, this approach is slow and expensive to scale because the information-limited supervision signals must be back-propagated through a large pretrained model. To address this, we propose an efficient video-to-4D object generation framework called Efficient4D. It generates high-quality, spacetime-consistent images under different camera views, and then uses them as labeled data to directly reconstruct the 4D content through a 4D Gaussian splatting model. Importantly, our method can achieve real-time rendering under continuous camera trajectories. To enable robust reconstruction under sparse views, we introduce an inconsistency-aware confidence-weighted loss design, along with a lightly weighted score distillation loss. Extensive experiments on both synthetic and real videos show that Efficient4D offers a remarkable 10-fold increase in speed compared to prior-art alternatives while preserving the quality of novel view synthesis. For example, Efficient4D takes only 10 minutes to model a dynamic object, versus 120 minutes for the prior-art model Consistent4D.
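As a rough illustration of the loss design mentioned above, the sketch below combines a per-pixel confidence-weighted reconstruction term with a lightly weighted score distillation term. The abstract does not give the formula, so the L1 form, the averaging over pixels, the function names, and the default `sds_weight=0.01` are all assumptions made for illustration, not the paper's definition.

```python
from typing import Sequence


def confidence_weighted_l1(
    pred: Sequence[float],
    target: Sequence[float],
    confidence: Sequence[float],
) -> float:
    """Mean over pixels of confidence * |pred - target|.

    Assumption: `confidence` holds one weight per pixel (for example in [0, 1]),
    where a low value marks a pixel judged inconsistent across generated views.
    """
    if not (len(pred) == len(target) == len(confidence)):
        raise ValueError("pred, target and confidence must have the same length")
    if len(pred) == 0:
        raise ValueError("inputs must not be empty")
    total = sum(c * abs(p - q) for p, q, c in zip(pred, target, confidence))
    return total / len(pred)


def total_loss(recon_loss: float, sds_loss: float, sds_weight: float = 0.01) -> float:
    """Reconstruction term plus a lightly weighted SDS term.

    Assumption: 0.01 is a hypothetical default, chosen only to show that the
    SDS term is meant to contribute much less than the reconstruction term.
    """
    return recon_loss + sds_weight * sds_loss
```

Down-weighting pixels with low confidence means that images which disagree across generated views contribute less to the gradient, which is one plausible reading of "inconsistency-aware".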
Given as input (a) a brief video depicting a dynamic object from a single perspective, our model aims to generate this object with geometric and temporal consistency under any specified view and time. Efficient4D comprises two components: (A) image sequence synthesis through (b) time-synchronous spatial volumes, resulting in (c) an image matrix where each row consists of multi-view geometrically consistent images and each column consists of view-specific temporally consistent images; and (B) 4D reconstruction using the images generated in (A). The 4D Gaussian representation can be trained efficiently and robustly under the confidence-weighted loss and the low-weighted SDS loss.
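The layout of the image matrix in (c) can be sketched as a nested list indexed by time and then by view. This is only an illustration of the indexing: the string placeholders stand in for generated images, and the function names are hypothetical rather than taken from the Efficient4D code.

```python
from typing import List


def build_image_matrix(num_frames: int, num_views: int) -> List[List[str]]:
    """Return a placeholder matrix where entry [t][v] is the image at time t, view v."""
    return [
        [f"img_t{t}_v{v}" for v in range(num_views)]
        for t in range(num_frames)
    ]


def views_at_time(matrix: List[List[str]], t: int) -> List[str]:
    """One row: all camera views at a fixed time (geometric consistency)."""
    return list(matrix[t])


def frames_for_view(matrix: List[List[str]], v: int) -> List[str]:
    """One column: all time steps for a fixed camera view (temporal consistency)."""
    return [row[v] for row in matrix]


matrix = build_image_matrix(num_frames=3, num_views=4)
print(views_at_time(matrix, 1))    # images of the object at time 1 from every view
print(frames_for_view(matrix, 2))  # images from view 2 across all time steps
```

Under this layout, the reconstruction stage in (B) can treat every entry as a labeled sample with a known time step and camera view.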
@article{pan2024fast,
title={Fast Dynamic 3D Object Generation from a Single-view Video},
author={Pan, Zijie and Yang, Zeyu and Zhu, Xiatian and Zhang, Li},
journal={arXiv preprint arXiv:2401.08742},
year={2024}
}