UniUGG

Unified 3D Understanding and Generation via Geometric-Semantic Encoding

Yueming Xu1*, Jiahui Zhang1*, Ze Huang1*, Yurui Chen1, Yanpeng Zhou2, Zhenyu Chen2, Yu-Jie Yuan2, Pengxiang Xia2, Guowei Huang2, Xinyue Cai2, Zhongang Qi2, Xingyue Quan2, Jianye Hao2, Hang Xu2, Li Zhang1†
* Equal contribution    † Corresponding author
1 Fudan University 2 Noah’s Ark Lab
Teaser Figure

Overview of our UniUGG, the first unified framework for spatial understanding and generation. (A) UniUGG supports spatial-level VQA and generates geometrically consistent 3D scenes. (B) Given a reference image, it can creatively generate 3D variations and describe them accurately. (C) UniUGG outperforms baselines in both spatial understanding and generation, with our specially tuned vision encoder excelling in downstream tasks.


Given a reference image and the relative view transformation, UniUGG can generate the corresponding 3D scenes.
Here are the pointmaps of the reference views.

📄 Abstract

Despite the impressive progress of recent unified architectures in understanding and generating images, the integration of 3D tasks remains challenging and largely unexplored. In this paper, we introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our framework employs an LLM to comprehend and decode sentences and 3D representations. At its core, we propose a spatial decoder that leverages a latent diffusion model to generate high-quality 3D representations. This allows UniUGG to generate and imagine 3D scenes from a reference image and an arbitrary view transformation, while retaining support for spatial visual question answering (VQA) tasks. Additionally, we propose a geometric-semantic learning strategy to pretrain the vision encoder. This design jointly captures the input's semantic and geometric cues, enhancing both spatial understanding and generation. Extensive experimental results demonstrate the superiority of our method in visual representation, spatial understanding, and 3D generation.

🎞️ Video

⚖️ 3D generation comparison

UniUGG accurately captures the input view transformation, leverages the reference image to ‘imagine’ fine-grained spatial structures under novel views, and produces accurate captions. A sketch of how the relative view transformations are parameterized follows below.
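The view-transformation labels in the examples ("Yaw right 40°", "Pitch down 40°, Yaw left 40°") denote relative camera rotations with respect to the reference view. As a rough illustration only, assuming a camera-centric yaw/pitch convention and composition order that may differ from the one actually used by UniUGG, such a relative pose can be assembled as a 4×4 transform:

```python
import numpy as np

def yaw(deg):
    """Rotation about the camera's up (y) axis; positive = turn right (assumed convention)."""
    a = np.radians(deg)
    return np.array([[ np.cos(a), 0, np.sin(a)],
                     [ 0,         1, 0        ],
                     [-np.sin(a), 0, np.cos(a)]])

def pitch(deg):
    """Rotation about the camera's right (x) axis; positive = tilt down (assumed convention)."""
    a = np.radians(deg)
    return np.array([[1, 0,          0         ],
                     [0, np.cos(a), -np.sin(a)],
                     [0, np.sin(a),  np.cos(a)]])

def relative_pose(R, t=np.zeros(3)):
    """Pack a rotation R and translation t into a 4x4 homogeneous transform."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# e.g. the "Pitch down 40°, Yaw left 40°" case (composition order is a convention choice)
T_rel = relative_pose(pitch(40) @ yaw(-40))
```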

Reference image
Reference
Yaw right 40°
CUT3R
Caption
A bathroom scene featuring a window with a small shelf. Below the shelf, a bathtub is partially visible. The walls are tiled, and the overall layout shows the window as the back.
UniUGG
Reference image
Reference
Yaw right 40°
CUT3R
Caption
A window with two panes is centered in the image, flanked by two frosted glass panels. On the windowsill, there is a small decorative item.
UniUGG
Reference image
Reference
Pitch down 40°, Yaw left 40°
CUT3R
Caption
A gray sectional sofa is positioned to the right, with a small table in front of it holding an item. In the background, a dining area is visible.
UniUGG

🔍 3D scene generations and captions

Given a reference image, we randomly sample plausible relative view transformations, let UniUGG generate the corresponding 3D scenes, and then caption the generated scenes.

Reference image
Reference Image
Random seed 1
Random seed 2
Caption
A kitchen countertop with a double sink and a faucet, featuring a dish rack with dishes and a container to the right. A kettle sits on the counter to the left, with a window above the sink allowing natural light to enter.
Caption
A kitchen sink area with a dish rack on the right holding a plate of food. To the left, there are cabinets and a countertop with items near the window.
Reference image
Reference Image
Random seed 1
Random seed 2
Caption
A neatly made bed with white linens and a gray headboard is centered against a gray upholstered headboard. Above the bed, a framed artwork hang on the wall, and a small bedside table with a lamp is visible to the right.
Caption
A neatly made bed with white linens and pillows are positioned against a wall, with a gray upholstered headboard. Above the bed, a framed artwork hang on the wall, and a small bedside table with a lamp is visible to the right.

💬 Spatial understanding

UniUGG can capture fine-grained spatial relations and support spatial visual question answering (VQA) tasks.

👤 Question:

These are frames of a video. How many ceiling light(s) are in this room?

🤖 Answer:

The video reveals a total of 2 ceiling light occupying the room.

Image 1 Image 2 Image 3
👤 Question:

In the given three images, what is the spatial alignment of the shoe (red bbox) relative to the plant pot (blue bbox) as perceived by the observer? Set the first image as the observer’s perspective.

🤖 Answer:

In this arrangement, the shoe (red bbox) is positioned to the left below the plant pot (blue bbox). From the observer’s perspective, it looks farther. The shoe (red bbox) and the plant pot (blue bbox) are in front of the observer.

Image 1 Image 2
👤 Question:

The images are frames from a video. The video is shooting a static scene. The camera is either moving clockwise (left) or counter-clockwise (right) around the object. The first image is from the beginning of the video and the second image is from the end. Is the camera moving left or right when shooting the video?

🤖 Answer:

From the observer's perspective, the camera transitions left.

Image 1
👤 Question:

Consider the real-world 3D locations of the objects. Which object has a lower location?
A: fan
B: people

🤖 Answer:

The answer is A. The object fan is located at a lower height in the real world.

🎯 Evaluation on spatial understanding and 3D generation

💬 Spatial understanding

3D understanding performance on various spatial reasoning benchmarks.

Method VSI BLINK 3DSR SPAR (Low / Med. / High / Avg.)
📂 Open-source models
LLaVA-v1.5-7B 18.0 37.1 38.1 10.9 26.5 34.1 23.7
LLaVA-NeXT-7B 20.6 41.8 48.4 8.5 4.8 20.2 13.2
InternVL2.5-8B 32.5 54.8 50.9 29.5 31.9 43.8 36.3
Qwen2.5-VL-7B 30.3 56.4 48.4 28.8 23.0 40.3 33.1
☁️ API models
GPT-4o 34.0 60.0 44.2 36.9 26.5 43.8 38.1
🔄 Unified understanding-generation models
Janus-Pro-1B - 38.9 50.0 10.7 24.7 30.8 20.6
Janus-Pro-7B - 40.5 53.7 27.3 24.6 33.9 28.6
UniUGG-3B (Ours) 40.1 43.6 52.1 50.8 41.7 45.7 47.2

📦 3D generation

Quantitative spatial generation comparison on ARKitScenes and ScanNet++ datasets.

Method ARKitScenes (FID↓ / KID↓ / LPIPS↓) ScanNet++ (FID↓ / KID↓ / LPIPS↓)
🧪 Ablation setting
w/ RADIO 64.16 .0518 .4904 73.69 .0614 .4629
w/ MASt3R Enc. 81.18 .0691 .5076 86.79 .0803 .5242
w/o Dec. finetune 149.97 .1447 .5301 168.05 .1686 .4945
w/o Diff. 87.51 .0672 .4494 114.93 .0955 .4345
📏 Baselines
CUT3R 138.54 .1128 .5758 130.76 .1051 .5637
LVSM 269.45 .3088 .5067 414.63 .5117 .5865
UniUGG (Ours) 55.01 .0425 .4849 55.64 .0442 .4263
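For context, FID, KID, and LPIPS compare generated views against ground-truth views. The sketch below shows how such metrics can be computed with torchmetrics; it is an assumed setup for illustration (toy batch sizes, random tensors), not the evaluation code behind the numbers above.

```python
import torch
from torchmetrics.image import (FrechetInceptionDistance, KernelInceptionDistance,
                                LearnedPerceptualImagePatchSimilarity)

# Dummy batches of generated vs. ground-truth views, uint8 in [0, 255], NCHW.
# A real evaluation would iterate over the full test split of ARKitScenes / ScanNet++.
fake = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)
real = torch.randint(0, 256, (16, 3, 256, 256), dtype=torch.uint8)

fid = FrechetInceptionDistance(feature=2048)          # distribution-level realism
kid = KernelInceptionDistance(subset_size=8)          # unbiased alternative to FID
lpips = LearnedPerceptualImagePatchSimilarity(net_type="vgg", normalize=True)

fid.update(real, real=True);  fid.update(fake, real=False)
kid.update(real, real=True);  kid.update(fake, real=False)

print("FID:", fid.compute().item())
kid_mean, kid_std = kid.compute()
print("KID:", kid_mean.item())
# LPIPS compares paired images; expects float tensors in [0, 1] when normalize=True.
print("LPIPS:", lpips(fake.float() / 255, real.float() / 255).item())
```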

⚙️ Method overview


🟦 Vision encoder pretraining

We introduce a novel geometric-semantic vision encoder pretraining strategy.
(a) During semantic guidance, our student encoder learns to mimic the teacher's visual representations.
(b) In spatial representation learning, the spatial decoder jointly refines predictions using information from both views.
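A minimal sketch of how the two pretraining objectives could be combined in a single step is given below; the module names (student, teacher, spatial_decoder) and the concrete loss choices are placeholders and may differ from the actual training recipe.

```python
import torch
import torch.nn.functional as F

def pretrain_step(student, teacher, spatial_decoder, view1, view2, gt_pointmaps):
    """One geometric-semantic pretraining step (illustrative only).

    view1 / view2: two RGB views of the same scene, (B, 3, H, W)
    gt_pointmaps:  ground-truth per-pixel 3D points for both views, (B, 2, 3, H, W)
    """
    # (a) Semantic guidance: the student mimics a frozen teacher's patch features.
    with torch.no_grad():
        t_feat = teacher(view1)                              # (B, N, C)
    s_feat1 = student(view1)                                  # (B, N, C)
    loss_sem = 1 - F.cosine_similarity(s_feat1, t_feat, dim=-1).mean()

    # (b) Spatial representation learning: the spatial decoder fuses both views
    #     and regresses pointmaps, supervising the encoder with geometry.
    s_feat2 = student(view2)
    pred_pointmaps = spatial_decoder(s_feat1, s_feat2)        # (B, 2, 3, H, W)
    loss_geo = F.l1_loss(pred_pointmaps, gt_pointmaps)

    return loss_sem + loss_geo
```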


🟨 Spatial understanding and generation training

(a) In the latent token learning stage, the visual representation is compressed by the Spatial-VAE, while the spatial decoder is linked for fine-tuning.
(b) In the unified learning stage, the reference image's visual representation and view transformation are input to an LLM, which outputs conditional features for noise prediction on the latent tokens. The LLM is also trained on VQA data to maintain its understanding capability.
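The two training stages might be wired together roughly as follows; the Spatial-VAE interface, the LLM conditioning call, and the toy noise schedule are placeholders that sketch the data flow described above, not the actual implementation.

```python
import torch
import torch.nn.functional as F

def add_noise(z, noise, t, T=1000):
    """Toy linear noise schedule, purely illustrative."""
    alpha_bar = 1.0 - t.float() / T                            # (B,)
    shape = (-1,) + (1,) * (z.dim() - 1)
    return alpha_bar.sqrt().view(shape) * z + (1 - alpha_bar).sqrt().view(shape) * noise

def latent_token_stage(spatial_vae, spatial_decoder, feat_ref, feat_tgt, gt_pointmaps):
    """Stage (a): compress the visual representation into latent tokens with the
    Spatial-VAE and fine-tune the linked spatial decoder through the reconstruction."""
    z, recon_feat = spatial_vae(feat_tgt)                      # encode + decode target-view features
    pred = spatial_decoder(feat_ref, recon_feat)               # lift ref + reconstructed target to 3D
    return F.mse_loss(recon_feat, feat_tgt) + F.l1_loss(pred, gt_pointmaps)

def unified_stage(llm, denoiser, spatial_vae, feat_ref, view_transform, feat_tgt):
    """Stage (b): the LLM turns the reference representation and view transform
    into conditional features for noise prediction on the target latent tokens.
    (A VQA loss would be trained alongside to preserve understanding ability.)"""
    cond = llm.condition(feat_ref, view_transform)             # conditional features (placeholder API)
    z_tgt = spatial_vae.encode(feat_tgt)                       # latent tokens of the target view
    noise = torch.randn_like(z_tgt)
    t = torch.randint(0, 1000, (z_tgt.shape[0],), device=z_tgt.device)
    z_noisy = add_noise(z_tgt, noise, t)
    return F.mse_loss(denoiser(z_noisy, t, cond), noise)       # epsilon-prediction objective
```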


🟥 Spatial understanding and generation inference

(a) We achieve 3D generation by generating the target-view’s visual representation using the LLM and diffusion model.
(b) The LLM performs VQA using visual representations as input, whether generated or real.
(c) The visual representations of both target and reference views are input to the pretrained spatial decoder to decode the 3D scene.
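Putting the three inference paths together, a sketch with placeholder components and a generic reverse-diffusion loop (not the actual sampler or API) could look like this:

```python
import torch

@torch.no_grad()
def generate_3d(llm, denoiser, spatial_vae, spatial_decoder,
                feat_ref, view_transform, steps=50):
    """(a) + (c): sample target-view latent tokens conditioned on the LLM output,
    decode them to a visual representation, then lift both views to a 3D scene."""
    cond = llm.condition(feat_ref, view_transform)             # conditional features (placeholder API)
    z = torch.randn(feat_ref.shape[0], *spatial_vae.latent_shape)
    for t in reversed(range(steps)):                           # schematic reverse-diffusion loop
        t_batch = torch.full((z.shape[0],), t)
        z = denoiser.step(z, t_batch, cond)                    # one denoising update (placeholder API)
    feat_tgt = spatial_vae.decode(z)                           # generated target-view representation
    pointmaps = spatial_decoder(feat_ref, feat_tgt)            # (c) decode the 3D scene
    return feat_tgt, pointmaps

@torch.no_grad()
def spatial_vqa(llm, visual_feats, question):
    """(b) spatial VQA over visual representations, whether real or generated."""
    return llm.generate(visual_tokens=visual_feats, prompt=question)
```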

📚 Bibtex

If you find our work helpful, please consider citing us:

@article{xu2025uniugg,
  title={UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding},
  author={Xu, Yueming and Zhang, Jiahui and Huang, Ze and Chen, Yurui and Zhou, Yanpeng and Chen, Zhenyu and Yuan, Yu-Jie and Xia, Pengxiang and Huang, Guowei and Cai, Xinyue and Qi, Zhongang and Quan, Xingyue and Hao, Jianye and Xu, Hang and Zhang, Li},
  year={2025},
  journal={arXiv preprint arXiv:2508.11952},
}