Unified 3D Understanding and Generation via Geometric-Semantic Encoding
Despite the impressive progress of recent unified architectures in image understanding and generation, the integration of 3D tasks remains challenging and largely unexplored. In this paper, we introduce UniUGG, the first unified understanding and generation framework for 3D modalities. Our framework employs an LLM to comprehend and decode sentences and 3D representations. At its core, we propose a spatial decoder that leverages a latent diffusion model to generate high-quality 3D representations. This enables the generation and imagination of 3D scenes from a reference image and an arbitrary view transformation, while retaining support for spatial visual question answering (VQA) tasks. Additionally, we propose a geometric-semantic learning strategy to pretrain the vision encoder, which jointly captures the input's semantic and geometric cues, enhancing both spatial understanding and generation. Extensive experimental results demonstrate the superiority of our method in visual representation, spatial understanding, and 3D generation.
UniUGG accurately captures the input view transformation, leverages the reference image to ‘imagine’ fine-grained spatial structures under novel views, and produces correct captions.
Given a reference image, we randomly sample plausible relative view transformations and let UniUGG generate the corresponding 3D scenes, which it then captions.
UniUGG can capture fine-grained spatial relations and support spatial visual question answering (VQA) tasks.
3D understanding performance on various spatial reasoning benchmarks.
Method | VSI | BLINK | 3DSR | SPAR Low | SPAR Med. | SPAR High | SPAR Avg. |
---|---|---|---|---|---|---|---|
📂 Open-source models | | | | | | | |
LLaVA-v1.5-7B | 18.0 | 37.1 | 38.1 | 10.9 | 26.5 | 34.1 | 23.7 |
LLaVA-NeXT-7B | 20.6 | 41.8 | 48.4 | 8.5 | 4.8 | 20.2 | 13.2 |
InternVL2.5-8B | 32.5 | 54.8 | 50.9 | 29.5 | 31.9 | 43.8 | 36.3 |
Qwen2.5-VL-7B | 30.3 | 56.4 | 48.4 | 28.8 | 23.0 | 40.3 | 33.1 |
☁️ API models | | | | | | | |
GPT-4o | 34.0 | 60.0 | 44.2 | 36.9 | 26.5 | 43.8 | 38.1 |
🔄 Unified understanding-generation models | | | | | | | |
Janus-Pro-1B | - | 38.9 | 50.0 | 10.7 | 24.7 | 30.8 | 20.6 |
Janus-Pro-7B | - | 40.5 | 53.7 | 27.3 | 24.6 | 33.9 | 28.6 |
UniUGG-3B (Ours) | 40.1 | 43.6 | 52.1 | 50.8 | 41.7 | 45.7 | 47.2 |
Quantitative spatial generation comparison on ARKitScenes and ScanNet++ datasets.
Method | ARKitScenes FID↓ | ARKitScenes KID↓ | ARKitScenes LPIPS↓ | ScanNet++ FID↓ | ScanNet++ KID↓ | ScanNet++ LPIPS↓ |
---|---|---|---|---|---|---|
🧪 Ablation settings | | | | | | |
w/ RADIO | 64.16 | .0518 | .4904 | 73.69 | .0614 | .4629 |
w/ MASt3R Enc. | 81.18 | .0691 | .5076 | 86.79 | .0803 | .5242 |
w/o Dec. finetune | 149.97 | .1447 | .5301 | 168.05 | .1686 | .4945 |
w/o Diff. | 87.51 | .0672 | .4494 | 114.93 | .0955 | .4345 |
📏 Baselines | | | | | | |
CUT3R | 138.54 | .1128 | .5758 | 130.76 | .1051 | .5637 |
LVSM | 269.45 | .3088 | .5067 | 414.63 | .5117 | .5865 |
UniUGG (Ours) | 55.01 | .0425 | .4849 | 55.64 | .0442 | .4263 |
We introduce a novel geometric-semantic vision encoder pretraining strategy.
(a) During semantic guiding, our student encoder learns to mimic the teacher's visual representations.
(b) In spatial representation learning, the spatial decoder jointly refines predictions using information from both views.
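The sketch below illustrates how such a two-part pretraining objective could be wired together: a frozen teacher supplies the semantic target while a cross-view spatial decoder supplies the geometric one. The module names, the MSE distillation loss, and the per-pixel 3D-point regression target are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of a geometric-semantic pretraining objective.
# StudentEncoder/SpatialDecoder interfaces and the loss choices are
# placeholders for illustration; the paper's exact losses may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricSemanticPretrainer(nn.Module):
    def __init__(self, student: nn.Module, teacher: nn.Module, spatial_decoder: nn.Module):
        super().__init__()
        self.student = student                  # vision encoder being pretrained
        self.teacher = teacher.eval()           # frozen semantic teacher (e.g., a RADIO-like model)
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.spatial_decoder = spatial_decoder  # cross-view decoder predicting geometry

    def forward(self, view_a, view_b, gt_points_a, gt_points_b):
        feat_a, feat_b = self.student(view_a), self.student(view_b)

        # (a) Semantic guiding: mimic the teacher's visual representations.
        with torch.no_grad():
            t_a, t_b = self.teacher(view_a), self.teacher(view_b)
        loss_sem = F.mse_loss(feat_a, t_a) + F.mse_loss(feat_b, t_b)

        # (b) Spatial representation learning: the spatial decoder fuses both
        # views and regresses per-pixel 3D points (one plausible geometric target).
        pred_a, pred_b = self.spatial_decoder(feat_a, feat_b)
        loss_geo = F.l1_loss(pred_a, gt_points_a) + F.l1_loss(pred_b, gt_points_b)

        return loss_sem + loss_geo
```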
(a) In the latent token learning stage, the visual representation is compressed by the Spatial-VAE, and the spatial decoder is connected to it for fine-tuning.
(b) In the unified learning stage, the reference image’s visual representation and the view transformation are fed into an LLM, which outputs conditional features for noise prediction on the latent tokens. The LLM is also trained on VQA tasks to maintain its understanding capability.
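As a rough illustration of this stage, the following sketch shows one training step under a standard DDPM-style noise-prediction objective on the Spatial-VAE latents, with the LLM output acting as the condition. All interfaces (spatial_vae, llm, denoiser, scheduler) are hypothetical placeholders, not the released implementation.

```python
# Minimal sketch of one unified-learning training step, assuming a
# noise-prediction (epsilon) objective on Spatial-VAE latent tokens.
import torch
import torch.nn.functional as F

def unified_generation_step(spatial_vae, llm, denoiser, scheduler,
                            ref_feat, view_transform, target_feat):
    # Compress the target view's visual representation into latent tokens.
    z0 = spatial_vae.encode(target_feat)

    # The LLM consumes the reference representation and the relative view
    # transformation and emits conditional features for the diffusion model.
    cond = llm(ref_feat, view_transform)

    # Standard noise-prediction loss on the latent tokens.
    noise = torch.randn_like(z0)
    t = torch.randint(0, scheduler.num_steps, (z0.shape[0],), device=z0.device)
    zt = scheduler.add_noise(z0, noise, t)
    pred_noise = denoiser(zt, t, cond)
    return F.mse_loss(pred_noise, noise)
```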
(a) 3D generation is achieved by synthesizing the target view’s visual representation with the LLM and the diffusion model.
(b) The LLM performs VQA using visual representations as input, whether generated or real.
(c) The visual representations of both the target and reference views are fed into the pretrained spatial decoder to decode the 3D scene.
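A minimal inference sketch of this pipeline is shown below: the LLM conditions a diffusion sampler to produce the target view's latent, which is decoded back to a visual representation and, together with the reference view, passed to the spatial decoder. The sampling loop, latent shape, and all interfaces are illustrative placeholders and do not reflect the released API.

```python
# Minimal inference sketch for generating a 3D scene from a reference image
# and a relative view transformation (placeholder interfaces throughout).
import torch

@torch.no_grad()
def generate_3d_scene(encoder, llm, denoiser, scheduler, spatial_vae,
                      spatial_decoder, ref_image, view_transform):
    ref_feat = encoder(ref_image)                    # reference-view representation
    cond = llm(ref_feat, view_transform)             # conditional features from the LLM

    # Sample latent tokens for the target view with the diffusion model
    # (placeholder latent shape and scheduler API).
    z = torch.randn(cond.shape[0], *spatial_vae.latent_shape, device=cond.device)
    for t in scheduler.timesteps:
        z = scheduler.step(denoiser(z, t, cond), t, z)

    target_feat = spatial_vae.decode(z)              # generated target-view representation

    # Decode the 3D scene from both views with the pretrained spatial decoder;
    # the same representations can also be fed back to the LLM for VQA/captioning.
    return spatial_decoder(ref_feat, target_feat)
```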
If you find our work helpful, please consider citing us:
@article{xu2025uniugg,
  title={UniUGG: Unified 3D Understanding and Generation via Geometric-Semantic Encoding},
  author={Xu, Yueming and Zhang, Jiahui and Huang, Ze and Chen, Yurui and Zhou, Yanpeng and Chen, Zhenyu and Yuan, Yujie and Xia, Pengxiang and Huang, Guowei and Cai, Xinyue and Qi, Zhongang and Quan, Xingyue and Hao, Jianye and Xu, Hang and Zhang, Li},
  year={2025},
  journal={arXiv preprint arXiv:2508.11952},
}