Overview of the SPAR Dataset and Benchmark

Teaser Figure

Overview of our Spatial Perception And Reasoning (SPAR) dataset and benchmark. Our dataset is sourced from 4,500 scenes and comprises 33 spatial tasks spanning single-view, multi-view, and video settings. Our benchmark includes more than 7,000 carefully curated, high-quality samples to comprehensively evaluate the spatial perception and understanding capabilities of existing models.

📄 Paper Summary

Recent advances in vision-language models (VLMs) have greatly improved multimodal understanding, but spatial perception remains a major limitation, especially in complex 3D scenes. To address this, we introduce SPAR-7M, a large-scale dataset built from 3D scenes using a novel 2D annotation pipeline. It covers 33 spatial tasks ranging from perception (e.g., depth, distance) to reasoning (e.g., imagination, object relations) across single-view, multi-view, and video settings.

We further construct SPAR-Bench, a high-quality benchmark of 7,207 human-verified QA pairs across 20 representative spatial tasks, supporting diverse input modalities. Experiments show that pretraining on SPAR-7M significantly improves model performance on spatial benchmarks, even without explicit 3D representations, revealing the strong potential of 2D-supervised spatial learning.

📦 SPAR-7M & SPAR-Bench

SPAR is a comprehensive spatial QA suite composed of two components: SPAR-7M, a large-scale synthetic dataset for training, and SPAR-Bench, a human-curated benchmark for evaluation. Together, they cover a wide spectrum of 3D spatial understanding tasks across various cognitive levels and modalities.

📚 SPAR-7M

  • ✅ 7M+ QA pairs
  • 🧠 33 spatial task types
  • 🖼️ Input: single-view / multi-view / video
  • 📏 Tasks: depth, distance, matching, imagination...
  • 🏗️ From 4,500 richly annotated 3D scenes

🧪 SPAR-Bench

  • 📝 7,207 QA samples
  • 👀 Manually verified high-quality questions
  • 🧩 20 representative task types
  • 🎯 Zero-shot benchmark for VLMs
  • 🚀 Evaluates perception, cross-view & reasoning
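
To make the data format concrete, the snippet below sketches what a single SPAR-style QA record might look like. The field names and values are illustrative assumptions for this page, not the released schema; consult the dataset files for the exact layout.

```python
# Hypothetical SPAR-Bench-style record; field names are illustrative, not the released schema.
sample = {
    "task": "object_distance",             # one of the spatial task types
    "level": "low",                         # cognitive level: low / medium / high
    "modality": "multi-view",               # single-view / multi-view / video
    "images": ["scene0001/frame_012.jpg",   # one or more views of the same scene
               "scene0001/frame_045.jpg"],
    "question": "How far apart are the sofa and the table?",
    "answer_format": "fill-in",             # SPAR-Bench uses multiple-choice and fill-in formats
    "answer": "1.8 m",
}
print(sample["question"], "->", sample["answer"])
```
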
Task Distribution

📊 Task Type Distribution in SPAR

This figure summarizes the distribution of all 33 spatial tasks in SPAR, categorized by cognitive level (Low / Medium / High), input modality (single-view / multi-view / video), and task type.

Notably, 51% of SPAR tasks fall into high-level spatial reasoning, indicating a strong focus on compositional and cognitively demanding skills.

🔍 SPAR-7M & SPAR-Bench Visualization

SPAR task visualization

This figure illustrates representative examples from SPAR-7M and SPAR-Bench. It showcases the diversity of spatial tasks across input modalities (single-view, multi-view, video), cognitive levels (perception to reasoning), and answer formats (sentence, choice, fill-in). These examples reflect the broad coverage and structured design of our dataset and benchmark.

🕵️ Explore SPAR-Bench Tasks

This interactive viewer showcases representative samples from the 20 spatial tasks in SPAR-Bench. Each example illustrates different cognitive levels, input modalities, and relation types.

Note that SPAR-Bench only contains multiple-choice and fill-in-the-blank questions. For open-ended or descriptive sentence-based tasks, please refer to the SPAR-7M dataset on Hugging Face.
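
To browse SPAR-7M programmatically, a minimal sketch using the Hugging Face `datasets` library is shown below. The repository id and split name are placeholders (the exact Hub path is not given here), and streaming is used so the full 7M-sample corpus is not downloaded at once; the actual repository layout may require different loading code.

```python
from datasets import load_dataset

# Placeholder repository id and split; substitute the actual SPAR-7M path on the Hugging Face Hub.
ds = load_dataset("YOUR_ORG/SPAR-7M", split="train", streaming=True)

# Peek at a few records to inspect the available fields and task types.
for i, example in enumerate(ds):
    print(sorted(example.keys()))
    if i == 2:
        break
```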

🎯 Benchmark Evaluation

SPAR-Bench evaluates 20 spatial tasks across three cognitive levels: perception (Low), cross-view understanding (Medium), and high-level reasoning (High). It contains 7,207 human-verified QA pairs across diverse modalities.

| Model | Avg (%) ↑ | Low ↑ | Medium ↑ | High ↑ |
| --- | --- | --- | --- | --- |
| 🟤 Baselines | | | | |
| Random | 32.74 | 31.19 | 38.25 | 32.29 |
| Human | 67.27 | 55.31 | 72.32 | 76.22 |
| 🟦 API Models | | | | |
| GPT-4o | 36.39 | 29.25 | 24.93 | 45.11 |
| Claude-3.7-Sonnet | 21.77 | 25.43 | 7.33 | 23.33 |
| Qwen2-VL-72B | 35.62 | 35.28 | 23.39 | 40.00 |
| Qwen2.5-VL-72B | 39.40 | 35.35 | 23.05 | 48.44 |
| 🟨 Open-source Models | | | | |
| InternVL2-8B | 33.02 | 26.83 | 36.49 | 37.47 |
| InternVL2.5-8B | 36.28 | 29.46 | 31.88 | 43.80 |
| LLaVA-OV-7B | 31.20 | 21.79 | 26.13 | 40.14 |
| Qwen2-VL-7B | 30.74 | 27.52 | 20.44 | 37.03 |
| Qwen2.5-VL-7B | 33.07 | 28.75 | 22.97 | 40.27 |
| LLaVA-v1.5-7B | 23.65 | 10.85 | 27.50 | 34.09 |
| LLaVA-v1.6-7B | 13.21 | 8.53 | 4.79 | 20.18 |
| 🟥 Fine-tuned | | | | |
| InternVL2.5-8B + SPAR-mix | 63.25 | 65.53 | 63.01 | 60.19 |

⚠️ Note: Fine-tuned models (such as InternVL2.5-8B + SPAR-mix) are excluded from direct comparison, as they are trained on SPAR-7M and thus not evaluated in a zero-shot setting.

  • Avg is the mean accuracy across all 20 tasks in SPAR-Bench.
  • Low, Medium, and High are means over task subsets that contain different numbers of tasks.
  • So Avg ≠ the average of Low / Medium / High (see the sketch below).
  • Only a subset of models is shown; see our paper for complete results.
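
To illustrate the notes above, here is a toy calculation with made-up per-task accuracies (the number of tasks per level is an assumption, not the actual SPAR-Bench split): because the levels contain different numbers of tasks, the macro-average over all 20 tasks differs from the unweighted mean of the three level means.

```python
# Hypothetical per-task accuracies grouped by cognitive level (illustrative numbers and
# task counts, not the real SPAR-Bench tasks).
per_task_acc = {
    "low":    [30.0, 28.0, 26.0, 31.0, 29.0, 27.0],            # e.g. 6 perception tasks
    "medium": [23.0, 25.0],                                     # e.g. 2 cross-view tasks
    "high":   [45.0, 47.0, 44.0, 46.0, 48.0, 43.0, 45.0,
               44.0, 46.0, 47.0, 45.0, 44.0],                   # e.g. 12 reasoning tasks
}

all_tasks = [acc for accs in per_task_acc.values() for acc in accs]
avg = sum(all_tasks) / len(all_tasks)                           # mean over all 20 tasks
level_means = {k: sum(v) / len(v) for k, v in per_task_acc.items()}
mean_of_levels = sum(level_means.values()) / 3                  # unweighted mean of level means

print(f"Avg over tasks:      {avg:.2f}")             # 38.15
print(f"Mean of level means: {mean_of_levels:.2f}")  # 32.61 -- generally differs from Avg
```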

🏗️ Dataset Construction

Data construction pipeline

Overview of our data construction pipeline. It consists of three main stages.

  1. Scene Pre-processing: We sample keyframes from raw video and extract 3D metadata such as depth, camera pose, and object bounding boxes.
  2. Scene Structuring: All frames, objects, and camera metadata are stored in a unified database format. This enables flexible multi-view and spatial queries for QA generation.
  3. Multi-task QA Generation: We design task-specific templates and automatically fill in questions and answers by selecting tasks, object types, image views, and answer formats. This enables systematic generation of 33 spatial QA types.
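
To make stage 3 concrete, below is a toy sketch of template-based QA generation for one task type. The template strings, metadata fields, and helper functions are our own illustration under an assumed scene record; the released pipeline covers 33 task types, multiple view selections, and several answer formats.

```python
import random

# Hypothetical templates for two of the spatial task types; only "distance" is exercised below.
TEMPLATES = {
    "distance": "How far is the {obj_a} from the {obj_b} in meters?",
    "depth":    "What is the depth of the {obj} from the camera in frame {frame_id}?",
}

def euclidean(p, q):
    """Straight-line distance between two 3D points (meters)."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def generate_distance_qa(scene):
    """Fill the distance template from scene metadata (object labels and 3D centers)."""
    obj_a, obj_b = random.sample(scene["objects"], 2)
    question = TEMPLATES["distance"].format(obj_a=obj_a["label"], obj_b=obj_b["label"])
    answer = round(euclidean(obj_a["center"], obj_b["center"]), 2)
    return {"task": "distance", "question": question, "answer": answer}

# Toy scene record with the kind of metadata extracted in stage 1 and stored in stage 2.
scene = {
    "objects": [
        {"label": "sofa",  "center": (1.0, 0.5, 2.0)},
        {"label": "table", "center": (2.5, 0.4, 3.0)},
        {"label": "lamp",  "center": (0.2, 1.1, 1.5)},
    ]
}
print(generate_distance_qa(scene))
```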

🧭 3D Grounding Module

We propose a novel 3D grounding module that integrates seamlessly with Vision-Language Models (VLMs), enabling precise spatial localization from both video and multi-view inputs.

For video inputs, grounding is transformed into a frame selection task, followed by mono-frame 3D localization via predicted UV coordinates, depth, and size. In multi-view settings, the object is localized in a selected frame and then back-projected to recover its full 3D bounding box.

An optional refinement stage further improves accuracy by matching the predicted box against the scene's proposal boxes based on geometric similarity.
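
The numpy sketch below illustrates the two geometric steps described above: lifting a predicted pixel location and depth into world coordinates with known camera intrinsics and pose, and optionally snapping the result to the closest scene proposal box. The pinhole/OpenCV camera convention, the variable names, and the simple L2 similarity over center and size are our own assumptions; the actual module's parameterization and matching criterion may differ.

```python
import numpy as np

def backproject_to_world(uv, depth, K, cam_to_world):
    """Lift a predicted pixel (u, v) with depth (meters) to a 3D point in the world frame.

    K            : 3x3 camera intrinsics.
    cam_to_world : 4x4 camera-to-world pose (OpenCV-style camera frame assumed).
    """
    u, v = uv
    x_cam = (u - K[0, 2]) * depth / K[0, 0]   # pixel -> camera coordinates
    y_cam = (v - K[1, 2]) * depth / K[1, 1]
    p_cam = np.array([x_cam, y_cam, depth, 1.0])
    return (cam_to_world @ p_cam)[:3]          # camera -> world coordinates

def refine_with_proposals(center, size, proposals):
    """Optional refinement: pick the scene proposal box closest to the predicted geometry.

    `proposals` is a list of (center, size) pairs; a plain L2 distance over concatenated
    center and size stands in for the geometric similarity used in the paper.
    """
    pred = np.concatenate([center, size])
    return min(proposals, key=lambda p: np.linalg.norm(np.concatenate(p) - pred))

# Toy example with made-up intrinsics, pose, predictions, and proposals.
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
center = backproject_to_world(uv=(400.0, 260.0), depth=2.0, K=K, cam_to_world=cam_to_world)
size = np.array([0.6, 0.8, 0.5])
proposals = [(np.array([0.3, 0.1, 2.0]), np.array([0.6, 0.8, 0.5])),
             (np.array([2.0, 1.0, 4.0]), np.array([1.2, 0.4, 0.7]))]
print(center, refine_with_proposals(center, size, proposals))
```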

🌍 Generalization Across Benchmarks

We evaluate how pretraining with SPAR-mix affects generalization on various spatial understanding and 2D vision-language benchmarks.

The Base VLM is InternVL2.5-8B, trained with high-quality proprietary data. To ensure a fair comparison, we additionally report two variants trained at the same 2M data scale: +EMOVA-2M and +SPAR-mix.

Bold values indicate the better result between the two open-source variants (+EMOVA-2M vs. +SPAR-mix).

| Method | VSI-Bench | CV-Bench 2D | CV-Bench 3D | BLINK | 3DSRBench | Seed-Image | MME | MMBench | RealWorldQA | TextVQA |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT-4v | – | 64.3 | 73.8 | 51.14 | – | 71.6 | 1927 | 75.0 | 61.4 | 77.4 |
| GPT-4o | 34.0 | – | – | 60.04 | 45.3 | 77.1 | 2310 | 83.4 | 75.4 | – |
| Cambrian-8B | – | 72.3 | 72.0 | – | – | 74.7 | 1547 | 75.9 | 64.2 | 77.8 |
| LLaVA-OV-7B | 25.3 | – | – | 48.2 | 44.1 | – | 1998 | 80.8 | 66.3 | – |
| InternVL2-8B | 34.6 | – | – | – | – | 75.4 | 2215 | 81.7 | – | 77.4 |
| Base VLM | 32.4 | 74.20 | 78.50 | 46.61 | 58.33 | 76.53 | 2323 | 84.45 | 66.67 | 68.73 |
| +EMOVA-2M | 24.5 | 66.27 | 64.83 | 42.40 | 55.25 | **73.8** | **2186** | **80.24** | 63.14 | **63.78** |
| +SPAR-mix | **41.1** | **72.25** | **89.08** | **43.92** | **57.48** | 73.2 | 2163 | 79.90 | **64.71** | 62.91 |

While the Base VLM is pretrained on high-quality proprietary data and achieves strong performance, we introduce +EMOVA-2M as a fair open-source baseline for comparison.

Under equal data scale (2M), our proposed +SPAR-mix significantly improves performance over EMOVA-only training on spatial benchmarks (e.g., VSI-Bench, CV-Bench 3D), demonstrating the value of our spatial QA data.

🛰️ Evaluation on 3DQA and 3D Grounding

💬 3D Question Answering

Performance on SQA3D (EM@1) and ScanQA (BLEU-4, CIDEr)

| Method | EM@1 | BLEU-4 | CIDEr |
| --- | --- | --- | --- |
| 3D-LLM | – | 12.0 | 69.4 |
| Chat-3D v2 | 54.7 | 14.0 | 87.6 |
| LEO | 50.0 | 13.2 | 101.4 |
| LL3DA | – | 13.5 | 76.8 |
| Scene-LLM | 54.2 | 12.0 | 80.0 |
| SPAR-mix | 58.1 | 15.3 | 90.7 |

📡 3D Grounding (ScanRefer)

Accuracy@0.25 and @0.5. Values in brackets are without refinement.

| Method | Acc@0.25 | Acc@0.5 |
| --- | --- | --- |
| ScanRefer | 37.3 | 24.3 |
| MVT | 40.8 | 33.3 |
| ViL3DRel | 47.9 | 37.7 |
| 3D-LLM | 30.3 | – |
| Chat-3D v2 | 35.9 | 30.4 |
| Grounded 3D-LLM | 47.9 | 44.1 |
| SPAR-mix | 48.8 (31.9) | 43.1 (12.4) |
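
For reference, Acc@0.25 and Acc@0.5 count a prediction as correct when its 3D IoU with the ground-truth box exceeds 0.25 or 0.5, respectively. The sketch below computes these scores for axis-aligned boxes given as (center, size); this is a simplification of our own, and the exact evaluation protocol may differ.

```python
import numpy as np

def aabb_iou(center_a, size_a, center_b, size_b):
    """3D IoU of two axis-aligned boxes specified by center and size."""
    a_min, a_max = np.array(center_a) - np.array(size_a) / 2, np.array(center_a) + np.array(size_a) / 2
    b_min, b_max = np.array(center_b) - np.array(size_b) / 2, np.array(center_b) + np.array(size_b) / 2
    inter = np.clip(np.minimum(a_max, b_max) - np.maximum(a_min, b_min), 0, None).prod()
    union = np.prod(size_a) + np.prod(size_b) - inter
    return inter / union

def acc_at_iou(predictions, ground_truths, threshold):
    """Fraction of predicted boxes whose IoU with the matching ground truth exceeds the threshold."""
    hits = [aabb_iou(*pred, *gt) > threshold for pred, gt in zip(predictions, ground_truths)]
    return sum(hits) / len(hits)

# Toy example: two predicted boxes vs. their ground truths, each given as (center, size).
preds = [((1.0, 1.0, 1.0), (1.0, 1.0, 1.0)), ((4.0, 0.0, 0.0), (2.0, 1.0, 1.0))]
gts   = [((1.1, 1.0, 1.0), (1.0, 1.0, 1.0)), ((4.8, 0.0, 0.0), (2.0, 1.0, 1.0))]
print(acc_at_iou(preds, gts, 0.25), acc_at_iou(preds, gts, 0.5))  # 1.0 0.5
```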

📚 BibTeX

If you find our work helpful, please consider citing us:

@article{zhang2025from,
  title={From Flatland to Space: Teaching Vision-Language Models to Perceive and Reason in 3D},
  author={Zhang, Jiahui and Chen, Yurui and Zhou, Yanpeng and Xu, Yueming and Huang, Ze and Mei, Jilin and Chen, Junhui and Yuan, Yujie and Cai, Xinyue and Huang, Guowei and Quan, Xingyue and Xu, Hang and Zhang, Li},
  year={2025},
  journal={arXiv preprint arXiv:2503.22976}
}