LEGO-Eval: Towards Fine-grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation

Gyeom Hwangbo1*, Hyungjoo Chae2*, Minseok Kang1, Hyeonjong Ju1, Soohyun Oh1, Jinyoung Yeo1
1Yonsei University, 2Georgia Institute of Technology

Abstract

Despite recent progress in using Large Language Models (LLMs) to automatically generate 3D scenes, the generated scenes often lack the realistic spatial layouts and object attributes found in real-world environments. Because this problem stems from insufficiently detailed, coarse-grained instructions, advancing 3D scene synthesis guided by detailed, fine-grained instructions that reflect real-world environments becomes crucial. Training embodied agents in unrealistic scenes can lead them to learn priors that diverge significantly from real-world physics and semantics, degrading their performance when deployed. Verifying the alignment between a fine-grained instruction and the generated scene is therefore essential for effective learning. However, current evaluation methods, such as CLIPScore and vision-language models (VLMs), often fail to reliably assess this alignment, primarily because their shallow understanding of 3D scenes leaves scene components improperly grounded. To address this, we introduce LEGO-Eval, an evaluation framework equipped with diverse tools designed to explicitly ground scene components, enabling more accurate alignment assessment. We also present LEGO-Bench, a benchmark of detailed instructions that specify the complex layouts and attributes of real-world environments. Experiments demonstrate that LEGO-Eval outperforms VLM-as-a-judge by 0.41 in F1 score when assessing scene-instruction alignment. Benchmarking with LEGO-Bench reveals significant limitations in current generation methods: across all evaluated approaches, success rates reach at most 10% in generating scenes that fully align with fine-grained instructions.

LEGO-Eval

Overview

Overview image

Overview. LEGO-Eval begins by taking a fine-grained text instruction and automatically identifying the individual constraints that describe layout, materials, objects, and placements. It then plans a sequence of tool executions, selecting from 21 tools that retrieve visual, textual, and multimodal information from the 3D scene. Next, it selects the appropriate tool arguments (specific rooms, walls, or objects) and executes the tools to gather evidence about each constraint. Finally, LEGO-Eval validates whether each constraint is satisfied and aggregates all binary results into an interpretable overall score showing how well the generated scene aligns with the instruction. A sketch of this loop appears below.
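
The following Python sketch illustrates the control flow of this pipeline. It is a minimal illustration, not the released implementation: the helper functions (decompose_instruction, plan_tools, run_tool, validate) and the Constraint and ToolCall structures are hypothetical stand-ins for the LLM- and tool-backed components described above.

from dataclasses import dataclass

@dataclass
class Constraint:
    text: str        # e.g. "a wooden dining table sits next to the window"
    category: str    # layout | material | object | placement

@dataclass
class ToolCall:
    name: str        # one of the framework's 21 grounding tools
    args: dict       # scene-specific arguments: a room, wall, or object id

def decompose_instruction(instruction: str) -> list[Constraint]:
    """Placeholder: an LLM would split the instruction into constraints."""
    return [Constraint(text=s.strip(), category="object")
            for s in instruction.split(".") if s.strip()]

def plan_tools(constraint: Constraint) -> list[ToolCall]:
    """Placeholder: an LLM would pick tools and arguments per constraint."""
    return [ToolCall(name="describe_object", args={"query": constraint.text})]

def run_tool(call: ToolCall, scene: dict) -> str:
    """Placeholder: would render views or query metadata from the 3D scene."""
    return scene.get(call.args["query"], "")

def validate(constraint: Constraint, evidence: list[str]) -> bool:
    """Placeholder: an LLM judge would compare evidence to the constraint."""
    return any(constraint.text in e for e in evidence)

def evaluate_scene(instruction: str, scene: dict) -> float:
    """Return the fraction of constraints the scene satisfies."""
    constraints = decompose_instruction(instruction)
    results = []
    for c in constraints:
        evidence = [run_tool(call, scene) for call in plan_tools(c)]
        results.append(validate(c, evidence))
    return sum(results) / max(len(results), 1)

Because each constraint receives its own binary verdict, the aggregated score is directly interpretable: it is the fraction of the instruction's constraints that the scene actually satisfies.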

Results

Results image

Comparison of evaluation methods. All evaluation methods are tested on LEGO-Bench together with additional scenes intentionally constructed to be misaligned with their instructions, so that both correct and incorrect cases are represented. Performance is measured using F1 score, precision, recall, and Cohen's kappa at both the holistic level (entire instructions) and the partial level (individual constraints). An asterisk (*) denotes methods that cannot evaluate all constraints; these are assessed either on a measurable subset ('Measurable') or by treating unevaluable constraints as incorrect ('Full').
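
For reference, these metrics are standard and can be computed from paired binary judgments. The sketch below assumes one gold (human) and one predicted (evaluator) verdict per constraint; the example arrays are illustrative, and the all-constraints-satisfied rule in holistic() is our reading of the holistic level, not a quoted definition.

from sklearn.metrics import (
    f1_score, precision_score, recall_score, cohen_kappa_score,
)

# Partial level: one binary verdict per constraint (1 = satisfied).
gold = [1, 0, 1, 1, 0, 1, 0, 1]   # illustrative human judgments
pred = [1, 0, 1, 0, 0, 1, 1, 1]   # illustrative evaluator judgments

print(f"F1:        {f1_score(gold, pred):.2f}")
print(f"Precision: {precision_score(gold, pred):.2f}")
print(f"Recall:    {recall_score(gold, pred):.2f}")
print(f"Kappa:     {cohen_kappa_score(gold, pred):.2f}")

# Holistic level (assumed aggregation): an instruction counts as aligned
# only if every one of its constraints is satisfied; the same metrics are
# then applied to these per-instruction verdicts.
def holistic(verdicts_per_instruction: list[list[int]]) -> list[int]:
    return [int(all(v)) for v in verdicts_per_instruction]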

LEGO-Bench

Dataset image

Dataset Description. LEGO-Bench is a benchmark designed to evaluate text-guided 3D scene synthesis using fine-grained, realistic instructions. Each instruction contains multiple constraints describing layout, materials, objects, and placements, reflecting the compositional complexity of real-world indoor scenes. The dataset includes 130 instructions paired with manually aligned 3D scenes, totaling 1,250 annotated constraints. On average, each instruction contains about 10 constraints, covering both architectural elements (walls, floors, doors, windows) and object-level relationships. Together, these detailed annotations enable systematic, constraint-level evaluation of how well generated scenes satisfy natural-language specifications.
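
To make the annotation granularity concrete, the record below shows what a single LEGO-Bench entry might look like. The field names and values are hypothetical illustrations, not the released schema; only the constraint categories (layout, materials, objects, placements) come from the description above.

# Hypothetical sketch of one benchmark record; not the released format.
example_record = {
    "instruction": "A cozy living room with a gray fabric sofa facing the TV, "
                   "light oak flooring, and a window on the north wall.",
    "scene_id": "scene_0042",   # manually aligned 3D scene
    "constraints": [
        {"id": 1, "category": "layout",
         "text": "a window is placed on the north wall"},
        {"id": 2, "category": "material",
         "text": "the floor is light oak wood"},
        {"id": 3, "category": "placement",
         "text": "the sofa faces the TV"},
        # ... about 10 constraints per instruction on average
    ],
}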

BibTeX

@article{hwangbo2025lego,
title   = {LEGO-Eval: Towards Fine-Grained Evaluation on Synthesizing 3D Embodied Environments with Tool Augmentation},
author  = {Hwangbo, Gyeom and Chae, Hyungjoo and Kang, Minseok and Ju, Hyeonjong and Oh, Soohyun and Yeo, Jinyoung},
journal = {arXiv preprint arXiv:2511.03001},
year    = {2025}
}