DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning (EMNLP 2025)

¹University of California, Merced, ²The University of Queensland,
³vivo Mobile Communication Co., Ltd
† indicates corresponding author

Abstract

Grounding natural language queries in graphical user interfaces (GUIs) poses unique challenges due to the diversity of visual elements, spatial clutter, and the ambiguity of language. In this paper, we introduce DiMo-GUI, a training-free framework for GUI grounding that leverages two core strategies: modality-wise decomposition and progressive visual zoom-in. Instead of treating the GUI as a monolithic image, our method splits the input into textual elements and iconic elements, allowing the model to reason over each modality independently using general-purpose vision-language models (VLMs). When predictions are ambiguous or incorrect, DiMo-GUI dynamically focuses attention by generating candidate focal regions centered on the model's initial predictions and incrementally zooms into subregions to refine the grounding result. This hierarchical refinement process helps disambiguate visually crowded layouts without the need for additional training or annotations. We evaluate our approach on standard GUI grounding benchmarks and demonstrate consistent improvements over baseline inference pipelines, highlighting the effectiveness of combining modality separation with region-focused reasoning.
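The progressive zoom-in described above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single focal region per step, a fixed number of steps, and a fixed shrink factor, whereas the abstract describes generating candidate regions dynamically. The names `focal_region`, `progressive_zoom`, `predict`, `steps`, and `shrink` are hypothetical; `predict` stands in for a VLM call that looks at a crop of the screenshot and returns a point in full-screenshot pixel coordinates.

```python
from typing import Callable, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in screenshot pixels
Point = Tuple[float, float]              # (x, y) in screenshot pixels


def focal_region(center: Point, width: float, height: float, bounds: Box) -> Box:
    """Return a width x height box centered on `center`, shifted to stay inside `bounds`."""
    bx0, by0, bx1, by1 = bounds
    # The region can never be larger than the screenshot itself.
    width = min(width, bx1 - bx0)
    height = min(height, by1 - by0)
    # Center the box on the prediction, then clamp it to the screenshot.
    x0 = min(max(center[0] - width / 2, bx0), bx1 - width)
    y0 = min(max(center[1] - height / 2, by0), by1 - height)
    return (x0, y0, x0 + width, y0 + height)


def progressive_zoom(
    predict: Callable[[Box], Point],  # hypothetical stand-in for a VLM grounding call
    screen: Box,
    steps: int = 3,        # assumed fixed depth; the paper's stopping rule may differ
    shrink: float = 0.5,   # assumed per-step scale factor
) -> Point:
    """Refine a grounding prediction by repeatedly zooming in around it."""
    region = screen
    point = predict(region)  # initial prediction on the full screenshot
    for _ in range(steps):
        width = (region[2] - region[0]) * shrink
        height = (region[3] - region[1]) * shrink
        region = focal_region(point, width, height, screen)
        point = predict(region)  # re-predict inside the smaller region
    return point
```

Clamping the region to the screenshot bounds keeps every crop the same size at a given step, even when the initial prediction lies near an edge of the interface.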

@article{wu2025dimo,
  title={DiMo-GUI: Advancing Test-time Scaling in GUI Grounding via Modality-Aware Visual Reasoning},
  author={Wu, Hang and Chen, Hongkai and Cai, Yujun and Liu, Chang and Ye, Qingwen and Yang, Ming-Hsuan and Wang, Yiwei},
  year={2025}
}

Given an instruction and a screenshot, DiMo-GUI searches the text elements and the icon elements separately.
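As a rough illustration of searching each modality separately, the sketch below groups elements into text and icon sets, picks the best candidate in each group, and then chooses between the per-modality winners. Everything here is an assumption made for illustration: the `Element` record, the `score` callable (a stand-in for a VLM judging how well an element matches the instruction), the toy `word_overlap` scorer, and the final selection rule are hypothetical and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Element:
    kind: str                                # "text" or "icon" (assumed labels)
    label: str                               # visible text or a short icon description
    box: Tuple[float, float, float, float]   # (x0, y0, x1, y1) in screenshot pixels


def split_by_modality(elements: List[Element]) -> Dict[str, List[Element]]:
    """Group elements so each modality can be searched on its own."""
    groups: Dict[str, List[Element]] = {"text": [], "icon": []}
    for element in elements:
        groups.setdefault(element.kind, []).append(element)
    return groups


def ground_by_modality(
    instruction: str,
    elements: List[Element],
    score: Callable[[str, Element], float],  # hypothetical stand-in for a VLM judgment
) -> Optional[Element]:
    """Pick one candidate per modality, then choose between the candidates."""
    candidates = []
    for group in split_by_modality(elements).values():
        if group:
            candidates.append(max(group, key=lambda e: score(instruction, e)))
    if not candidates:
        return None
    return max(candidates, key=lambda e: score(instruction, e))


def word_overlap(instruction: str, element: Element) -> float:
    """Toy scorer: number of lowercase words shared by the instruction and the label."""
    return float(len(set(instruction.lower().split()) & set(element.label.lower().split())))
```

For example, `ground_by_modality("click the save file button", elements, word_overlap)` compares the best text match against the best icon match rather than ranking all elements in one pool.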

By integrating DiMo-GUI, existing models can achieve significant performance improvements on the ScreenSpot-Pro benchmark.

Framework

Quantitative Results

BibTeX