Introduction

High-resolution image (HRI) understanding aims to process images with a large number of pixels, such as pathological images and agricultural aerial images, both of which can exceed 1 million pixels. Vision Large Language Models (VLMs) typically handle higher-resolution images through dynamic patching. However, the field lacks a comprehensive benchmark for evaluating HRI understanding in VLMs, leaving this domain underexplored. To address this gap, we introduce HRScene, a novel unified benchmark for HRI understanding with rich scenes. HRScene incorporates 25 real-world datasets and 2 synthetic diagnostic datasets with resolutions ranging from 1,024 × 1,024 to 35,503 × 26,627. HRScene is collected and re-annotated by 10 graduate-level annotators, covering 25 scenarios, ranging from microscopic and radiology images to street views, long-range pictures, and telescope images. It includes high-resolution images of real-world objects, scanned documents, and composite multi-image inputs. The two diagnostic evaluation datasets are synthesized by combining the target image containing the gold answer with visually similar distracting images in different orders. These datasets assess how well models utilize HRI by comparing performance across different image regions. We conduct extensive experiments involving 27 VLMs, including Gemini 2.0 Pro and GPT-4o. Experiments on HRScene show that current VLMs achieve an average accuracy of around 50% on real-world tasks, revealing significant gaps in HRI understanding. Results on our synthetic datasets show that VLMs struggle to effectively utilize HRI regions compared to low-resolution images, with a gap exceeding 20%. Our code and data will be publicly available.

Leaderboard on HRScene (testmini)

Accuracy scores on the testmini subset (1,000 examples) of HRScene.

Leaderboard on HRScene (test)

Accuracy scores on the test subset (5,323 examples with private ground truth) of HRScene.

🚨 Open-source models are marked in green and closed-source models are marked in red.

🚨 You are welcome to submit your model to this leaderboard!
To do so, we provide a simple pipeline for automatic model prediction and submission via EvalAI!
You can find the pipeline in our GitHub repo under the "🔮 Evaluations on HRScene for RealWorld Task" section.

HRScene Dataset

Overview

HRScene is collected from 25 existing data resources, 8 of which are re-annotated by 10 graduate-level annotators, with diverse view scales ranging from microscope and radiology images to street views, long-range pictures, and telescope images. It contains high-resolution images of real objects, electronic documents, and composite multi-subimage inputs. In addition, six datasets require domain-expert knowledge, while the remaining 19 belong to general domains. The diagnostic dataset is synthesized by combining the target image containing the gold answer with visually similar distractors arranged in different orders to assess HRI utilization. Overall, HRScene comprises 7,081 images, 2,008 of which are re-annotated.
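The diagnostic construction described above (a target tile placed among visually similar distractors at varying grid positions) can be sketched as follows. This is a minimal illustration of the idea, not the authors' actual pipeline; the function name, parameters, and grid layout are assumptions.

```python
from PIL import Image

def compose_diagnostic(target, distractors, position, grid=3, tile=256):
    """Place `target` at cell `position` (row-major) in a grid x grid montage
    of distractor images, mimicking the diagnostic-set construction.

    Sketch only: names and layout are assumptions, not the authors' API.
    """
    if len(distractors) < grid * grid - 1:
        raise ValueError("need at least grid*grid - 1 distractor images")
    canvas = Image.new("RGB", (grid * tile, grid * tile))
    tiles = list(distractors[: grid * grid - 1])
    tiles.insert(position, target)  # vary `position` to probe different regions
    for i, img in enumerate(tiles):
        row, col = divmod(i, grid)
        canvas.paste(img.resize((tile, tile)), (col * tile, row * tile))
    return canvas
```

Sweeping `position` over all cells and comparing per-cell accuracy is how the diagnostic sets reveal whether a model attends uniformly to every region of a high-resolution image.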

HRScene consists of 7,073 samples, divided into three splits:

  • val contains 750 samples. All of these samples are human-annotated, and the split is designed for fine-grained validation of the user's VLM settings.
  • testmini comprises 1,000 samples, drawn from each HRScene real-world dataset, intended for rapid model development evaluation or for users with limited computing resources.
  • test features the remaining 5,323 samples for standard evaluation. Notably, the answer labels for test will not be publicly released, to ensure fair evaluation. Instead, we maintain an online evaluation platform for user submissions.
You can download the dataset from Hugging Face Datasets.
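Loading a split might look like the sketch below. The dataset id is a placeholder (take the real one from the Hugging Face page); the split sizes come from the description above.

```python
# Split sizes as described above: 750 + 1,000 + 5,323 = 7,073 samples.
SPLITS = {"val": 750, "testmini": 1000, "test": 5323}

def load_hrscene(split="testmini"):
    """Load one HRScene split via the `datasets` library (pip install datasets).

    "ORG_OR_USER/HRScene" is a placeholder, not the real dataset id —
    replace it with the id from the Hugging Face Datasets page.
    """
    if split not in SPLITS:
        raise ValueError(f"unknown split {split!r}; expected one of {sorted(SPLITS)}")
    from datasets import load_dataset  # imported lazily; network access required
    return load_dataset("ORG_OR_USER/HRScene", split=split)
```

Note that the test split ships without answer labels, so predictions on it must be scored through the online evaluation platform.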


Key statistics of HRScene.


Source dataset distribution of HRScene.

Examples

One example for each category in HRScene.

Experiment Results

Results on 25 Real-World Datasets

Visualization Examples

🚨 We also provide a convenient tool to diagnose your own model with 5 lines of code!
Here is the Hugging Face Dataset, and the code for diagnosis is in our GitHub repo under the "🔮 Evaluations on HRScene for Diagnosis Task" section.

Explorer

Explore the outputs of each model on HRScene.

BibTeX


      @article{zhang2025hrscene,
        title={HRScene: How Far Are VLMs from Effective High-Resolution Image Understanding?},
        author={Zhang, Yusen and Zheng, Wenliang and Madasu, Aashrith and Shi, Peng and Kamoi, Ryo and Zhou, Hao and Zou, Zhuoyang and Zhao, Shu and Das, Sarkar Snigdha Sarathi and Gupta, Vipul and Lu, Xiaoxin and Zhang, Nan and Zhang, Ranran Haoran and Iyer, Avitej and Lou, Renze and Yin, Wenpeng and Zhang, Rui},
        journal={arXiv preprint},
        year={2025}
      }