ViInfographicVQA - Our paper has been accepted!
Dương Trường Bình (Truong-Binh Duong)

ViInfographicVQA: A New Benchmark for Vietnamese Infographic Visual Question Answering
AI VIETNAM Lab | December 2025
We are excited to share our latest research, ViInfographicVQA, the first benchmark dedicated to Visual Question Answering (VQA) on Vietnamese infographics. This work addresses a long-standing gap in multilingual multimodal research, where existing benchmarks remain predominantly English-centric and largely limited to natural images or simple scene text.
Motivation
Infographics are a uniquely challenging medium. They integrate text, charts, maps, icons, and structured layouts into a single visual, demanding that models simultaneously perform OCR, layout understanding, and numerical or semantic reasoning. Despite growing interest in document-oriented VQA, no prior benchmark targeted this problem for Vietnamese, let alone for tasks requiring reasoning across multiple related infographics.
Dataset Overview
ViInfographicVQA is built from real-world infographics sourced from infographics.vn, covering topics such as Economics, Healthcare, Culture and Society, Disaster and Accident, and Sports and Arts.
The dataset contains 6,747 infographics and 20,409 human-verified question-answer pairs, organized into two complementary evaluation settings:
The single-image task follows the standard VQA setup, where each question is answered from a single infographic. Answer sources are categorized as Image-span, Question-span, Multi-span, and Non-extractive.
The multi-image task requires synthesizing evidence across multiple semantically related infographics. To our knowledge, this is the first Vietnamese benchmark to evaluate cross-image reasoning in VQA. Multi-image questions are categorized as Multi-image Span, Cross-image Synthesis, and Non-Span.
| Subset | Images | Groups | QA Pairs |
|---|---|---|---|
| Single-image (train) | 1,788 | — | 12,521 |
| Single-image (test) | 192 | — | 1,374 |
| Multi-image (train) | 4,823 | 2,090 | 5,878 |
| Multi-image (test) | 509 | 218 | 636 |
| Total train | 6,096 | 2,090 | 18,399 |
| Total test | 651 | 218 | 2,010 |
Dataset Construction
The pipeline consists of three stages. In the Data Curation stage, infographics are collected and filtered by geometry constraints (aspect ratio within [0.33, 3.00], shorter side at least 512 pixels), then grouped into semantically coherent sets using multimodal embeddings combining OCR text, visual descriptors, and layout signals.
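The geometry constraints above can be sketched as a simple predicate. This is an illustrative sketch, not the paper's actual curation code; the function and constant names are assumptions, and only the thresholds ([0.33, 3.00] aspect ratio, 512-pixel shorter side) come from the text.

```python
# Sketch of the Data Curation geometry filter: keep an infographic only
# if its width/height aspect ratio lies in [0.33, 3.00] and its shorter
# side is at least 512 pixels. Names are illustrative, not from the paper.

MIN_SHORT_SIDE = 512
MIN_ASPECT, MAX_ASPECT = 0.33, 3.00

def passes_geometry_filter(width: int, height: int) -> bool:
    """True if the image satisfies both curation constraints."""
    if width <= 0 or height <= 0:
        return False
    aspect = width / height
    return (
        min(width, height) >= MIN_SHORT_SIDE
        and MIN_ASPECT <= aspect <= MAX_ASPECT
    )
```

For example, a 2000x512 banner passes the shorter-side check but is rejected for its ~3.9 aspect ratio, which filters out extreme strip-shaped images before grouping.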
In the Data Generation stage, a VLM-assisted pipeline (Gemini 2.0 Flash) extracts meaningful regions from each infographic and generates structured QA pairs from rule-based prompt templates matched to element types and answer-source categories. For multi-image questions, grouped infographics are used to elicit questions that genuinely require cross-image evidence.
In the Data Refinement stage, automated VLM validation is followed by human review on a dedicated annotation platform, ensuring linguistic naturalness, semantic faithfulness, and answer consistency.
Experimental Results
We evaluate a range of recent open-source vision-language models (VLMs) using Average Normalized Levenshtein Similarity (ANLS) as the primary metric.
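For readers unfamiliar with the metric, ANLS scores each prediction by its best edit-distance similarity to any accepted ground-truth answer, zeroing out matches below a threshold (commonly 0.5 in document and infographic VQA). A minimal sketch, assuming the standard ST-VQA/DocVQA-style definition with lowercase/whitespace normalization; the paper's exact preprocessing may differ.

```python
# Minimal sketch of ANLS (Average Normalized Levenshtein Similarity).
# Assumes the standard definition: score = 1 - NL if NL < tau, else 0,
# averaged over questions and maximized over reference answers.

def levenshtein(s: str, t: str) -> int:
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1]

def anls(predictions, references, tau: float = 0.5) -> float:
    """ANLS over a batch.

    predictions: list of predicted answer strings
    references:  list of lists of acceptable ground-truth strings
    tau:         normalized-distance threshold (0.5 is conventional)
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            if nl < tau:
                best = max(best, 1.0 - nl)
        total += best
    return total / max(len(predictions), 1)
```

The threshold makes the metric forgiving of minor OCR-style typos while assigning zero credit to answers that differ by more than half their characters.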
Single-image Task
| Model | Overall | Image-span | Multi-span | Non-extractive | Question-span |
|---|---|---|---|---|---|
| Phi-4-multimodal-5B | 41.91 | 45.36 | 23.90 | 39.83 | 58.84 |
| VideoLLaMA3 Image-7B | 49.60 | 55.38 | 35.52 | 44.16 | 63.39 |
| MiniCPM-o2.6-8B | 57.03 | 62.52 | 44.63 | 53.76 | 67.20 |
| Qwen2.5-VL-7B | 67.42 | 73.54 | 50.39 | 65.63 | 80.14 |
| Qwen2.5-VL-7B (Finetuned) | 67.80 | 72.69 | 53.89 | 63.93 | 81.10 |
| InternVL3.5-8B | 67.02 | 73.31 | 49.30 | 65.70 | 79.76 |
| Ovis2.5-9B | 71.02 | 78.19 | 61.42 | 61.24 | 83.21 |
Multi-image Task
| Model | Overall | Cross-image Synthesis | Multi-image Span | Non-Span |
|---|---|---|---|---|
| Phi-4-multimodal-5B | 20.95 | 20.93 | 24.59 | 14.60 |
| InternVL3.5-8B | 35.50 | 10.97 | 51.72 | 24.67 |
| MiniCPM-o2.6-8B | 40.55 | 35.22 | 48.74 | 30.05 |
| Qwen2.5-VL-7B | 54.92 | 56.34 | 58.33 | 47.96 |
| Qwen2.5-VL-7B (Finetuned) | 55.47 | 56.55 | 58.91 | 47.68 |
Fine-tuning Ablation (Qwen2.5-VL-7B)
| Training Configuration | Single-image | Multi-image | Combined |
|---|---|---|---|
| Zero-shot Inference | 67.42 | 54.92 | 63.43 |
| Fine-tuning (Single-image only) | 67.80 | 54.74 | 63.64 |
| Fine-tuning (Multi-image only) | 67.30 | 55.47 | 63.52 |
| Fine-tuning (Combined) | 68.53 | 55.62 | 64.41 |
Key Findings
The strongest VLMs reach ANLS scores in the mid-60s to low 70s on single-image questions, performing reliably on span-based answer types. However, performance drops by 12 to 32 ANLS points on multi-image questions, with Cross-image Synthesis and Non-Span questions proving especially difficult. Non-extractive reasoning also remains a consistent challenge, indicating that current models lack robust mechanisms for layout-aware arithmetic and multi-step inference.
Fine-tuning on combined Single-image and Multi-image data consistently yields the best overall performance, highlighting the complementary roles of both supervision types in building comprehensive infographic understanding.
Resources
The dataset and code are publicly available:
- GitHub: https://github.com/duongtruongbinh/ViInfographicVQA
- HuggingFace Dataset: https://huggingface.co/datasets/duytranus/ViInfographicVQA
This work was supported by AI VIETNAM. For inquiries, please contact the corresponding author at [email protected].