ViInfographicVQA - Our paper has been accepted!
Dương Trường Bình (Truong-Binh Duong)

ViInfographicVQA: A New Benchmark for Vietnamese Infographic Visual Question Answering
AI VIETNAM Lab | December 2025
We are excited to share our latest research, ViInfographicVQA, the first benchmark dedicated to Visual Question Answering (VQA) on Vietnamese infographics. This work addresses a long-standing gap in multilingual multimodal research, where existing benchmarks remain predominantly English-centric and largely limited to natural images or simple scene text.
Motivation
Infographics are a uniquely challenging medium. They integrate text, charts, maps, icons, and structured layouts into a single visual, demanding that models simultaneously perform OCR, layout understanding, and numerical or semantic reasoning. Despite growing interest in document-oriented VQA, no prior benchmark targeted this problem for Vietnamese, let alone for tasks requiring reasoning across multiple related infographics.
Dataset Overview
ViInfographicVQA is built from real-world infographics sourced from infographics.vn, covering topics such as Economics, Healthcare, Culture and Society, Disaster and Accident, and Sports and Arts.
The dataset contains 6,747 infographics and 20,409 human-verified question-answer pairs, organized into two complementary evaluation settings:
The single-image task follows the standard VQA setup, where each question is answered from a single infographic. Answer sources are categorized as Image-span, Question-span, Multi-span, and Non-extractive.
The multi-image task requires synthesizing evidence across multiple semantically related infographics. To our knowledge, this is the first Vietnamese benchmark to evaluate cross-image reasoning in VQA. Multi-image questions are categorized as Multi-image Span, Cross-image Synthesis, and Non-Span.
| Subset | Images | Groups | QA Pairs |
|---|---|---|---|
| Single-image (train) | 1,788 | — | 12,521 |
| Single-image (test) | 192 | — | 1,374 |
| Multi-image (train) | 4,823 | 2,090 | 5,878 |
| Multi-image (test) | 509 | 218 | 636 |
| Total train | 6,096 | 2,090 | 18,399 |
| Total test | 651 | 218 | 2,010 |
Dataset Construction
The pipeline consists of three stages. In the Data Curation stage, infographics are collected and filtered by geometry constraints (aspect ratio within [0.33, 3.00], shorter side at least 512 pixels), then grouped into semantically coherent sets using multimodal embeddings combining OCR text, visual descriptors, and layout signals.
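The geometry constraints above can be sketched as a simple predicate. This is an illustrative sketch, not the paper's actual curation code; the function and constant names are assumptions, and only the thresholds ([0.33, 3.00] aspect ratio, 512-pixel shorter side) come from the text.

```python
# Sketch of the Data Curation geometry filter: keep an infographic only
# if its width/height aspect ratio lies in [0.33, 3.00] and its shorter
# side is at least 512 pixels. Names are illustrative, not from the paper.

MIN_SHORT_SIDE = 512
MIN_ASPECT, MAX_ASPECT = 0.33, 3.00

def passes_geometry_filter(width: int, height: int) -> bool:
    """True if the image satisfies both curation constraints."""
    if width <= 0 or height <= 0:
        return False
    aspect = width / height
    return (
        min(width, height) >= MIN_SHORT_SIDE
        and MIN_ASPECT <= aspect <= MAX_ASPECT
    )
```

For example, a 2000x512 banner passes the shorter-side check but is rejected for its ~3.9 aspect ratio, which filters out extreme strip-shaped images before grouping.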
In the Data Generation stage, a VLM-assisted pipeline (Gemini 2.0 Flash) extracts meaningful regions from each infographic and generates structured QA pairs from rule-based prompt templates matched to element types and answer-source categories. For multi-image questions, grouped infographics are used to elicit questions that genuinely require cross-image evidence.
In the Data Refinement stage, automated VLM validation is followed by human review on a dedicated annotation platform, ensuring linguistic naturalness, semantic faithfulness, and answer consistency.
Experimental Results
We evaluate a range of recent open-source vision-language models (VLMs) using Average Normalized Levenshtein Similarity (ANLS) as the primary metric.
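For readers unfamiliar with the metric, ANLS scores each prediction by its best edit-distance similarity to any accepted ground-truth answer, zeroing out matches below a threshold (commonly 0.5 in document and infographic VQA). A minimal sketch, assuming the standard ST-VQA/DocVQA-style definition with lowercase/whitespace normalization; the paper's exact preprocessing may differ.

```python
# Minimal sketch of ANLS (Average Normalized Levenshtein Similarity).
# Assumes the standard definition: score = 1 - NL if NL < tau, else 0,
# averaged over questions and maximized over reference answers.

def levenshtein(s: str, t: str) -> int:
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(t) + 1))
    for i in range(1, len(s) + 1):
        curr = [i] + [0] * len(t)
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[-1]

def anls(predictions, references, tau: float = 0.5) -> float:
    """ANLS over a batch.

    predictions: list of predicted answer strings
    references:  list of lists of acceptable ground-truth strings
    tau:         normalized-distance threshold (0.5 is conventional)
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            nl = levenshtein(p, r) / max(len(p), len(r), 1)
            if nl < tau:
                best = max(best, 1.0 - nl)
        total += best
    return total / max(len(predictions), 1)
```

The threshold makes the metric forgiving of minor OCR-style typos while assigning zero credit to answers that differ by more than half their characters.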
Single-image Task
| Model | Overall | Image-span | Multi-span | Non-extractive | Question-span |
|---|---|---|---|---|---|
| Phi-4-multimodal-5B | 41.91 | 45.36 | 23.90 | 39.83 | 58.84 |
| VideoLLaMA3 Image-7B | 49.60 | 55.38 | 35.52 | 44.16 | 63.39 |
| MiniCPM-o2.6-8B | 57.03 | 62.52 | 44.63 | 53.76 | 67.20 |
| Qwen2.5-VL-7B | 67.42 | 73.54 | 50.39 | 65.63 | 80.14 |
| Qwen2.5-VL-7B (Finetuned) | 67.80 | 72.69 | 53.89 | 63.93 | 81.10 |
| InternVL3.5-8B | 67.02 | 73.31 | 49.30 | 65.70 | 79.76 |
| Ovis2.5-9B | 71.02 | 78.19 | 61.42 | 61.24 | 83.21 |
Multi-image Task
| Model | Overall | Cross-image Synthesis | Multi-image Span | Non-Span |
|---|---|---|---|---|
| Phi-4-multimodal-5B | 20.95 | 20.93 | 24.59 | 14.60 |
| InternVL3.5-8B | 35.50 | 10.97 | 51.72 | 24.67 |
| MiniCPM-o2.6-8B | 40.55 | 35.22 | 48.74 | 30.05 |
| Qwen2.5-VL-7B | 54.92 | 56.34 | 58.33 | 47.96 |
| Qwen2.5-VL-7B (Finetuned) | 55.47 | 56.55 | 58.91 | 47.68 |
Fine-tuning Ablation (Qwen2.5-VL-7B)
| Training Configuration | Single-image | Multi-image | Combined |
|---|---|---|---|
| Zero-shot Inference | 67.42 | 54.92 | 63.43 |
| Fine-tuning (Single-image only) | 67.80 | 54.74 | 63.64 |
| Fine-tuning (Multi-image only) | 67.30 | 55.47 | 63.52 |
| Fine-tuning (Combined) | 68.53 | 55.62 | 64.41 |
Key Findings
The strongest VLMs reach ANLS scores in the mid-60s to low 70s on single-image questions, performing reliably on span-based answer types. However, performance drops by 12 to 32 ANLS points on multi-image questions, with Cross-image Synthesis and Non-Span questions proving especially difficult. Non-extractive reasoning also remains a consistent challenge, indicating that current models lack robust mechanisms for layout-aware arithmetic and multi-step inference.
Fine-tuning on combined Single-image and Multi-image data consistently yields the best overall performance, highlighting the complementary roles of both supervision types in building comprehensive infographic understanding.
Resources
The dataset and code are publicly available:
- GitHub: https://github.com/duongtruongbinh/ViInfographicVQA
- HuggingFace Dataset: https://huggingface.co/datasets/duytranus/ViInfographicVQA
This work was supported by AI VIETNAM. For inquiries, please contact the corresponding author at [email protected].