ViInfographicVQA - Our paper has been accepted!

Dương Trường Bình (Truong-Binh Duong)


ViInfographicVQA: A New Benchmark for Vietnamese Infographic Visual Question Answering

AI VIETNAM Lab | December 2025


We are excited to share our latest research, ViInfographicVQA, the first benchmark dedicated to Visual Question Answering (VQA) on Vietnamese infographics. This work addresses a long-standing gap in multilingual multimodal research, where existing benchmarks remain predominantly English-centric and largely limited to natural images or simple scene text.

Motivation

Infographics are a uniquely challenging medium. They integrate text, charts, maps, icons, and structured layouts into a single visual, demanding that models simultaneously perform OCR, layout understanding, and numerical or semantic reasoning. Despite growing interest in document-oriented VQA, no prior benchmark targeted this problem for Vietnamese, let alone for tasks requiring reasoning across multiple related infographics.

Dataset Overview

ViInfographicVQA is built from real-world infographics sourced from infographics.vn, covering topics such as Economics, Healthcare, Culture and Society, Disaster and Accident, and Sports and Arts.

The dataset contains 6,747 infographics and 20,409 human-verified question-answer pairs, organized into two complementary evaluation settings:

Single-image task follows the standard VQA setup, where each question is answered using a single infographic. Answer sources are categorized as Image-span, Question-span, Multi-span, and Non-extractive.

Multi-image task requires synthesizing evidence across multiple semantically related infographics. This is, to our knowledge, the first Vietnamese benchmark evaluating cross-image reasoning in VQA. Multi-image questions are categorized as Multi-image Span, Cross-image Synthesis, and Non-Span.
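
To make the two settings concrete, the sketch below shows what an annotation record might look like in each. The field names are illustrative assumptions, not the dataset's actual schema; only the answer-source and question-type category labels come from the paper.

```python
# Hypothetical record layouts for the two evaluation settings.
# Field names are invented for illustration; consult the released
# dataset for the real schema. Category labels match the paper.
single_image_example = {
    "image": "infographic_00123.png",
    "question": "...",              # question text in Vietnamese
    "answers": ["..."],             # human-verified gold answer(s)
    # one of: Image-span, Question-span, Multi-span, Non-extractive
    "answer_source": "Image-span",
}

multi_image_example = {
    # a semantically related group of infographics
    "images": ["infographic_00456.png", "infographic_00789.png"],
    "question": "...",
    "answers": ["..."],
    # one of: Multi-image Span, Cross-image Synthesis, Non-Span
    "question_type": "Cross-image Synthesis",
}
```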

Subset                  Images   Groups   QA Pairs
Single-image (train)     1,788        -     12,521
Single-image (test)        192        -      1,374
Multi-image (train)      4,823    2,090      5,878
Multi-image (test)         509      218        636
Total train              6,096    2,090     18,399
Total test                 651      218      2,010

Dataset Construction

The pipeline consists of three stages. In the Data Curation stage, infographics are collected and filtered by geometry constraints (aspect ratio within [0.33, 3.00], shorter side at least 512 pixels), then grouped into semantically coherent sets using multimodal embeddings combining OCR text, visual descriptors, and layout signals.
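
The geometry constraints from the curation stage can be expressed directly; this is a minimal standalone check (the actual pipeline presumably reads dimensions from the image files):

```python
def passes_geometry_filter(width: int, height: int) -> bool:
    """Apply the paper's curation constraints: aspect ratio within
    [0.33, 3.00] and shorter side at least 512 pixels."""
    if min(width, height) < 512:
        return False
    aspect = width / height
    return 0.33 <= aspect <= 3.00

# A 600x1500 portrait infographic (ratio 0.40, shorter side 600) is kept;
# a 400x1200 image fails the shorter-side constraint;
# a 2000x600 banner (ratio 3.33) fails the aspect-ratio constraint.
```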

In the Data Generation stage, a VLM-assisted pipeline (Gemini 2.0 Flash) extracts meaningful regions from each infographic and generates structured QA pairs using rule-based prompt templates that pair element types with answer-source categories. For multi-image questions, grouped infographics are used to elicit questions that genuinely require cross-image evidence.
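
As a toy illustration of rule-based templating, pairing an element type with an answer-source category could select a prompt skeleton to be filled from the extracted region. The template strings and field names below are invented for illustration and are not the paper's actual prompts:

```python
# Toy sketch of template selection keyed by (element type, answer-source
# category). All template text here is invented for illustration.
TEMPLATES = {
    ("chart", "Image-span"): "What value does the chart report for {label}?",
    ("chart", "Non-extractive"): "By how much did {label} change over the period shown?",
    ("text_block", "Question-span"): "Is the figure for {label} above or below {threshold}?",
}

def make_question(element_type: str, answer_source: str, **fields) -> str:
    """Fill the template selected for this element/category pair."""
    template = TEMPLATES[(element_type, answer_source)]
    return template.format(**fields)
```

For example, `make_question("chart", "Image-span", label="GDP growth")` yields "What value does the chart report for GDP growth?".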

In the Data Refinement stage, automated VLM validation is followed by human review on a dedicated annotation platform, ensuring linguistic naturalness, semantic faithfulness, and answer consistency.

Experimental Results

We evaluate a range of recent open-source vision-language models (VLMs) using Average Normalized Levenshtein Similarity (ANLS) as the primary metric.
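
For readers unfamiliar with the metric, here is a minimal ANLS implementation following the common definition used in DocVQA-style benchmarks (threshold tau = 0.5, case-insensitive matching); the paper does not spell out its exact normalization, so details may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (one rolling row)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def anls(prediction: str, gold_answers: list[str], tau: float = 0.5) -> float:
    """Score one prediction against its gold answers; the dataset-level
    ANLS is the mean of this score over all questions."""
    best = 0.0
    for gold in gold_answers:
        p, g = prediction.strip().lower(), gold.strip().lower()
        nl = levenshtein(p, g) / max(len(p), len(g), 1)
        if nl < tau:            # answers too dissimilar score 0
            best = max(best, 1.0 - nl)
    return best
```

For instance, `anls("12%", ["12 %"])` gives 0.75, while a completely different answer scores 0.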

Single-image Task

Model                       Overall   Image-span   Multi-span   Non-extractive   Question-span
Phi-4-multimodal-5B           41.91        45.36        23.90            39.83           58.84
VideoLLaMA3 Image-7B          49.60        55.38        35.52            44.16           63.39
MiniCPM-o2.6-8B               57.03        62.52        44.63            53.76           67.20
Qwen2.5-VL-7B                 67.42        73.54        50.39            65.63           80.14
Qwen2.5-VL-7B (Finetuned)     67.80        72.69        53.89            63.93           81.10
InternVL3.5-8B                67.02        73.31        49.30            65.70           79.76
Ovis2.5-9B                    71.02        78.19        61.42            61.24           83.21

Multi-image Task

Model                       Overall   Cross-image Synthesis   Multi-image Span   Non-Span
Phi-4-multimodal-5B           20.95                   20.93              24.59      14.60
InternVL3.5-8B                35.50                   10.97              51.72      24.67
MiniCPM-o2.6-8B               40.55                   35.22              48.74      30.05
Qwen2.5-VL-7B                 54.92                   56.34              58.33      47.96
Qwen2.5-VL-7B (Finetuned)     55.47                   56.55              58.91      47.68

Fine-tuning Ablation (Qwen2.5-VL-7B)

Training Configuration             Single-image   Multi-image   Combined
Zero-shot Inference                       67.42         54.92      63.43
Fine-tuning (Single-image only)           67.80         54.74      63.64
Fine-tuning (Multi-image only)            67.30         55.47      63.52
Fine-tuning (Combined)                    68.53         55.62      64.41
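
The Combined column is consistent with averaging the two task scores weighted by their test-set sizes (1,374 single-image and 636 multi-image QA pairs). A quick check, assuming that weighting:

```python
def combined_anls(single: float, multi: float,
                  n_single: int = 1374, n_multi: int = 636) -> float:
    """Weight each task's ANLS by its number of test QA pairs."""
    return (single * n_single + multi * n_multi) / (n_single + n_multi)

# Zero-shot row: 67.42 single-image, 54.92 multi-image -> ~63.46,
# matching the reported 63.43 up to rounding.
print(round(combined_anls(67.42, 54.92), 2))
```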

Key Findings

The strongest open-source VLMs reach roughly 67 to 71 ANLS on Single-image questions and perform reliably on span-based answer types. However, every model loses 12 to 32 ANLS points on the Multi-image task, where the Non-Span and Cross-image Synthesis categories prove hardest. Non-extractive reasoning also remains a consistent challenge, indicating that current pipelines lack robust mechanisms for layout-aware arithmetic and multi-step inference.

Fine-tuning on combined Single-image and Multi-image data consistently yields the best overall performance, highlighting the complementary roles of both supervision types in building comprehensive infographic understanding.

Resources

The dataset and code are publicly available.
This work was supported by AI VIETNAM. For inquiries, please contact the corresponding author at [email protected].
