| Main Authors: | Wu, Tong, Markchom, Thanet |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2601.03073 |
Similar Items
CRRG-CLIP: Automatic Generation of Chest Radiology Reports and Classification of Chest Radiographs
by: Xu, Jianfei, et al.
Published: (2024)
Art2Music: Generating Music for Art Images with Multi-modal Feeling Alignment
by: Hong, Jiaying, et al.
Published: (2025)
SurveillanceVQA-589K: A Benchmark for Comprehensive Surveillance Video-Language Understanding with Large Models
by: Liu, Bo, et al.
Published: (2025)
MTabVQA: Evaluating Multi-Tabular Reasoning of Language Models in Visual Space
by: Singh, Anshul, et al.
Published: (2025)
GRAM: Global Reasoning for Multi-Page VQA
by: Blau, Tsachi, et al.
Published: (2024)
Embodied Scene Understanding for Vision Language Models via MetaVQA
by: Wang, Weizhen, et al.
Published: (2025)
NCL-UoR at SemEval-2026 Task 5: Embedding-Based Methods, Fine-Tuning, and LLMs for Word Sense Plausibility Rating
by: Wu, Tong, et al.
Published: (2026)
ToonCrafter: Generative Cartoon Interpolation
by: Xing, Jinbo, et al.
Published: (2024)
Open3D-VQA: A Benchmark for Comprehensive Spatial Reasoning with Multimodal Large Language Model in Open Space
by: Zhang, Weichen, et al.
Published: (2025)
PuzzleVQA: Diagnosing Multimodal Reasoning Challenges of Language Models with Abstract Visual Patterns
by: Chia, Yew Ken, et al.
Published: (2024)
ReasonVQA: A Multi-hop Reasoning Benchmark with Structural Knowledge for Visual Question Answering
by: Tran, Duong T., et al.
Published: (2025)
Check Field Detection Agent (CFD-Agent) using Multimodal Large Language and Vision Language Models
by: Halder, Sourav, et al.
Published: (2025)
SURE-VQA: Systematic Understanding of Robustness Evaluation in Medical VQA Tasks
by: Kahl, Kim-Celine, et al.
Published: (2024)
Knowledge Condensation and Reasoning for Knowledge-based VQA
by: Hao, Dongze, et al.
Published: (2024)
MedFrameQA: A Multi-Image Medical VQA Benchmark for Clinical Reasoning
by: Yu, Suhao, et al.
Published: (2025)
MicroVQA++: High-Quality Microscopy Reasoning Dataset with Weakly Supervised Graphs for Multimodal Large Language Model
by: Li, Manyu, et al.
Published: (2025)
HandVQA: Diagnosing and Improving Fine-Grained Spatial Reasoning about Hands in Vision-Language Models
by: Sayem, MD Khalequzzaman Chowdhury, et al.
Published: (2026)
Multi-Task Learning for Visually Grounded Reasoning in Gastrointestinal VQA
by: Safwan, Itbaan, et al.
Published: (2025)
WorldVQA: Measuring Atomic World Knowledge in Multimodal Large Language Models
by: Zhou, Runjie, et al.
Published: (2026)
UNK-VQA: A Dataset and a Probe into the Abstention Ability of Multi-modal Large Models
by: Guo, Yangyang, et al.
Published: (2023)
Towards Efficient and Robust VQA-NLE Data Generation with Large Vision-Language Models
by: Irawan, Patrick Amadeus, et al.
Published: (2024)
NegVQA: Can Vision Language Models Understand Negation?
by: Zhang, Yuhui, et al.
Published: (2025)
Multimodal Retrieval-Augmented Generation with Large Language Models for Medical VQA
by: Karim, A H M Rezaul, et al.
Published: (2025)
Muffin or Chihuahua? Challenging Multimodal Large Language Models with Multipanel VQA
by: Fan, Yue, et al.
Published: (2024)
Sakuga-42M Dataset: Scaling Up Cartoon Research
by: Pan, Zhenglin
Published: (2024)
Toon3D: Seeing Cartoons from New Perspectives
by: Weber, Ethan, et al.
Published: (2024)
Learning to Incorporate Texture Saliency Adaptive Attention to Image Cartoonization
by: Gao, Xiang, et al.
Published: (2022)
Semantic-Preserving Cross-Style Visual Reasoning for Robust Multi-Modal Understanding in Large Vision-Language Models
by: Nakayama, Aya, et al.
Published: (2025)
VaseVQA: Multimodal Agent and Benchmark for Ancient Greek Pottery
by: Ge, Jinchao, et al.
Published: (2025)
CartoonAlive: Towards Expressive Live2D Modeling from Single Portraits
by: He, Chao, et al.
Published: (2025)
Think-as-You-See: Streaming Chain-of-Thought Reasoning for Large Vision-Language Models
by: Zhang, Jialiang, et al.
Published: (2026)
DiN: Diffusion Model for Robust Medical VQA with Semantic Noisy Labels
by: Guo, Erjian, et al.
Published: (2025)
PhysAnimator: Physics-Guided Generative Cartoon Animation
by: Xie, Tianyi, et al.
Published: (2025)
Enhancing Video Large Language Models with Structured Multi-Video Collaborative Reasoning
by: He, Zhihao, et al.
Published: (2025)
Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
by: Monon, Mashrafi, et al.
Published: (2026)
LENS: Multi-level Evaluation of Multimodal Reasoning with Large Language Models
by: Yao, Ruilin, et al.
Published: (2025)
IRIS: Intent Resolution via Inference-time Saccades for Open-Ended VQA in Large Vision-Language Models
by: Madinei, Parsa, et al.
Published: (2026)
Improving Automatic VQA Evaluation Using Large Language Models
by: Mañas, Oscar, et al.
Published: (2023)
Dynamic Orchestration of Multi-Agent System for Real-World Multi-Image Agricultural VQA
by: Ke, Yan, et al.
Published: (2025)
R^3-VQA: "Read the Room" by Video Social Reasoning
by: Niu, Lixing, et al.
Published: (2025)