:: Library Catalog

Bibliographic Details
Main Authors:	Qian, Kun; Li, Wenjie; Sun, Tianyu; Wang, Wenhong; Luo, Wenhan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2508.07021

Similar Items

Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection
by: Qian, Kun, et al.
Published: (2024)

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
by: Yu, Wenwen, et al.
Published: (2025)

DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
by: Zhang, Jiaxin, et al.
Published: (2024)

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
by: Feng, Hao, et al.
Published: (2023)

DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
by: Xia, Renqiu, et al.
Published: (2024)

Optimizing Psychological Counseling with Instruction-Tuned Large Language Models
by: Li, Wenjie, et al.
Published: (2024)

DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
by: Hannan, Tanveer, et al.
Published: (2025)

MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing
by: Wang, Wenjie, et al.
Published: (2026)

DocLens: A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
by: Zhu, Dawei, et al.
Published: (2025)

SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
by: Jain, Chelsi, et al.
Published: (2025)

SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
by: Ding, Chuanghao, et al.
Published: (2024)

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
by: Chen, Ketong, et al.
Published: (2025)

DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding
by: Feng, Xiang, et al.
Published: (2026)

ShaDocFormer: A Shadow-Attentive Threshold Detector With Cascaded Fusion Refiner for Document Shadow Removal
by: Chen, Weiwen, et al.
Published: (2023)

DocShaDiffusion: Diffusion Model in Latent Space for Document Image Shadow Removal
by: Liu, Wenjie, et al.
Published: (2025)

Object Recognition from Scientific Document based on Compartment Refinement Framework
by: Li, Jinghong, et al.
Published: (2023)

DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
by: Liao, Wenhui, et al.
Published: (2024)

PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
by: Ni, Feng, et al.
Published: (2025)

PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
by: Huang, Kui, et al.
Published: (2025)

MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
by: Ma, Yubo, et al.
Published: (2024)

Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
by: Khandelwal, Ashmit, et al.
Published: (2023)

DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
by: Kim, Sungnyun, et al.
Published: (2024)

ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
by: Yashima, Daichi, et al.
Published: (2026)

DocAtlas: Multilingual Document Understanding Across 80+ Languages
by: Heakl, Ahmed, et al.
Published: (2026)

DocDeshadower: Frequency-Aware Transformer for Document Shadow Removal
by: Zhou, Ziyang, et al.
Published: (2023)

DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding
by: Xiong, Junyu, et al.
Published: (2025)

DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding
by: Wu, Yuchuan, et al.
Published: (2026)

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction
by: Sun, Ting, et al.
Published: (2025)

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning
by: Mo, Ye, et al.
Published: (2025)

Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding
by: Al-Homoud, Haneen, et al.
Published: (2025)

DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
by: Wang, Haochen, et al.
Published: (2025)

WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
by: Wang, An-Lan, et al.
Published: (2025)

DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
by: Hu, Ruofan, et al.
Published: (2026)

A Versatile Multimodal Agent for Multimedia Content Generation
by: Zhang, Daoan, et al.
Published: (2026)

CogDoc: Towards Unified thinking in Documents
by: Xu, Qixin, et al.
Published: (2025)

Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
by: Huang, Jiaxi, et al.
Published: (2025)

OrthoDoc: Multimodal Large Language Model for Assisting Diagnosis in Computed Tomography
by: Jin, Youzhu, et al.
Published: (2024)

InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
by: Tanaka, Ryota, et al.
Published: (2024)

DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification
by: Saifullah, Saifullah, et al.
Published: (2025)

mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
by: Hu, Anwen, et al.
Published: (2024)