| Main Authors: | Qian, Kun, Li, Wenjie, Sun, Tianyu, Wang, Wenhong, Luo, Wenhan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2508.07021 |
Similar Items
Exploring Large Vision-Language Models for Robust and Efficient Industrial Anomaly Detection
by: Qian, Kun, et al.
Published: (2024)
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
by: Yu, Wenwen, et al.
Published: (2025)
DocKylin: A Large Multimodal Model for Visual Document Understanding with Efficient Visual Slimming
by: Zhang, Jiaxin, et al.
Published: (2024)
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
by: Feng, Hao, et al.
Published: (2023)
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Language Models
by: Xia, Renqiu, et al.
Published: (2024)
Optimizing Psychological Counseling with Instruction-Tuned Large Language Models
by: Li, Wenjie, et al.
Published: (2024)
DocSLM: A Small Vision-Language Model for Long Multimodal Document Understanding
by: Hannan, Tanveer, et al.
Published: (2025)
MeDocVL: A Visual Language Model for Medical Document Understanding and Parsing
by: Wang, Wenjie, et al.
Published: (2026)
DocLens : A Tool-Augmented Multi-Agent Framework for Long Visual Document Understanding
by: Zhu, Dawei, et al.
Published: (2025)
SimpleDoc: Multi-Modal Document Understanding with Dual-Cue Page Retrieval and Iterative Refinement
by: Jain, Chelsi, et al.
Published: (2025)
SynthDoc: Bilingual Documents Synthesis for Visual Document Understanding
by: Ding, Chuanghao, et al.
Published: (2024)
MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
by: Chen, Ketong, et al.
Published: (2025)
DocScope: Benchmarking Verifiable Reasoning for Trustworthy Long-Document Understanding
by: Feng, Xiang, et al.
Published: (2026)
ShaDocFormer: A Shadow-Attentive Threshold Detector With Cascaded Fusion Refiner for Document Shadow Removal
by: Chen, Weiwen, et al.
Published: (2023)
DocShaDiffusion: Diffusion Model in Latent Space for Document Image Shadow Removal
by: Liu, Wenjie, et al.
Published: (2025)
Object Recognition from Scientific Document based on Compartment Refinement Framework
by: Li, Jinghong, et al.
Published: (2023)
DocLayLLM: An Efficient Multi-modal Extension of Large Language Models for Text-rich Document Understanding
by: Liao, Wenhui, et al.
Published: (2024)
PP-DocBee: Improving Multimodal Document Understanding Through a Bag of Tricks
by: Ni, Feng, et al.
Published: (2025)
PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
by: Huang, Kui, et al.
Published: (2025)
MMLongBench-Doc: Benchmarking Long-context Document Understanding with Visualizations
by: Ma, Yubo, et al.
Published: (2024)
Large Content And Behavior Models To Understand, Simulate, And Optimize Content And Behavior
by: Khandelwal, Ashmit, et al.
Published: (2023)
DocKD: Knowledge Distillation from LLMs for Open-World Document Understanding Models
by: Kim, Sungnyun, et al.
Published: (2024)
ReMoRa: Multimodal Large Language Model based on Refined Motion Representation for Long-Video Understanding
by: Yashima, Daichi, et al.
Published: (2026)
DocAtlas: Multilingual Document Understanding Across 80+ Languages
by: Heakl, Ahmed, et al.
Published: (2026)
DocDeshadower: Frequency-Aware Transformer for Document Shadow Removal
by: Zhou, Ziyang, et al.
Published: (2023)
DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding
by: Xiong, Junyu, et al.
Published: (2025)
DocCogito: Aligning Layout Cognition and Step-Level Grounded Reasoning for Document Understanding
by: Wu, Yuchuan, et al.
Published: (2026)
PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction
by: Sun, Ting, et al.
Published: (2025)
Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning
by: Mo, Ye, et al.
Published: (2025)
Cross-Lingual SynthDocs: A Large-Scale Synthetic Corpus for Any to Arabic OCR and Document Understanding
by: Al-Homoud, Haneen, et al.
Published: (2025)
DocVideoQA: Towards Comprehensive Understanding of Document-Centric Videos through Question Answering
by: Wang, Haochen, et al.
Published: (2025)
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
by: Wang, An-Lan, et al.
Published: (2025)
DocRetriever: A Plug-and-Play Framework for Multimodal Document Retrieval with Comprehensive Benchmark
by: Hu, Ruofan, et al.
Published: (2026)
A Versatile Multimodal Agent for Multimedia Content Generation
by: Zhang, Daoan, et al.
Published: (2026)
CogDoc: Towards Unified thinking in Documents
by: Xu, Qixin, et al.
Published: (2025)
Q-Doc: Benchmarking Document Image Quality Assessment Capabilities in Multi-modal Large Language Models
by: Huang, Jiaxi, et al.
Published: (2025)
OrthoDoc: Multimodal Large Language Model for Assisting Diagnosis in Computed Tomography
by: Jin, Youzhu, et al.
Published: (2024)
InstructDoc: A Dataset for Zero-Shot Generalization of Visual Document Understanding with Instructions
by: Tanaka, Ryota, et al.
Published: (2024)
DocVCE: Diffusion-based Visual Counterfactual Explanations for Document Image Classification
by: Saifullah, Saifullah, et al.
Published: (2025)
mPLUG-DocOwl 1.5: Unified Structure Learning for OCR-free Document Understanding
by: Hu, Anwen, et al.
Published: (2024)