| Main Authors: | Gong, Weile, Zuo, Yiping, Lu, Zijian, He, Xin, Fan, Weibei, Qi, Lianyong, Jin, Shi |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.19790 |
Similar Items
ContractSkill: Repairable Contract-Based Skills for Multimodal Web Agents
by: Lu, Zijian, et al.
Published: (2026)
Ocean-OCR: Towards General OCR Application via a Vision-Language Model
by: Chen, Song, et al.
Published: (2025)
Fluid Antenna-enabled Integrated Sensing, Communication, and Computing Systems
by: Zuo, Yiping, et al.
Published: (2024)
VLPrompt: Vision-Language Prompting for Panoptic Scene Graph Generation
by: Zhou, Zijian, et al.
Published: (2023)
AtlasOCR: Building the First Open-Source Darija OCR Model with Vision Language Models
by: Momayiz, Imane, et al.
Published: (2026)
VLIPP: Towards Physically Plausible Video Generation with Vision and Language Informed Physical Prior
by: Yang, Xindi, et al.
Published: (2025)
From Seeing to Predicting: A Vision-Language Framework for Trajectory Forecasting and Controlled Video Generation
by: Yang, Fan, et al.
Published: (2025)
Error Patterns in Historical OCR: A Comparative Analysis of TrOCR and a Vision-Language Model
by: Vesalainen, Ari, et al.
Published: (2026)
THOM: Generating Physically Plausible Hand-Object Meshes From Text
by: Jeong, Uyoung, et al.
Published: (2026)
LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR
by: Taghadouini, Said, et al.
Published: (2026)
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models
by: Zhong, Yufeng, et al.
Published: (2026)
Generate, but Verify: Reducing Hallucination in Vision-Language Models with Retrospective Resampling
by: Wu, Tsung-Han, et al.
Published: (2025)
SCAResNet: A ResNet Variant Optimized for Tiny Object Detection in Transmission and Distribution Towers
by: Li, Weile, et al.
Published: (2024)
Vision-Centric Activation and Coordination for Multimodal Large Language Models
by: Wang, Yunnan, et al.
Published: (2025)
Baseer: A Vision-Language Model for Arabic Document-to-Markdown OCR
by: Hennara, Khalil, et al.
Published: (2025)
Text Promptable Surgical Instrument Segmentation with Vision-Language Models
by: Zhou, Zijian, et al.
Published: (2023)
Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
by: Xu, Longwei, et al.
Published: (2026)
DianJin-OCR-R1: Enhancing OCR Capabilities via a Reasoning-and-Tool Interleaved Vision-Language Model
by: Chen, Qian, et al.
Published: (2025)
Vision Foundation Models as Generalist Tokenizers for Image Generation
by: Zheng, Anlin, et al.
Published: (2026)
PAMD: Plausibility-Aware Motion Diffusion Model for Long Dance Generation
by: Wang, Hongsong, et al.
Published: (2025)
OrthoPhys: Physically Plausible Video Generation with Orthogonal-View Geometry Guidance
by: Wang, Cong, et al.
Published: (2026)
DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge
by: Zhang, Wenyao, et al.
Published: (2025)
Action Draft and Verify: A Self-Verifying Framework for Vision-Language-Action Model
by: Zhao, Chen, et al.
Published: (2026)
TextHawk2: A Large Vision-Language Model Excels in Bilingual OCR and Grounding with 16x Fewer Tokens
by: Yu, Ya-Qi, et al.
Published: (2024)
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
by: Zhang, Junyuan, et al.
Published: (2024)
OmniOCR: Generalist OCR for Ethnic Minority Languages
by: Liu, Bonan, et al.
Published: (2026)
From Generated Human Videos to Physically Plausible Robot Trajectories
by: Ni, James, et al.
Published: (2025)
HMVLM: Multistage Reasoning-Enhanced Vision-Language Model for Long-Tailed Driving Scenarios
by: Wang, Daming, et al.
Published: (2025)
Finetuning Vision-Language Models as OCR Systems for Low-Resource Languages: A Case Study of Manchu
by: Chung, Yan Hon Michael, et al.
Published: (2025)
FantasyVLN: Unified Multimodal Chain-of-Thought Reasoning for Vision-Language Navigation
by: Zuo, Jing, et al.
Published: (2026)
MMPhysVideo: Scaling Physical Plausibility in Video Generation via Joint Multimodal Modeling
by: Lin, Shubo, et al.
Published: (2026)
PreP-OCR: A Complete Pipeline for Document Image Restoration and Enhanced OCR Accuracy
by: Guan, Shuhao, et al.
Published: (2025)
Vision-Language Models for Vision Tasks: A Survey
by: Zhang, Jingyi, et al.
Published: (2023)
Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models
by: Lu, Zhihe, et al.
Published: (2023)
MorphGen: Controllable and Morphologically Plausible Generative Cell-Imaging
by: Demirel, Berker, et al.
Published: (2025)
MedGround: Bridging the Evidence Gap in Medical Vision-Language Models with Verified Grounding Data
by: Zhang, Mengmeng, et al.
Published: (2026)
Seeing is Believing? Mitigating OCR Hallucinations in Multimodal Large Language Models
by: He, Zhentao, et al.
Published: (2025)
Fréchet Denoised Distance: Enhancing Plausibility Evaluation for Generated Designs with Denoising Autoencoder
by: Fan, Jiajie, et al.
Published: (2024)
OCR-Agent: Agentic OCR with Capability and Memory Reflection
by: Wen, Shimin, et al.
Published: (2026)
Efficient and Long-Tailed Generalization for Pre-trained Vision-Language Model
by: Shi, Jiang-Xin, et al.
Published: (2024)