:: Library Catalog

Bibliographic Details
Main Authors:	Wei, Haoran; Sun, Yaofeng; Li, Yukun
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.20552

Similar Items

DeepSeek-OCR: Contexts Optical Compression
by: Wei, Haoran, et al.
Published: (2025)

Visual Merit or Linguistic Crutch? A Close Look at DeepSeek-OCR
by: Liang, Yunhao, et al.
Published: (2026)

Fine-tuning DeepSeek-OCR-2 for Molecular Structure Recognition
by: Tang, Haocheng, et al.
Published: (2026)

DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
by: Wu, Zhiyu, et al.
Published: (2024)

DeepSeek on a Trip: Inducing Targeted Visual Hallucinations via Representation Vulnerabilities
by: Islam, Chashi Mahiul, et al.
Published: (2025)

Can LLMs Assist Computer Education? an Empirical Case Study of DeepSeek
by: Xiao, Dongfu, et al.
Published: (2025)

Seek-CAD: A Self-refined Generative Modeling for 3D Parametric CAD Using Local Inference via DeepSeek
by: Li, Xueyang, et al.
Published: (2025)

General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
by: Wei, Haoran, et al.
Published: (2024)

From ChatGPT to DeepSeek AI: A Comprehensive Analysis of Evolution, Deviation, and Future Implications in AI-Language Models
by: Singh, Simrandeep, et al.
Published: (2025)

MHA2MLA-VLM: Enabling DeepSeek's Economical Multi-Head Latent Attention across Vision-Language Models
by: Fan, Xiaoran, et al.
Published: (2026)

Can DeepSeek Reason Like a Surgeon? An Empirical Evaluation for Vision-Language Understanding in Robotic-Assisted Surgery
by: Ma, Boyi, et al.
Published: (2025)

DeepSeek-Inspired Exploration of RL-based LLMs and Synergy with Wireless Networks: A Survey
by: Qiao, Yu, et al.
Published: (2025)

Flow4Agent: Long-form Video Understanding via Motion Prior from Optical Flow
by: Liu, Ruyang, et al.
Published: (2025)

Comparative Analysis of OpenAI GPT-4o and DeepSeek R1 for Scientific Text Categorization Using Prompt Engineering
by: Maiti, Aniruddha, et al.
Published: (2025)

OmniForcing: Unleashing Real-time Joint Audio-Visual Generation
by: Su, Yaofeng, et al.
Published: (2026)

When Good OCR Is Not Enough: Benchmarking OCR Robustness for Retrieval-Augmented Generation
by: Sun, Lin, et al.
Published: (2026)

olmOCR 2: Unit Test Rewards for Document OCR
by: Poznanski, Jake, et al.
Published: (2025)

Hierarchical Visual Feature Aggregation for OCR-Free Document Understanding
by: Park, Jaeyoo, et al.
Published: (2024)

OCR-Agent: Agentic OCR with Capability and Memory Reflection
by: Wen, Shimin, et al.
Published: (2026)

OmniOCR: Generalist OCR for Ethnic Minority Languages
by: Liu, Bonan, et al.
Published: (2026)

Q-Mask: Query-driven Causal Masks for Text Anchoring in OCR-Oriented Vision-Language Models
by: Xu, Longwei, et al.
Published: (2026)

KG-ViP: Bridging Knowledge Grounding and Visual Perception in Multi-modal LLMs for Visual Question Answering
by: Li, Zhiyang, et al.
Published: (2026)

FinCriticalED: A Visual Benchmark for Financial Fact-Level OCR
by: He, Yueru, et al.
Published: (2025)

OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
by: Zhang, Junyuan, et al.
Published: (2024)

ABot-OCR Technical Report
by: Jiang, Kaitao, et al.
Published: (2026)

EFF-Grasp: Energy-Field Flow Matching for Physics-Aware Dexterous Grasp Generation
by: Zhao, Yukun, et al.
Published: (2026)

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
by: Shi, Yang, et al.
Published: (2025)

Seeking and Updating with Live Visual Knowledge
by: Fu, Mingyang, et al.
Published: (2025)

Ocean-OCR: Towards General OCR Application via a Vision-Language Model
by: Chen, Song, et al.
Published: (2025)

Context-Independent OCR with Multimodal LLMs: Effects of Image Resolution and Visual Complexity
by: Inoue, Kotaro
Published: (2025)

DEVICE: Depth and Visual Concepts Aware Transformer for OCR-based Image Captioning
by: Xu, Dongsheng, et al.
Published: (2023)

FireRed-OCR Technical Report
by: Wu, Hao, et al.
Published: (2026)

Adversarial Training with OCR Modality Perturbation for Scene-Text Visual Question Answering
by: Shen, Zhixuan, et al.
Published: (2024)

Agentar-Fin-OCR
by: Qian, Siyi, et al.
Published: (2026)

CC-OCR: A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy
by: Yang, Zhibo, et al.
Published: (2024)

Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry
by: Kurt, Yunus Bilge, et al.
Published: (2024)

An Automated Deep Segmentation and Spatial-Statistics Approach for Post-Blast Rock Fragmentation Assessment
by: Yang, Yukun
Published: (2025)

The Devil is in the Details -- From OCR for Old Church Slavonic to Purely Visual Stemma Reconstruction
by: Hoenen, Armin
Published: (2026)

Counterfeit Answers: Adversarial Forgery against OCR-Free Document Visual Question Answering
by: Pintore, Marco, et al.
Published: (2025)

Flow to the Mode: Mode-Seeking Diffusion Autoencoders for State-of-the-Art Image Tokenization
by: Sargent, Kyle, et al.
Published: (2025)