| Main Authors: | Fu, Ling, Kuang, Zhebin, Song, Jiajun, Huang, Mingxin, Yang, Biao, Li, Yuzhe, Zhu, Linghao, Luo, Qidi, Wang, Xinyu, Lu, Hao, Li, Zhang, Tang, Guozhi, Shan, Bin, Lin, Chunhui, Liu, Qi, Wu, Binghong, Feng, Hao, Liu, Hao, Huang, Can, Tang, Jingqun, Chen, Wei, Jin, Lianwen, Liu, Yuliang, Bai, Xiang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.00321 |
Similar Items
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
by: Liu, Yuliang, et al.
Published: (2023)
Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
by: Zhao, Zhen, et al.
Published: (2023)
Harmonizing Visual Text Comprehension and Generation
by: Zhao, Zhen, et al.
Published: (2024)
DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
by: Feng, Hao, et al.
Published: (2023)
Bridging the Gap Between End-to-End and Two-Step Text Spotting
by: Huang, Mingxin, et al.
Published: (2024)
Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
by: Feng, Hao, et al.
Published: (2025)
TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
by: Zhao, Weichao, et al.
Published: (2024)
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid
by: Huang, Mingxin, et al.
Published: (2024)
TextSquare: Scaling up Text-Centric Visual Instruction Tuning
by: Tang, Jingqun, et al.
Published: (2024)
A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
by: Lu, Jinghui, et al.
Published: (2024)
VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
by: Liu, Yuliang, et al.
Published: (2024)
SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting
by: Huang, Mingxin, et al.
Published: (2024)
An open dataset for the evolution of oracle bone characters: EVOBC
by: Guan, Haisu, et al.
Published: (2024)
Progressive Evolution from Single-Point to Polygon for Scene Text
by: Deng, Linger, et al.
Published: (2023)
MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
by: Shan, Bin, et al.
Published: (2024)
An open dataset for oracle bone script recognition and decipherment
by: Wang, Pengjie, et al.
Published: (2024)
WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
by: Wang, An-Lan, et al.
Published: (2025)
Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
by: Lu, Jinghui, et al.
Published: (2025)
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
by: Zhu, Hanshen, et al.
Published: (2026)
MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
by: Tang, Jingqun, et al.
Published: (2024)
Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting
by: Feng, Hao, et al.
Published: (2026)
ParGo: Bridging Vision-Language with Partial and Global Views
by: Wang, An-Lan, et al.
Published: (2024)
The Effect of Continuous Casting Cooling Process on the Surface Quality of Low‐Nickel Austenitic Stainless Steel
by: Xianbang Dong, et al.
Published: (2025)
Does Bank Going Public Affect the Borrowers' ESG Performance? Evidence From a Quasi‐Natural Experiment in China
by: Hao Huang, et al.
Published: (2026)
Do environmental, social, and governance disclosure assurance reduce the cost of equity capital? Evidence from Chinese listed financial institutions
by: Hao Huang, et al.
Published: (2025)
ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining
by: Peng, Dezhi, et al.
Published: (2023)
Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering
by: Maryam, Hiba, et al.
Published: (2024)
Bifurcated Generative Flow Networks
by: Li, Chunhui, et al.
Published: (2024)
Low Overhead Beam Alignment for Mobile Millimeter Channel Based on Continuous-Time Prediction
by: Lin, Huang-Chou, et al.
Published: (2023)
3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
by: Tang, Hao, et al.
Published: (2026)
Privacy-Preserving Biometric Verification with Handwritten Random Digit String
by: Zhang, Peirong, et al.
Published: (2025)
Advancing Sequential Numerical Prediction in Autoregressive Models
by: Fei, Xiang, et al.
Published: (2025)
Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior
by: Wu, Yiqian, et al.
Published: (2024)
CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions
by: Zhan, Yuliang, et al.
Published: (2026)
L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
by: Zhan, Yuliang, et al.
Published: (2025)
Hierarchical Side-Tuning for Vision Transformers
by: Lin, Weifeng, et al.
Published: (2023)
Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
by: Luo, Chuwei, et al.
Published: (2022)
Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction
by: Wang, Pengjie, et al.
Published: (2024)
Deciphering Oracle Bone Language with Diffusion Models
by: Guan, Haisu, et al.
Published: (2024)
WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark
by: Zhang, Chunhui, et al.
Published: (2024)