:: Library Catalog

Bibliographic Details
Main Authors:	Fu, Ling; Kuang, Zhebin; Song, Jiajun; Huang, Mingxin; Yang, Biao; Li, Yuzhe; Zhu, Linghao; Luo, Qidi; Wang, Xinyu; Lu, Hao; Li, Zhang; Tang, Guozhi; Shan, Bin; Lin, Chunhui; Liu, Qi; Wu, Binghong; Feng, Hao; Liu, Hao; Huang, Can; Tang, Jingqun; Chen, Wei; Jin, Lianwen; Liu, Yuliang; Bai, Xiang
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2501.00321

Similar Items

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
by: Liu, Yuliang, et al.
Published: (2023)

Multi-modal In-Context Learning Makes an Ego-evolving Scene Text Recognizer
by: Zhao, Zhen, et al.
Published: (2023)

Harmonizing Visual Text Comprehension and Generation
by: Zhao, Zhen, et al.
Published: (2024)

DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding
by: Feng, Hao, et al.
Published: (2023)

Bridging the Gap Between End-to-End and Two-Step Text Spotting
by: Huang, Mingxin, et al.
Published: (2024)

Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
by: Feng, Hao, et al.
Published: (2025)

TabPedia: Towards Comprehensive Visual Table Understanding with Concept Synergy
by: Zhao, Weichao, et al.
Published: (2024)

Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid
by: Huang, Mingxin, et al.
Published: (2024)

TextSquare: Scaling up Text-Centric Visual Instruction Tuning
by: Tang, Jingqun, et al.
Published: (2024)

A Bounding Box is Worth One Token: Interleaving Layout and Text in a Large Language Model for Document Understanding
by: Lu, Jinghui, et al.
Published: (2024)

VimTS: A Unified Video and Image Text Spotter for Enhancing the Cross-domain Generalization
by: Liu, Yuliang, et al.
Published: (2024)

SwinTextSpotter v2: Towards Better Synergy for Scene Text Spotting
by: Huang, Mingxin, et al.
Published: (2024)

An open dataset for the evolution of oracle bone characters: EVOBC
by: Guan, Haisu, et al.
Published: (2024)

Progressive Evolution from Single-Point to Polygon for Scene Text
by: Deng, Linger, et al.
Published: (2023)

MCTBench: Multimodal Cognition towards Text-Rich Visual Scenes Benchmark
by: Shan, Bin, et al.
Published: (2024)

An open dataset for oracle bone script recognition and decipherment
by: Wang, Pengjie, et al.
Published: (2024)

WildDoc: How Far Are We from Achieving Comprehensive and Robust Document Understanding in the Wild?
by: Wang, An-Lan, et al.
Published: (2025)

Prolonged Reasoning Is Not All You Need: Certainty-Based Adaptive Routing for Efficient LLM/MLLM Reasoning
by: Lu, Jinghui, et al.
Published: (2025)

TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
by: Zhu, Hanshen, et al.
Published: (2026)

MTVQA: Benchmarking Multilingual Text-Centric Visual Question Answering
by: Tang, Jingqun, et al.
Published: (2024)

Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting
by: Feng, Hao, et al.
Published: (2026)

ParGo: Bridging Vision-Language with Partial and Global Views
by: Wang, An-Lan, et al.
Published: (2024)

The Effect of Continuous Casting Cooling Process on the Surface Quality of Low‐Nickel Austenitic Stainless Steel
by: Dong, Xianbang, et al.
Published: (2025)

Does Bank Going Public Affect the Borrowers' ESG Performance? Evidence From a Quasi‐Natural Experiment in China
by: Huang, Hao, et al.
Published: (2026)

Do environmental, social, and governance disclosure assurance reduce the cost of equity capital? Evidence from Chinese listed financial institutions
by: Huang, Hao, et al.
Published: (2025)

ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining
by: Peng, Dezhi, et al.
Published: (2023)

Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering
by: Maryam, Hiba, et al.
Published: (2024)

Bifurcated Generative Flow Networks
by: Li, Chunhui, et al.
Published: (2024)

Low Overhead Beam Alignment for Mobile Millimeter Channel Based on Continuous-Time Prediction
by: Lin, Huang-Chou, et al.
Published: (2023)

3D CoCa v2: Contrastive Learners with Test-Time Search for Generalizable Spatial Intelligence
by: Tang, Hao, et al.
Published: (2026)

Privacy-Preserving Biometric Verification with Handwritten Random Digit String
by: Zhang, Peirong, et al.
Published: (2025)

Advancing Sequential Numerical Prediction in Autoregressive Models
by: Fei, Xiang, et al.
Published: (2025)

Portrait3D: Text-Guided High-Quality 3D Portrait Generation Using Pyramid Representation and GANs Prior
by: Wu, Yiqian, et al.
Published: (2024)

CloDS: Visual-Only Unsupervised Cloth Dynamics Learning in Unknown Conditions
by: Zhan, Yuliang, et al.
Published: (2026)

L2V-CoT: Cross-Modal Transfer of Chain-of-Thought Reasoning via Latent Intervention
by: Zhan, Yuliang, et al.
Published: (2025)

Hierarchical Side-Tuning for Vision Transformers
by: Lin, Weifeng, et al.
Published: (2023)

Bi-VLDoc: Bidirectional Vision-Language Modeling for Visually-Rich Document Understanding
by: Luo, Chuwei, et al.
Published: (2022)

Puzzle Pieces Picker: Deciphering Ancient Chinese Characters with Radical Reconstruction
by: Wang, Pengjie, et al.
Published: (2024)

Deciphering Oracle Bone Language with Diffusion Models
by: Guan, Haisu, et al.
Published: (2024)

WebUOT-1M: Advancing Deep Underwater Object Tracking with A Million-Scale Benchmark
by: Zhang, Chunhui, et al.
Published: (2024)