Library Catalog

Bibliographic Details
Main Authors:	Li, Zhang; Yang, Biao; Liu, Qiang; Ma, Zhiyin; Zhang, Shuo; Yang, Jingxu; Sun, Yabo; Liu, Yuliang; Bai, Xiang
Format:	Preprint
Published:	2023
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2311.06607

Similar Items

TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
by: Liu, Yuliang, et al.
Published: (2024)

LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
by: Li, Zhang, et al.
Published: (2025)

Exploring the Capabilities of Large Multimodal Models on Dense Text
by: Zhang, Shuo, et al.
Published: (2024)

MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
by: Li, Zhang, et al.
Published: (2025)

Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid
by: Huang, Mingxin, et al.
Published: (2024)

Liquid: Language Models are Scalable and Unified Multi-modal Generators
by: Wu, Junfeng, et al.
Published: (2024)

MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
by: Yin, Liang, et al.
Published: (2025)

OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
by: Liu, Yuliang, et al.
Published: (2023)

DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
by: Yu, Wenwen, et al.
Published: (2025)

LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching
by: Tian, Mengxiao, et al.
Published: (2025)

MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
by: Zhang, Jiarui, et al.
Published: (2025)

OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
by: Yu, Wenwen, et al.
Published: (2025)

Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning
by: Zhang, Zijian, et al.
Published: (2024)

Toward Real Text Manipulation Detection: New Dataset and New Solution
by: Luo, Dongliang, et al.
Published: (2023)

Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
by: Zhang, Kaichen, et al.
Published: (2024)

Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition
by: Yang, Mingkun, et al.
Published: (2024)

MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
by: Li, Zhang, et al.
Published: (2026)

Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition
by: Yang, Mingkun, et al.
Published: (2024)

Bridging the Gap Between End-to-End and Two-Step Text Spotting
by: Huang, Mingxin, et al.
Published: (2024)

Text-Region Matching for Multi-Label Image Recognition with Missing Labels
by: Ma, Leilei, et al.
Published: (2024)

MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning
by: Wang, Tong, et al.
Published: (2025)

Training-free Geometric Image Editing on Diffusion Models
by: Zhu, Hanshen, et al.
Published: (2025)

Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval
by: Ma, Zehong, et al.
Published: (2025)

The First Swahili Language Scene Text Detection and Recognition Dataset
by: Douamba, Fadila Wendigoundi, et al.
Published: (2024)

Q-Ground: Image Quality Grounding with Large Multi-modality Models
by: Chen, Chaofeng, et al.
Published: (2024)

SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
by: Luo, Dongliang, et al.
Published: (2025)

On the Multi-modal Vulnerability of Diffusion Models
by: Yang, Dingcheng, et al.
Published: (2024)

Progressive Evolution from Single-Point to Polygon for Scene Text
by: Deng, Linger, et al.
Published: (2023)

Labeled-to-Unlabeled Distribution Alignment for Partially-Supervised Multi-Organ Medical Image Segmentation
by: Jiang, Xixi, et al.
Published: (2024)

PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement
by: Wang, Zhijie, et al.
Published: (2024)

OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction
by: Cai, Zeyu, et al.
Published: (2026)

Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation
by: Zhu, Mengdan, et al.
Published: (2025)

Progressively Label Enhancement for Large Language Model Alignment
by: Liu, Biao, et al.
Published: (2024)

Novel Object Synthesis via Adaptive Text-Image Harmony
by: Xiong, Zeren, et al.
Published: (2024)

TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
by: Zhu, Hanshen, et al.
Published: (2026)

SpatialLock: Precise Spatial Control in Text-to-Image Synthesis
by: Liu, Biao, et al.
Published: (2025)

PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
by: Zhang, Zheng, et al.
Published: (2024)

PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
by: Xie, Xudong, et al.
Published: (2024)

Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
by: Zhang, Zhiwei, et al.
Published: (2023)

Improving Multi-modal Large Language Model through Boosting Vision Capabilities
by: Sun, Yanpeng, et al.
Published: (2024)