| Main Authors: | Li, Zhang, Yang, Biao, Liu, Qiang, Ma, Zhiyin, Zhang, Shuo, Yang, Jingxu, Sun, Yabo, Liu, Yuliang, Bai, Xiang |
|---|---|
| Format: | Preprint |
| Published: | 2023 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2311.06607 |
Similar Items
TextMonkey: An OCR-Free Large Multimodal Model for Understanding Document
by: Liu, Yuliang, et al.
Published: (2024)
LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance
by: Li, Zhang, et al.
Published: (2025)
Exploring the Capabilities of Large Multimodal Models on Dense Text
by: Zhang, Shuo, et al.
Published: (2024)
MonkeyOCR: Document Parsing with a Structure-Recognition-Relation Triplet Paradigm
by: Li, Zhang, et al.
Published: (2025)
Mini-Monkey: Alleviating the Semantic Sawtooth Effect for Lightweight MLLMs via Complementary Image Pyramid
by: Huang, Mingxin, et al.
Published: (2024)
Liquid: Language Models are Scalable and Unified Multi-modal Generators
by: Wu, Junfeng, et al.
Published: (2024)
MSTAR: Box-free Multi-query Scene Text Retrieval with Attention Recycling
by: Yin, Liang, et al.
Published: (2025)
OCRBench: On the Hidden Mystery of OCR in Large Multimodal Models
by: Liu, Yuliang, et al.
Published: (2023)
DocThinker: Explainable Multimodal Large Language Models with Rule-based Reinforcement Learning for Document Understanding
by: Yu, Wenwen, et al.
Published: (2025)
LLM-enhanced Action-aware Multi-modal Prompt Tuning for Image-Text Matching
by: Tian, Mengxiao, et al.
Published: (2025)
MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns
by: Zhang, Jiarui, et al.
Published: (2025)
OmniParser V2: Structured-Points-of-Thought for Unified Visual Text Parsing and Its Generality to Multimodal Large Language Models
by: Yu, Wenwen, et al.
Published: (2025)
Integrating Text and Image Pre-training for Multi-modal Algorithmic Reasoning
by: Zhang, Zijian, et al.
Published: (2024)
Toward Real Text Manipulation Detection: New Dataset and New Solution
by: Luo, Dongliang, et al.
Published: (2023)
Large Multi-modal Models Can Interpret Features in Large Multi-modal Models
by: Zhang, Kaichen, et al.
Published: (2024)
Sequential Visual and Semantic Consistency for Semi-supervised Text Recognition
by: Yang, Mingkun, et al.
Published: (2024)
MDPBench: A Benchmark for Multilingual Document Parsing in Real-World Scenarios
by: Li, Zhang, et al.
Published: (2026)
Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition
by: Yang, Mingkun, et al.
Published: (2024)
Bridging the Gap Between End-to-End and Two-Step Text Spotting
by: Huang, Mingxin, et al.
Published: (2024)
Text-Region Matching for Multi-Label Image Recognition with Missing Labels
by: Ma, Leilei, et al.
Published: (2024)
MSSDF: Modality-Shared Self-supervised Distillation for High-Resolution Multi-modal Remote Sensing Image Learning
by: Wang, Tong, et al.
Published: (2025)
Training-free Geometric Image Editing on Diffusion Models
by: Zhu, Hanshen, et al.
Published: (2025)
Multi-modal Reference Learning for Fine-grained Text-to-Image Retrieval
by: Ma, Zehong, et al.
Published: (2025)
The First Swahili Language Scene Text Detection and Recognition Dataset
by: Douamba, Fadila Wendigoundi, et al.
Published: (2024)
Q-Ground: Image Quality Grounding with Large Multi-modality Models
by: Chen, Chaofeng, et al.
Published: (2024)
SemiETS: Integrating Spatial and Content Consistencies for Semi-Supervised End-to-end Text Spotting
by: Luo, Dongliang, et al.
Published: (2025)
On the Multi-modal Vulnerability of Diffusion Models
by: Yang, Dingcheng, et al.
Published: (2024)
Progressive Evolution from Single-Point to Polygon for Scene Text
by: Deng, Linger, et al.
Published: (2023)
Labeled-to-Unlabeled Distribution Alignment for Partially-Supervised Multi-Organ Medical Image Segmentation
by: Jiang, Xixi, et al.
Published: (2024)
PromptCharm: Text-to-Image Generation through Multi-modal Prompting and Refinement
by: Wang, Zhijie, et al.
Published: (2024)
OmniFit: Multi-modal 3D Body Fitting via Scale-agnostic Dense Landmark Prediction
by: Cai, Zeyu, et al.
Published: (2026)
Cross-modal RAG: Sub-dimensional Text-to-Image Retrieval-Augmented Generation
by: Zhu, Mengdan, et al.
Published: (2025)
Progressively Label Enhancement for Large Language Model Alignment
by: Liu, Biao, et al.
Published: (2024)
Novel Object Synthesis via Adaptive Text-Image Harmony
by: Xiong, Zeren, et al.
Published: (2024)
TextPecker: Rewarding Structural Anomaly Quantification for Enhancing Visual Text Rendering
by: Zhu, Hanshen, et al.
Published: (2026)
SpatialLock: Precise Spatial Control in Text-to-Image Synthesis
by: Liu, Biao, et al.
Published: (2025)
PSALM: Pixelwise SegmentAtion with Large Multi-Modal Model
by: Zhang, Zheng, et al.
Published: (2024)
PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling
by: Xie, Xudong, et al.
Published: (2024)
Accountable Textual-Visual Chat Learns to Reject Human Instructions in Image Re-creation
by: Zhang, Zhiwei, et al.
Published: (2023)
Improving Multi-modal Large Language Model through Boosting Vision Capabilities
by: Sun, Yanpeng, et al.
Published: (2024)