Bibliographic Details
Main Authors: Feng, Hao; Wei, Shu; Fei, Xiang; Shi, Wei; Han, Yingdong; Liao, Lei; Lu, Jinghui; Wu, Binghong; Liu, Qi; Lin, Chunhui; Tang, Jingqun; Liu, Hao; Huang, Can
Format: Preprint
Published: 2025
Online Access: https://arxiv.org/abs/2505.14059
Table of Contents:
  • Document image parsing is challenging because page elements such as text paragraphs, figures, formulas, and tables are complexly intertwined. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively; despite decent performance, they face integration overhead, efficiency bottlenecks, and layout structure degradation. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
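The analyze-then-parse flow described in the abstract can be pictured with a minimal sketch. This is not Dolphin's code or API: `Element`, `PROMPTS`, `analyze_layout`, `parse_element`, and `parse_page` are hypothetical names, the prompt wording is invented for illustration, and both stages are plain-Python stand-ins for model calls. A thread pool is used here only as a simple way to show element-level parsing running in parallel while the output stays in reading order.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical task-specific prompts, keyed by element type.
# The wording is illustrative and not taken from the paper.
PROMPTS = {
    "text": "Read the text in the image.",
    "table": "Parse the table in the image.",
    "formula": "Read the formula in the image.",
}


@dataclass
class Element:
    kind: str    # element type, e.g. "text", "table", "formula"
    bbox: tuple  # (x0, y0, x1, y1) in page pixels
    order: int   # position in reading order


def analyze_layout(page):
    """Stage 1 stand-in: return layout elements in reading order.

    A real model would predict these from the page image; here they are
    read from a pre-annotated dict so the sketch stays runnable.
    """
    return [Element(kind, bbox, i) for i, (kind, bbox) in enumerate(page["regions"])]


def parse_element(page, element):
    """Stage 2 stand-in: parse one element with a prompt chosen by its type.

    A real model would receive the cropped region (the anchor) plus the
    prompt; this stub only echoes both.
    """
    prompt = PROMPTS[element.kind]
    return f"[{element.kind}@{element.bbox}] {prompt}"


def parse_page(page, max_workers=4):
    """Run stage 1, then parse all elements concurrently."""
    elements = analyze_layout(page)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so reading order is kept.
        return list(pool.map(lambda e: parse_element(page, e), elements))
```

In this sketch the per-element calls are independent of one another, which is what allows them to be dispatched together once the layout sequence is known; an actual model would more likely batch the cropped regions on an accelerator than use threads.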