Bibliographic Details
Main Authors: Feng, Hao, Wei, Shu, Fei, Xiang, Shi, Wei, Han, Yingdong, Liao, Lei, Lu, Jinghui, Wu, Binghong, Liu, Qi, Lin, Chunhui, Tang, Jingqun, Liu, Hao, Huang, Can
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2505.14059
author Feng, Hao
Wei, Shu
Fei, Xiang
Shi, Wei
Han, Yingdong
Liao, Lei
Lu, Jinghui
Wu, Binghong
Liu, Qi
Lin, Chunhui
Tang, Jingqun
Liu, Hao
Huang, Can
contents Document image parsing is challenging due to its complexly intertwined elements such as text paragraphs, figures, formulas, and tables. Current approaches either assemble specialized expert models or directly generate page-level content autoregressively, facing integration overhead, efficiency bottlenecks, and layout structure degradation despite their decent performance. To address these limitations, we present Dolphin (Document Image Parsing via Heterogeneous Anchor Prompting), a novel multimodal document image parsing model following an analyze-then-parse paradigm. In the first stage, Dolphin generates a sequence of layout elements in reading order. These heterogeneous elements, serving as anchors and coupled with task-specific prompts, are fed back to Dolphin for parallel content parsing in the second stage. To train Dolphin, we construct a large-scale dataset of over 30 million samples, covering multi-granularity parsing tasks. Through comprehensive evaluations on both prevalent benchmarks and self-constructed ones, Dolphin achieves state-of-the-art performance across diverse page-level and element-level settings, while ensuring superior efficiency through its lightweight architecture and parallel parsing mechanism. The code and pre-trained models are publicly available at https://github.com/ByteDance/Dolphin
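The analyze-then-parse paradigm described in the abstract can be illustrated with a minimal sketch. This is not the Dolphin API: the names LayoutElement, PROMPTS, analyze_then_parse, fake_layout, and fake_parse are all hypothetical, the prompt strings are invented for illustration, and a thread pool stands in for whatever parallel decoding the actual model performs. The sketch only shows the control flow: stage one yields layout elements in reading order, and stage two parses each element in parallel with a prompt chosen by element type.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class LayoutElement:
    """Hypothetical stage-one output: one anchor on the page."""
    kind: str                        # e.g. "text", "table", "formula", "figure"
    bbox: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels
    order: int                       # position in reading order


# Hypothetical task-specific prompts, keyed by element type.
PROMPTS = {
    "text": "Read the text in this region.",
    "table": "Parse this table.",
    "formula": "Transcribe this formula.",
    "figure": "Describe this figure.",
}


def analyze_then_parse(
    page: str,
    analyze_layout: Callable[[str], Iterable[LayoutElement]],
    parse_element: Callable[[str, LayoutElement, str], str],
    max_workers: int = 4,
) -> List[str]:
    # Stage 1: obtain layout elements and put them in reading order.
    elements = sorted(analyze_layout(page), key=lambda e: e.order)

    # Stage 2: parse every element independently, pairing each anchor
    # with the prompt for its type. pool.map preserves input order, so
    # the results come back in reading order.
    def parse(element: LayoutElement) -> str:
        return parse_element(page, element, PROMPTS[element.kind])

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(parse, elements))


# Stand-in model calls, used only to exercise the control flow.
def fake_layout(page: str) -> List[LayoutElement]:
    return [
        LayoutElement("table", (0, 50, 100, 90), order=1),
        LayoutElement("text", (0, 0, 100, 40), order=0),
    ]


def fake_parse(page: str, element: LayoutElement, prompt: str) -> str:
    return f"{element.kind}:{prompt}"


results = analyze_then_parse("page.png", fake_layout, fake_parse)
```

Because each stage-two call depends only on its own anchor and prompt, the calls are independent of one another, which is what makes the second stage parallelizable in this sketch.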
format Preprint
id arxiv_https___arxiv_org_abs_2505_14059
institution arXiv
publishDate 2025
record_format arxiv
title Dolphin: Document Image Parsing via Heterogeneous Anchor Prompting
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2505.14059