Bibliographic Details
Main Authors: Fu, Ling; Kuang, Zhebin; Song, Jiajun; Huang, Mingxin; Yang, Biao; Li, Yuzhe; Zhu, Linghao; Luo, Qidi; Wang, Xinyu; Lu, Hao; Li, Zhang; Tang, Guozhi; Shan, Bin; Lin, Chunhui; Liu, Qi; Wu, Binghong; Feng, Hao; Liu, Hao; Huang, Can; Tang, Jingqun; Chen, Wei; Jin, Lianwen; Liu, Yuliang; Bai, Xiang
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access: https://arxiv.org/abs/2501.00321
_version_ 1866916780008538112
author Fu, Ling
Kuang, Zhebin
Song, Jiajun
Huang, Mingxin
Yang, Biao
Li, Yuzhe
Zhu, Linghao
Luo, Qidi
Wang, Xinyu
Lu, Hao
Li, Zhang
Tang, Guozhi
Shan, Bin
Lin, Chunhui
Liu, Qi
Wu, Binghong
Feng, Hao
Liu, Hao
Huang, Can
Tang, Jingqun
Chen, Wei
Jin, Lianwen
Liu, Yuliang
Bai, Xiang
contents Scoring the Optical Character Recognition (OCR) capabilities of Large Multimodal Models (LMMs) has attracted growing interest. Existing benchmarks have highlighted the impressive performance of LMMs in text recognition; however, their abilities in certain challenging tasks, such as text localization, handwritten content extraction, and logical reasoning, remain underexplored. To bridge this gap, we introduce OCRBench v2, a large-scale bilingual text-centric benchmark with what is currently the most comprehensive set of tasks (4x more tasks than the previous multi-scene benchmark OCRBench), the widest coverage of scenarios (31 diverse scenarios), and thorough evaluation metrics, with 10,000 human-verified question-answering pairs and a high proportion of difficult samples. Moreover, we construct a private test set with 1,500 manually annotated images. The consistent evaluation trends observed across both public and private test sets validate the reliability of OCRBench v2. After carefully benchmarking state-of-the-art LMMs, we find that most LMMs score below 50 (out of 100) and suffer from five types of limitations: less frequently encountered text recognition, fine-grained perception, layout perception, complex element parsing, and logical reasoning. The project website is at: https://99franklin.github.io/ocrbench_v2/
format Preprint
id arxiv_https___arxiv_org_abs_2501_00321
institution arXiv
publishDate 2024
record_format arxiv
title OCRBench v2: An Improved Benchmark for Evaluating Large Multimodal Models on Visual Text Localization and Reasoning
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2501.00321