Staff View: HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images :: Library Catalog

Bibliographic Details
Main Authors:	Yang, Yuchen; Yan, Haoran; Chen, Yanhao; Wu, Qingqiang; Hong, Qingqi
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2412.18327

_version_	1866915078224216064
author	Yang, Yuchen; Yan, Haoran; Chen, Yanhao; Wu, Qingqiang; Hong, Qingqi
author_facet	Yang, Yuchen; Yan, Haoran; Chen, Yanhao; Wu, Qingqiang; Hong, Qingqi
contents	Vision Question Answering (VQA) tasks use images to convey critical information to answer text-based questions, which is one of the most common forms of question answering in real-world scenarios. Numerous vision-text models exist today and have performed well on certain VQA tasks. However, these models exhibit significant limitations in understanding human annotations on text-heavy images. To address this, we propose the Human Annotation Understanding and Recognition (HAUR) task. As part of this effort, we introduce the Human Annotation Understanding and Recognition-5 (HAUR-5) dataset, which encompasses five common types of human annotations. Additionally, we developed and trained our model, OCR-Mix. Through comprehensive cross-model comparisons, our results demonstrate that OCR-Mix outperforms other models in this task. Our dataset and model will be released soon.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_18327
institution	arXiv
publishDate	2024
record_format	arxiv
title	HAUR: Human Annotation Understanding and Recognition Through Text-Heavy Images
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2412.18327

Similar Items