Staff View :: Library Catalog

Bibliographic Details
Main Authors:	Doan, Khang T.; Huynh, Bao G.; Hoang, Dung T.; Pham, Thuc D.; Pham, Nhat H.; Nguyen, Quan T. M.; Vo, Bang Q.; Hoang, Suong N.
Format:	Preprint
Published:	2024
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2408.12480

_version_	1866909293970718720
author	Doan, Khang T.; Huynh, Bao G.; Hoang, Dung T.; Pham, Thuc D.; Pham, Nhat H.; Nguyen, Quan T. M.; Vo, Bang Q.; Hoang, Suong N.
author_facet	Doan, Khang T.; Huynh, Bao G.; Hoang, Dung T.; Pham, Thuc D.; Pham, Nhat H.; Nguyen, Quan T. M.; Vo, Bang Q.; Hoang, Suong N.
contents	In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs, achieving robust performance and reliable results across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_12480
institution	arXiv
publishDate	2024
record_format	arxiv
title	Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2408.12480
