Bibliographic Details
Main authors: Doan, Khang T., Huynh, Bao G., Hoang, Dung T., Pham, Thuc D., Pham, Nhat H., Nguyen, Quan T. M., Vo, Bang Q., Hoang, Suong N.
Format: Preprint
Published: 2024
Subjects: Machine Learning, Computation and Language
Online access: https://arxiv.org/abs/2408.12480
_version_ 1866909293970718720
author Doan, Khang T.
Huynh, Bao G.
Hoang, Dung T.
Pham, Thuc D.
Pham, Nhat H.
Nguyen, Quan T. M.
Vo, Bang Q.
Hoang, Suong N.
contents In this report, we introduce Vintern-1B, a reliable 1-billion-parameter multimodal large language model (MLLM) for Vietnamese language tasks. By integrating the Qwen2-0.5B-Instruct language model with the InternViT-300M-448px visual model, Vintern-1B is optimized for a range of applications, including optical character recognition (OCR), document extraction, and general question answering in Vietnamese contexts. The model is fine-tuned on an extensive dataset of over 3 million image-question-answer pairs and achieves robust performance across multiple Vietnamese language benchmarks such as OpenViVQA and ViTextVQA. Vintern-1B is small enough to fit easily into various on-device applications. Additionally, we have open-sourced several Vietnamese visual question answering (VQA) datasets for text and diagrams, created with Gemini 1.5 Flash. Our models are available at: https://huggingface.co/5CD-AI/Vintern-1B-v2.
format Preprint
id arxiv_https___arxiv_org_abs_2408_12480
institution arXiv
publishDate 2024
record_format arxiv
title Vintern-1B: An Efficient Multimodal Large Language Model for Vietnamese
topic Machine Learning
Computation and Language
url https://arxiv.org/abs/2408.12480