Bibliographic Details
Main Authors: Alam, Nahid; Kanjula, Karthik Reddy; Guthikonda, Surya; Chung, Timothy; Vegesna, Bala Krishna S; Das, Abhipsha; Susevski, Anthony; Chan, Ryan Sze-Yin; Uddin, S M Iftekhar; Islam, Shayekh Bin; Santhosh, Roshan; A, Snegha; Sharma, Drishti; Liu, Chen; Chaturvedi, Isha; Winata, Genta Indra; S, Ashvanth.; Mukherjee, Snehanshu; Aji, Alham Fikri
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition; Computation and Language
Online Access: https://arxiv.org/abs/2412.07112
author Alam, Nahid
Kanjula, Karthik Reddy
Guthikonda, Surya
Chung, Timothy
Vegesna, Bala Krishna S
Das, Abhipsha
Susevski, Anthony
Chan, Ryan Sze-Yin
Uddin, S M Iftekhar
Islam, Shayekh Bin
Santhosh, Roshan
A, Snegha
Sharma, Drishti
Liu, Chen
Chaturvedi, Isha
Winata, Genta Indra
S, Ashvanth.
Mukherjee, Snehanshu
Aji, Alham Fikri
contents The rapid development of large Vision-Language Models (VLMs) has led to impressive results on academic benchmarks, primarily in widely spoken languages. However, significant gaps remain in the ability of current VLMs to handle low-resource languages and varied cultural contexts, largely due to a lack of high-quality, diverse, and safety-vetted data. Consequently, these models often struggle to understand low-resource languages and cultural nuances in a manner free from toxicity. To address these limitations, we introduce Maya, an open-source Multimodal Multilingual model. Our contributions are threefold: 1) a multilingual image-text pretraining dataset in eight languages, based on the LLaVA pretraining dataset; 2) a thorough analysis of toxicity within the LLaVA dataset, followed by the creation of a novel toxicity-free version across eight languages; and 3) a multilingual image-text model supporting these languages, enhancing cultural and linguistic comprehension in vision-language tasks. Code available at https://github.com/nahidalam/maya.
format Preprint
id arxiv_https___arxiv_org_abs_2412_07112
institution arXiv
publishDate 2024
record_format arxiv
title Maya: An Instruction Finetuned Multilingual Multimodal Model
topic Computer Vision and Pattern Recognition
Computation and Language
url https://arxiv.org/abs/2412.07112