Staff View: MaLLaM -- Malaysia Large Language Model :: Library Catalog

Bibliographic Details
Main Authors:	Zolkepli, Husein; Razak, Aisyah; Adha, Kamarul; Nazhan, Ariff
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2401.14680

_version_	1866910309794447360
author	Zolkepli, Husein; Razak, Aisyah; Adha, Kamarul; Nazhan, Ariff
author_facet	Zolkepli, Husein; Razak, Aisyah; Adha, Kamarul; Nazhan, Ariff
contents	To address the lack of Large Language Models pretrained from scratch on Malaysian context, we trained models with 1.1 billion, 3 billion, and 5 billion parameters for a single epoch on a substantial 349GB dataset, equivalent to 90 billion tokens under our pretrained Byte Pair Encoding (BPE) tokenizer. MaLLaM contributes to enhanced natural language understanding and generation tasks in the Malay language. Although trained on a smaller dataset of 90 billion tokens, our instruction-tuned MaLLaM models perform competitively. When compared to ChatGPT3.5 and Malaysian Mistral, MaLLaM's instruction-tuned models demonstrate notable proficiency, underscoring the effectiveness of our approach in capturing and understanding the nuances of the Malaysian language. MaLLaM models mark a significant contribution to the field, providing comprehensive language representations grounded in Malaysian context. This endeavor aims to pave the way for enhanced natural language understanding and generation tasks specific to the linguistic nuances present in Malaysia. We discuss the training methodology, dataset composition, and the potential impact of MaLLaM in advancing the capabilities of large language models within the context of the Malay language. All models are released at https://huggingface.co/collections/mesolitica/mallam-6577b59d1e0b436ae75f930f
format	Preprint
id	arxiv_https___arxiv_org_abs_2401_14680
institution	arXiv
publishDate	2024
record_format	arxiv
title	MaLLaM -- Malaysia Large Language Model
topic	Computation and Language
url	https://arxiv.org/abs/2401.14680

Similar Items