Staff View: MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset

Bibliographic Details
Main Authors:	Shaier, Sagi; Baker, George Arthur; Sridhar, Chiranthan; Hunter, Lawrence E; von der Wense, Katharina
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2412.10105

_version_	1866910965781495808
author	Shaier, Sagi; Baker, George Arthur; Sridhar, Chiranthan; Hunter, Lawrence E; von der Wense, Katharina
author_facet	Shaier, Sagi; Baker, George Arthur; Sridhar, Chiranthan; Hunter, Lawrence E; von der Wense, Katharina
contents	Language models (LMs) have excelled in various broad domains. However, to ensure their safe and effective integration into real-world educational settings, they must demonstrate proficiency in specific, granular areas of knowledge. Existing cloze-style benchmarks, commonly used to evaluate LMs' knowledge, have three major limitations. They: 1) do not cover the educational domain; 2) typically focus on low-complexity, generic knowledge or broad domains, which do not adequately assess the models' knowledge in specific subjects; and 3) often rely on templates that can bias model predictions. Here, we introduce MALAMUTE, a multilingual, template-free, and highly granular probing dataset comprising expert-written, peer-reviewed probes from 71 university-level textbooks across three languages (English, Spanish, and Polish). MALAMUTE is the first education-based cloze-style dataset. It covers eight domains, each with up to 14 subdomains, further broken down into concepts and concept-based prompts, totaling 33,361 university curriculum concepts and 116,887 prompts. MALAMUTE's fine granularity, educational focus, and inclusion of both sentence-level and paragraph-level prompts make it an ideal tool for evaluating LMs' course-related knowledge. Our evaluation of masked and causal LMs on MALAMUTE shows that despite overall proficiency, they have significant gaps in knowledge when examined closely on specific subjects, hindering their safe use in classrooms and underscoring the need for further development.
format	Preprint
id	arxiv_https___arxiv_org_abs_2412_10105
institution	arXiv
publishDate	2024
record_format	arxiv
title	MALAMUTE: A Multilingual, Highly-granular, Template-free, Education-based Probing Dataset
topic	Computation and Language
url	https://arxiv.org/abs/2412.10105
