Saved in:
Bibliographic Details
Main Authors: Acharjee, Suvojit, Aich, Utathya, Ali, Asfak
Format: Preprint
Published: 2025
Subjects:
Online Access:https://arxiv.org/abs/2509.12459
Tags: Add Tag
No Tags, Be the first to tag this record!
_version_ 1866908540730343424
author Acharjee, Suvojit
Aich, Utathya
Ali, Asfak
author_facet Acharjee, Suvojit
Aich, Utathya
Ali, Asfak
contents Despite advances in natural language generation and understanding, LM still struggle with fine grained linguistic phenomena such as tense, negation, voice, and modality which are the elements central to effective human communication. In the context of the United Nations SDG 4, where linguistic clarity is critical, the deployment of LMs in educational technologies demands careful scrutiny. As LMs are increasingly powering applications like tutoring systems, automated grading, and translation, their alignment with human linguistic interpretation becomes essential for effective learning. In this study, we conduct a evaluation of SOTA language models across these challenging contexts in both English and Bengali. To ensure a structured assessment, we introduce a new Route for Evaluation of Cognitive Inference in Systematic Environments guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence pairs in English and Bengali, specifically challenges these models on critical aspects of language comprehension, including negation, tense, voice variations. We assess the performance of SOTA models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard metrics like Pearson correlation, Spearman correlation, and Mean Absolute Error, as well as novel, linguistically inspired metric the HCE accuracy. The HCE accuracy measures how often model predictions fall within one standard deviation of the mean human rating, thus capturing human like tolerance for variability in language interpretation. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions. It records the highest Pearson correlation in English and demonstrates robust performance on mixed-language data, indicating a strong alignment with human judgments in cross lingual scenarios.
format Preprint
id arxiv_https___arxiv_org_abs_2509_12459
institution arXiv
publishDate 2025
record_format arxiv
spellingShingle Does Language Model Understand Language?
Acharjee, Suvojit
Aich, Utathya
Ali, Asfak
Computation and Language
Despite advances in natural language generation and understanding, LM still struggle with fine grained linguistic phenomena such as tense, negation, voice, and modality which are the elements central to effective human communication. In the context of the United Nations SDG 4, where linguistic clarity is critical, the deployment of LMs in educational technologies demands careful scrutiny. As LMs are increasingly powering applications like tutoring systems, automated grading, and translation, their alignment with human linguistic interpretation becomes essential for effective learning. In this study, we conduct a evaluation of SOTA language models across these challenging contexts in both English and Bengali. To ensure a structured assessment, we introduce a new Route for Evaluation of Cognitive Inference in Systematic Environments guidelines. Our proposed LUCID dataset, composed of carefully crafted sentence pairs in English and Bengali, specifically challenges these models on critical aspects of language comprehension, including negation, tense, voice variations. We assess the performance of SOTA models including MISTRAL-SABA-24B, LLaMA-4-Scout-17B, LLaMA-3.3-70B, Gemma2-9B, and Compound-Beta using standard metrics like Pearson correlation, Spearman correlation, and Mean Absolute Error, as well as novel, linguistically inspired metric the HCE accuracy. The HCE accuracy measures how often model predictions fall within one standard deviation of the mean human rating, thus capturing human like tolerance for variability in language interpretation. Our findings highlight Compound-Beta as the most balanced model, consistently achieving high correlations and low MAEs across diverse language conditions. It records the highest Pearson correlation in English and demonstrates robust performance on mixed-language data, indicating a strong alignment with human judgments in cross lingual scenarios.
title Does Language Model Understand Language?
topic Computation and Language
url https://arxiv.org/abs/2509.12459