:: Library Catalog

Bibliographic Details
Main Authors:	Nalbandyan, Grigor; Shahbazyan, Rima; Bakhturina, Evelina
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2503.00137

Similar Items

NeMo-Inspector: A Visualization Tool for LLM Generation Analysis
by: Gitman, Daria, et al.
Published: (2025)

CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation
by: Lee, Youngwon, et al.
Published: (2024)

Retrieval meets Long Context Large Language Models
by: Xu, Peng, et al.
Published: (2023)

A Chat About Boring Problems: Studying GPT-based text normalization
by: Zhang, Yang, et al.
Published: (2023)

SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
by: Shomee, Homaira Huda, et al.
Published: (2026)

Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models
by: Palnitkar, Aadi, et al.
Published: (2026)

Representational Curvature Modulates Behavioral Uncertainty in Large Language Models
by: King, Jack, et al.
Published: (2026)

A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability
by: Tan, Ting Fang, et al.
Published: (2024)

Community size rather than grammatical complexity better predicts Large Language Model accuracy in a novel Wug Test
by: Pantelidou, Nikoleta, et al.
Published: (2025)

Speaker Tagging Correction With Non-Autoregressive Language Models
by: Kirakosyan, Grigor, et al.
Published: (2024)

Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans Due to Impenetrable Semantic Reference
by: Dentella, Vittoria, et al.
Published: (2024)

Tracing the ongoing emergence of human-like reasoning in Large Language Models
by: Morosi, Paolo, et al.
Published: (2026)

Understanding AI Evaluation Patterns: How Different GPT Models Assess Vision-Language Descriptions
by: Abdoli, Sajjad, et al.
Published: (2025)

Quantification and object perception in Multimodal Large Language Models and human linguistic cognition
by: Montero, Raquel, et al.
Published: (2025)

AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition
by: Wang, Yun, et al.
Published: (2025)

SCORE: A Semantic Evaluation Framework for Generative Document Parsing
by: Li, Renyu, et al.
Published: (2025)

Large Language Model probabilities cannot distinguish between possible and impossible language
by: Leivada, Evelina, et al.
Published: (2025)

AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models
by: Adak, Sayantan, et al.
Published: (2025)

A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks
by: Ni, Xuanfan, et al.
Published: (2024)

A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?
by: Leivada, Evelina, et al.
Published: (2023)

Evaluating the Retrieval Robustness of Large Language Models
by: Cao, Shuyang, et al.
Published: (2025)

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
by: Banerjee, Somnath, et al.
Published: (2024)

NEO-BENCH: Evaluating Robustness of Large Language Models with Neologisms
by: Zheng, Jonathan, et al.
Published: (2024)

Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
by: Zhuang, Chengxu, et al.
Published: (2024)

Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
by: Hobelsberger, Christian, et al.
Published: (2025)

Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
by: Hazra, Rima, et al.
Published: (2024)

Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
by: Zhao, Raoyuan, et al.
Published: (2025)

SCORE: Story Coherence and Retrieval Enhancement for AI Narratives
by: Yi, Qiang, et al.
Published: (2025)

MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning
by: Kirtane, Neeraja, et al.
Published: (2025)

S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models
by: Lei, Fangyu, et al.
Published: (2023)

A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks
by: Sarıtaş, Karahan, et al.
Published: (2025)

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
by: Hazra, Rima, et al.
Published: (2024)

Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations
by: Banerjee, Somnath, et al.
Published: (2025)

Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
by: Hou, Guanyu, et al.
Published: (2025)

RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models
by: Wang, Yuqing, et al.
Published: (2024)

Evaluating the Robustness of Analogical Reasoning in Large Language Models
by: Lewis, Martha, et al.
Published: (2024)

Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
by: Banerjee, Somnath, et al.
Published: (2026)

Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability
by: Islam, Ashhadul, et al.
Published: (2025)

Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations
by: Yang, Rui, et al.
Published: (2025)

LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
by: Zhang, Ming, et al.
Published: (2025)