| Main Authors: | Nalbandyan, Grigor, Shahbazyan, Rima, Bakhturina, Evelina |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2503.00137 |
Similar Items
NeMo-Inspector: A Visualization Tool for LLM Generation Analysis
by: Gitman, Daria, et al.
Published: (2025)
CORD: Balancing COnsistency and Rank Distillation for Robust Retrieval-Augmented Generation
by: Lee, Youngwon, et al.
Published: (2024)
Retrieval meets Long Context Large Language Models
by: Xu, Peng, et al.
Published: (2023)
A Chat About Boring Problems: Studying GPT-based text normalization
by: Zhang, Yang, et al.
Published: (2023)
SCORE: Specificity, Context Utilization, Robustness, and Relevance for Reference-Free LLM Evaluation
by: Shomee, Homaira Huda, et al.
Published: (2026)
Mil-SCORE: Benchmarking Long-Context Geospatial Reasoning and Planning in Large Language Models
by: Palnitkar, Aadi, et al.
Published: (2026)
Representational Curvature Modulates Behavioral Uncertainty in Large Language Models
by: King, Jack, et al.
Published: (2026)
A Proposed S.C.O.R.E. Evaluation Framework for Large Language Models : Safety, Consensus, Objectivity, Reproducibility and Explainability
by: Tan, Ting Fang, et al.
Published: (2024)
Community size rather than grammatical complexity better predicts Large Language Model accuracy in a novel Wug Test
by: Pantelidou, Nikoleta, et al.
Published: (2025)
Speaker Tagging Correction With Non-Autoregressive Language Models
by: Kirakosyan, Grigor, et al.
Published: (2024)
Language in Vivo vs. in Silico: Size Matters but Larger Language Models Still Do Not Comprehend Language on a Par with Humans Due to Impenetrable Semantic Reference
by: Dentella, Vittoria, et al.
Published: (2024)
Tracing the ongoing emergence of human-like reasoning in Large Language Models
by: Morosi, Paolo, et al.
Published: (2026)
Understanding AI Evaluation Patterns: How Different GPT Models Assess Vision-Language Descriptions
by: Abdoli, Sajjad, et al.
Published: (2025)
Quantification and object perception in Multimodal Large Language Models and human linguistic cognition
by: Montero, Raquel, et al.
Published: (2025)
AutoSCORE: Enhancing Automated Scoring with Multi-Agent Large Language Models via Structured Component Recognition
by: Wang, Yun, et al.
Published: (2025)
SCORE: A Semantic Evaluation Framework for Generative Document Parsing
by: Li, Renyu, et al.
Published: (2025)
Large Language Model probabilities cannot distinguish between possible and impossible language
by: Leivada, Evelina, et al.
Published: (2025)
AURA: Affordance-Understanding and Risk-aware Alignment Technique for Large Language Models
by: Adak, Sayantan, et al.
Published: (2025)
A Systematic Evaluation of Large Language Models for Natural Language Generation Tasks
by: Ni, Xuanfan, et al.
Published: (2024)
A Sentence is Worth a Thousand Pictures: Can Large Language Models Understand Hum4n L4ngu4ge and the W0rld behind W0rds?
by: Leivada, Evelina, et al.
Published: (2023)
Evaluating the Retrieval Robustness of Large Language Models
by: Cao, Shuyang, et al.
Published: (2025)
SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models
by: Banerjee, Somnath, et al.
Published: (2024)
NEO-BENCH: Evaluating Robustness of Large Language Models with Neologisms
by: Zheng, Jonathan, et al.
Published: (2024)
Lexicon-Level Contrastive Visual-Grounding Improves Language Modeling
by: Zhuang, Chengxu, et al.
Published: (2024)
Systematic Evaluation of Uncertainty Estimation Methods in Large Language Models
by: Hobelsberger, Christian, et al.
Published: (2025)
Sowing the Wind, Reaping the Whirlwind: The Impact of Editing Language Models
by: Hazra, Rima, et al.
Published: (2024)
Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors
by: Zhao, Raoyuan, et al.
Published: (2025)
SCORE: Story Coherence and Retrieval Enhancement for AI Narratives
by: Yi, Qiang, et al.
Published: (2025)
MathRobust-LV: Evaluation of Large Language Models' Robustness to Linguistic Variations in Mathematical Reasoning
by: Kirtane, Neeraja, et al.
Published: (2025)
S3Eval: A Synthetic, Scalable, Systematic Evaluation Suite for Large Language Models
by: Lei, Fangyu, et al.
Published: (2023)
A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks
by: Sarıtaş, Karahan, et al.
Published: (2025)
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
by: Hazra, Rima, et al.
Published: (2024)
Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations
by: Banerjee, Somnath, et al.
Published: (2025)
Evaluating Robustness of Large Audio Language Models to Audio Injection: An Empirical Study
by: Hou, Guanyu, et al.
Published: (2025)
RUPBench: Benchmarking Reasoning Under Perturbations for Robustness Evaluation in Large Language Models
by: Wang, Yuqing, et al.
Published: (2024)
Evaluating the Robustness of Analogical Reasoning in Large Language Models
by: Lewis, Martha, et al.
Published: (2024)
Bridging the Multilingual Safety Divide: Efficient, Culturally-Aware Alignment for Global South Languages
by: Banerjee, Somnath, et al.
Published: (2026)
Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability
by: Islam, Ashhadul, et al.
Published: (2025)
Enabling Inclusive Systematic Reviews: Incorporating Preprint Articles with Large Language Model-Driven Evaluations
by: Yang, Rui, et al.
Published: (2025)
LLMEval-Fair: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models
by: Zhang, Ming, et al.
Published: (2025)