| Main Authors: | Mahmood, Alhasan, Abdaljalil, Samir, Kurban, Hasan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.04532 |
Similar Items
Evaluating Multilingual and Code-Switched Alignment in LLMs via Synthetic Natural Language Inference
by: Abdaljalil, Samir, et al.
Published: (2025)
HalluVerse25: Fine-grained Multilingual Benchmark Dataset for LLM Hallucinations
by: Abdaljalil, Samir, et al.
Published: (2025)
Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models
by: Abdaljalil, Samir, et al.
Published: (2025)
Knowing When Not to Answer: Abstention-Aware Scientific Reasoning
by: Abdaljalil, Samir, et al.
Published: (2026)
Audit-of-Understanding: Posterior-Constrained Inference for Mathematical Reasoning in Language Models
by: Abdaljalil, Samir, et al.
Published: (2025)
Halluverse-M^3: A multitask multilingual benchmark for hallucination in LLMs
by: Abdaljalil, Samir, et al.
Published: (2026)
SINdex: Semantic INconsistency Index for Hallucination Detection in LLMs
by: Abdaljalil, Samir, et al.
Published: (2025)
4D Synchronized Fields: Motion-Language Gaussian Splatting for Temporal Scene Understanding
by: Barhdadi, Mohamed Rayan, et al.
Published: (2026)
SAFE: A Sparse Autoencoder-Based Framework for Robust Query Enrichment and Hallucination Mitigation in LLMs
by: Abdaljalil, Samir, et al.
Published: (2025)
Agent-as-a-Judge
by: You, Runyang, et al.
Published: (2026)
When Chain-of-Thought Backfires: Evaluating Prompt Sensitivity in Medical Language Models
by: Sadanandan, Binesh, et al.
Published: (2026)
Beyond LLM-as-a-Judge: Deterministic Metrics for Multilingual Generative Text Evaluation
by: Alam, Firoj, et al.
Published: (2026)
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
by: Zhang, Bang, et al.
Published: (2025)
Judging the Judges: Evaluating Alignment and Vulnerabilities in LLMs-as-Judges
by: Thakur, Aman Singh, et al.
Published: (2024)
GHaLIB: A Multilingual Framework for Hope Speech Detection in Low-Resource Languages
by: Abdullah, Ahmed, et al.
Published: (2025)
M-Prometheus: A Suite of Open Multilingual LLM Judges
by: Pombal, José, et al.
Published: (2025)
Mind2Web 2: Evaluating Agentic Search with Agent-as-a-Judge
by: Gou, Boyu, et al.
Published: (2025)
Gaming the Judge: Unfaithful Chain-of-Thought Can Undermine Agent Evaluation
by: Khalifa, Muhammad, et al.
Published: (2026)
Words as Beacons: Guiding RL Agents with High-Level Language Prompts
by: Ruiz-Gonzalez, Unai, et al.
Published: (2024)
Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models
by: Ali, Mehdi, et al.
Published: (2025)
The Roles of English in Evaluating Multilingual Language Models
by: Poelman, Wessel, et al.
Published: (2024)
Mitigating Translationese Bias in Multilingual LLM-as-a-Judge via Disentangled Information Bottleneck
by: Zhang, Hongbin, et al.
Published: (2026)
LLM-RadJudge: Achieving Radiologist-Level Evaluation for X-Ray Report Generation
by: Wang, Zilong, et al.
Published: (2024)
JudgeAgent: Beyond Static Benchmarks for Knowledge-Driven and Dynamic LLM Evaluation
by: Shi, Zhichao, et al.
Published: (2025)
Multilingual Training and Evaluation Resources for Vision-Language Models
by: Baiamonte, Daniela, et al.
Published: (2026)
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
by: Lee, Jaeyun, et al.
Published: (2026)
The Tower of Babel Revisited: Multilingual Jailbreak Prompts on Closed-Source Large Language Models
by: Huang, Linghan, et al.
Published: (2025)
Is It Good Data for Multilingual Instruction Tuning or Just Bad Multilingual Evaluation for Large Language Models?
by: Chen, Pinzhen, et al.
Published: (2024)
Dimension-Level Intent Fidelity Evaluation for Large Language Models: Evidence from Structured Prompt Ablation
by: Peng, GAng
Published: (2026)
Benchmarking Prompt Sensitivity in Large Language Models
by: Razavi, Amirhossein, et al.
Published: (2025)
JudgeBoard: Benchmarking and Enhancing Small Language Models for Reasoning Evaluation
by: Bi, Zhenyu, et al.
Published: (2025)
Systematic Weight Evaluation for Pruning Large Language Models: Enhancing Performance and Sustainability
by: Islam, Ashhadul, et al.
Published: (2025)
Flaw or Artifact? Rethinking Prompt Sensitivity in Evaluating LLMs
by: Hua, Andong, et al.
Published: (2025)
Evaluating LLM-Based Goal Extraction in Requirements Engineering: Prompting Strategies and Their Limitations
by: Arnaudo, Anna, et al.
Published: (2026)
MTQ-Eval: Multilingual Text Quality Evaluation for Language Models
by: Pokharel, Rhitabrat, et al.
Published: (2025)
Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction
by: Chuang, Yao-Shun, et al.
Published: (2026)
JAILJUDGE: A Comprehensive Jailbreak Judge Benchmark with Multi-Agent Enhanced Explanation Evaluation Framework
by: Liu, Fan, et al.
Published: (2024)
Prompting with Phonemes: Enhancing LLMs' Multilinguality for Non-Latin Script Languages
by: Nguyen, Hoang H, et al.
Published: (2024)
JudgeLM: Fine-tuned Large Language Models are Scalable Judges
by: Zhu, Lianghui, et al.
Published: (2023)
Evaluating Metrics for Safety with LLM-as-Judges
by: Clegg, Kester, et al.
Published: (2025)