:: Library Catalog

Bibliographic Details
Main Authors:	Fandina, Ora Nova, Choshen, Leshem, Farchi, Eitan, Kour, George, Perlitz, Yotam, Raz, Orna
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence 68T50
Online Access:	https://arxiv.org/abs/2408.12259

Similar Items

Automated Validation of LLM-based Evaluators for Software Engineering Artifacts
by: Fandina, Ora Nova, et al.
Published: (2025)

Exploring Straightforward Conversational Red-Teaming
by: Kour, George, et al.
Published: (2024)

Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes
by: Fandina, Ora Nova, et al.
Published: (2025)

Generating Unseen Code Tests In Infinitum
by: Zalmanovici, Marcel, et al.
Published: (2024)

Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
by: Farchi, Eitan, et al.
Published: (2024)

Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
by: Fandina, Ora Nova, et al.
Published: (2025)

Using Combinatorial Optimization to Design a High quality LLM Solution
by: Ackerman, Samuel, et al.
Published: (2024)

LaajMeter: A Framework for LaaJ Evaluation
by: Ackerman, Samuel, et al.
Published: (2025)

Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning
by: Ming, Xiaoyang, et al.
Published: (2026)

Instructions Shape Production of Language, not Processing
by: Waldis, Andreas, et al.
Published: (2026)

Prompt Engineering and the Effectiveness of Large Language Models in Enhancing Human Productivity
by: Anam, Rizal Khoirul
Published: (2025)

Evaluation of RAG Metrics for Question Answering in the Telecom Domain
by: Roychowdhury, Sujoy, et al.
Published: (2024)

SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
by: Guo, Dongxin, et al.
Published: (2026)

RHealthTwin: Towards Responsible and Multimodal Digital Twins for Personalized Well-being
by: Ferdousi, Rahatara, et al.
Published: (2025)

Advancing Explainability in Neural Machine Translation: Analytical Metrics for Attention and Alignment Consistency
by: Mishra, Anurag
Published: (2024)

LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks
by: Yoo, Seunghyun
Published: (2025)

Bidirectional RAG: Safe Self-Improving Retrieval-Augmented Generation Through Multi-Stage Validation
by: Chinthala, Teja
Published: (2025)

OptPO: Optimal Rollout Allocation for Test-time Policy Optimization
by: Wang, Youkang, et al.
Published: (2025)

Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation
by: Apostolopoulou, Alexandra, et al.
Published: (2025)

ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions
by: Gupta, Aayush
Published: (2026)

Towards a Reliable Offline Personal AI Assistant for Long Duration Spaceflight
by: Bensch, Oliver, et al.
Published: (2024)

Survey of Swarm Intelligence Approaches to Search Documents Based On Semantic Similarity
by: Muniyappa, Chandrashekar, et al.
Published: (2025)

Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
by: Aksoy, Sinan G., et al.
Published: (2026)

LLM-Assisted Crisis Management: Building Advanced LLM Platforms for Effective Emergency Response and Public Collaboration
by: Otal, Hakan T., et al.
Published: (2024)

Statistical multi-metric evaluation and visualization of LLM system predictive performance
by: Ackerman, Samuel, et al.
Published: (2025)

Empowering Tabular Data Preparation with Language Models: Why and How?
by: Chen, Mengshi, et al.
Published: (2025)

Monetizing Currency Pair Sentiments through LLM Explainability
by: Limonad, Lior, et al.
Published: (2024)

SwiftDossier: Tailored Automatic Dossier for Drug Discovery with LLMs and Agents
by: Fossi, Gabriele, et al.
Published: (2024)

Evaluation Metrics for Automated Typographic Poster Generation
by: Rebelo, Sérgio M., et al.
Published: (2024)

Multi-chain Graph Refinement and Selection for Reliable Reasoning in Large Language Models
by: Yang, Yujiao, et al.
Published: (2025)

Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
by: Waldis, Andreas, et al.
Published: (2024)

OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering
by: Vosoughi, Ali, et al.
Published: (2025)

HInter: Exposing Hidden Intersectional Bias in Large Language Models
by: Souani, Badr, et al.
Published: (2025)

Pay Attention to What You Need
by: Gao, Yifei, et al.
Published: (2023)

MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering
by: Yim, Wen-wai, et al.
Published: (2025)

The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)
by: Wang, Zihao, et al.
Published: (2025)

Computational Social Linguistics for Telugu Cultural Preservation: Novel Algorithms for Chandassu Metrical Pattern Recognition
by: Pavan, Boddu Sri, et al.
Published: (2025)

Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
by: Badshah, Sher, et al.
Published: (2024)

AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement
by: Rosser, J, et al.
Published: (2025)

Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?
by: Wang, Yongjie, et al.
Published: (2025)