| Main Authors: | Fandina, Ora Nova, Choshen, Leshem, Farchi, Eitan, Kour, George, Perlitz, Yotam, Raz, Orna |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2408.12259 |
Similar Items
Automated Validation of LLM-based Evaluators for Software Engineering Artifacts
by: Fandina, Ora Nova, et al.
Published: (2025)
Exploring Straightforward Conversational Red-Teaming
by: Kour, George, et al.
Published: (2024)
Vintage Code, Modern Judges: Meta-Validation in Low Data Regimes
by: Fandina, Ora Nova, et al.
Published: (2025)
Generating Unseen Code Tests In Infinitum
by: Zalmanovici, Marcel, et al.
Published: (2024)
Automatic Generation of Benchmarks and Reliable LLM Judgment for Code Tasks
by: Farchi, Eitan, et al.
Published: (2024)
Beyond Blind Spots: Analytic Hints for Mitigating LLM-Based Evaluation Pitfalls
by: Fandina, Ora Nova, et al.
Published: (2025)
Using Combinatorial Optimization to Design a High quality LLM Solution
by: Ackerman, Samuel, et al.
Published: (2024)
LaajMeter: A Framework for LaaJ Evaluation
by: Ackerman, Samuel, et al.
Published: (2025)
Isolating LLM Lexical Bias: A Curation-Free Triangulated Metric for Preference-Stage Learning
by: Ming, Xiaoyang, et al.
Published: (2026)
Instructions Shape Production of Language, not Processing
by: Waldis, Andreas, et al.
Published: (2026)
Prompt Engineering and the Effectiveness of Large Language Models in Enhancing Human Productivity
by: Anam, Rizal Khoirul
Published: (2025)
Evaluation of RAG Metrics for Question Answering in the Telecom Domain
by: Roychowdhury, Sujoy, et al.
Published: (2024)
SafeAnchor: Preventing Cumulative Safety Erosion in Continual Domain Adaptation of Large Language Models
by: Guo, Dongxin, et al.
Published: (2026)
RHealthTwin: Towards Responsible and Multimodal Digital Twins for Personalized Well-being
by: Ferdousi, Rahatara, et al.
Published: (2025)
Advancing Explainability in Neural Machine Translation: Analytical Metrics for Attention and Alignment Consistency
by: Mishra, Anurag
Published: (2024)
LLMs as Deceptive Agents: How Role-Based Prompting Induces Semantic Ambiguity in Puzzle Tasks
by: Yoo, Seunghyun
Published: (2025)
Bidirectional RAG: Safe Self-Improving Retrieval-Augmented Generation Through Multi-Stage Validation
by: Chinthala, Teja
Published: (2025)
OptPO: Optimal Rollout Allocation for Test-time Policy Optimization
by: Wang, Youkang, et al.
Published: (2025)
Forging GEMs: Advancing Greek NLP through Quality-Based Corpus Curation
by: Apostolopoulou, Alexandra, et al.
Published: (2025)
ReliabilityBench: Evaluating LLM Agent Reliability Under Production-Like Stress Conditions
by: Gupta, Aayush
Published: (2026)
Towards a Reliable Offline Personal AI Assistant for Long Duration Spaceflight
by: Bensch, Oliver, et al.
Published: (2024)
Survey of Swarm Intelligence Approaches to Search Documents Based On Semantic Similarity
by: Muniyappa, Chandrashekar, et al.
Published: (2025)
Semantic Needles in Document Haystacks: Sensitivity Testing of LLM-as-a-Judge Similarity Scoring
by: Aksoy, Sinan G., et al.
Published: (2026)
LLM-Assisted Crisis Management: Building Advanced LLM Platforms for Effective Emergency Response and Public Collaboration
by: Otal, Hakan T., et al.
Published: (2024)
Statistical multi-metric evaluation and visualization of LLM system predictive performance
by: Ackerman, Samuel, et al.
Published: (2025)
Empowering Tabular Data Preparation with Language Models: Why and How?
by: Chen, Mengshi, et al.
Published: (2025)
Monetizing Currency Pair Sentiments through LLM Explainability
by: Limonad, Lior, et al.
Published: (2024)
SwiftDossier: Tailored Automatic Dossier for Drug Discovery with LLMs and Agents
by: Fossi, Gabriele, et al.
Published: (2024)
Evaluation Metrics for Automated Typographic Poster Generation
by: Rebelo, Sérgio M., et al.
Published: (2024)
Multi-chain Graph Refinement and Selection for Reliable Reasoning in Large Language Models
by: Yang, Yujiao, et al.
Published: (2025)
Holmes: A Benchmark to Assess the Linguistic Competence of Language Models
by: Waldis, Andreas, et al.
Published: (2024)
OPENXRD: A Comprehensive Benchmark Framework for LLM/MLLM XRD Question Answering
by: Vosoughi, Ali, et al.
Published: (2025)
HInter: Exposing Hidden Intersectional Bias in Large Language Models
by: Souani, Badr, et al.
Published: (2025)
Pay Attention to What You Need
by: Gao, Yifei, et al.
Published: (2023)
MORQA: Benchmarking Evaluation Metrics for Medical Open-Ended Question Answering
by: Yim, Wen-wai, et al.
Published: (2025)
The Illusion of Role Separation: Hidden Shortcuts in LLM Role Learning (and How to Fix Them)
by: Wang, Zihao, et al.
Published: (2025)
Computational Social Linguistics for Telugu Cultural Preservation: Novel Algorithms for Chandassu Metrical Pattern Recognition
by: Pavan, Boddu Sri, et al.
Published: (2025)
Reference-Guided Verdict: LLMs-as-Judges in Automatic Evaluation of Free-Form QA
by: Badshah, Sher, et al.
Published: (2024)
AgentBreeder: Mitigating the AI Safety Risks of Multi-Agent Scaffolds via Self-Improvement
by: Rosser, J, et al.
Published: (2025)
Response Uncertainty and Probe Modeling: Two Sides of the Same Coin in LLM Interpretability?
by: Wang, Yongjie, et al.
Published: (2025)