| Main Author: | Broadwater, Keita |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2604.09606 |
Similar Items
Evaluating LLM Safety Under Repeated Inference via Accelerated Prompt Stress Testing
by: Broadwater, Keita
Published: (2026)
Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks
by: Zhou, Yongxi, et al.
Published: (2026)
A Hierarchical Imprecise Probability Approach to Reliability Assessment of Large Language Models
by: Aghazadeh-Chakherlou, Robab, et al.
Published: (2025)
Conventional Commit Classification using Large Language Models and Prompt Engineering
by: Quadir, H. M. Sazzad, et al.
Published: (2026)
Bridging the Gap: Empowering Small Models in Reliable OpenACC-based Parallelization via GEPA-Optimized Prompting
by: Jhaveri, Samyak, et al.
Published: (2026)
INTERVENOR: Prompting the Coding Ability of Large Language Models with the Interactive Chain of Repair
by: Wang, Hanbin, et al.
Published: (2023)
Evaluating Large Language Models for Code Review
by: Cihan, Umut, et al.
Published: (2025)
PromptPex: Automatic Test Generation for Language Model Prompts
by: Sharma, Reshabh K, et al.
Published: (2025)
Welcome Your New AI Teammate: On Safety Analysis by Leashing Large Language Models
by: Nouri, Ali, et al.
Published: (2024)
Generating Automotive Code: Large Language Models for Software Development and Verification in Safety-Critical Systems
by: Kirchner, Sven, et al.
Published: (2025)
Evaluating Large Language Models for Detecting Architectural Decision Violations
by: Su, Ruoyu, et al.
Published: (2026)
CAKE: Cloud Architecture Knowledge Evaluation of Large Language Models
by: Adam, Tim Lukas, et al.
Published: (2026)
Prompt Migration: Stabilizing GenAI Applications with Evolving Large Language Models
by: Tripathi, Shivani, et al.
Published: (2025)
Multi-Sample Prompting and Actor-Critic Prompt Optimization for Diverse Synthetic Data Generation
by: El-Hajjami, Abdelkarim, et al.
Published: (2025)
CrossPL: Evaluating Large Language Models on Cross Programming Language Code Generation
by: Xiong, Zhanhang, et al.
Published: (2025)
Beyond Accuracy: Policy Invariance as a Reliability Test for LLM Safety Judges
by: Weng, Shihao, et al.
Published: (2026)
On the Challenges of Fuzzing Techniques via Large Language Models
by: Huang, Linghan, et al.
Published: (2024)
Large Language Models as Test Case Generators: Performance Evaluation and Enhancement
by: Li, Kefan, et al.
Published: (2024)
Output Format Biases in the Evaluation of Large Language Models for Code Translation
by: Macedo, Marcos, et al.
Published: (2024)
Legal Compliance Evaluation of Smart Contracts Generated By Large Language Models
by: Wijayakoon, Chanuka, et al.
Published: (2025)
Impact of Code Context and Prompting Strategies on Automated Unit Test Generation with Modern General-Purpose Large Language Models
by: Walczak, Jakub, et al.
Published: (2025)
CodeGRAG: Bridging the Gap between Natural Language and Programming Language via Graphical Retrieval Augmented Generation
by: Du, Kounianhua, et al.
Published: (2024)
RepoDebug: Repository-Level Multi-Task and Multi-Language Debugging Evaluation of Large Language Models
by: Liu, Jingjing, et al.
Published: (2025)
AdaptEval: A Benchmark for Evaluating Large Language Models on Code Snippet Adaptation
by: Zhang, Tanghaoran, et al.
Published: (2026)
COMPASS: A Multi-Dimensional Benchmark for Evaluating Code Generation in Large Language Models
by: Meaden, James, et al.
Published: (2025)
SWE-Compass: Towards Unified Evaluation of Agentic Coding Abilities for Large Language Models
by: Xu, Jingxuan, et al.
Published: (2025)
Evaluating Large Language Models for the Generation of Unit Tests with Equivalence Partitions and Boundary Values
by: Rodríguez, Martín, et al.
Published: (2025)
Evaluating Large Language Models in Code Generation: INFINITE Methodology for Defining the Inference Index
by: Christakis, Nicholas, et al.
Published: (2025)
CodeGolf Bench: A Multi-Language Benchmark for Evaluating Concise Code Generation Capabilities of Large Language Models
by: Padwal, Vedant
Published: (2026)
Asm2SrcEval: Evaluating Large Language Models for Assembly-to-Source Code Translation
by: Hamedi, Parisa, et al.
Published: (2025)
Flow2Code: Evaluating Large Language Models for Flowchart-based Code Generation Capability
by: He, Mengliang, et al.
Published: (2025)
VulnLLMEval: A Framework for Evaluating Large Language Models in Software Vulnerability Detection and Patching
by: Zibaeirad, Arastoo, et al.
Published: (2024)
EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
by: Sharma, Aman, et al.
Published: (2026)
Towards Reliable Evaluation of Neural Program Repair with Natural Robustness Testing
by: Le-Cong, Thanh, et al.
Published: (2024)
TREAT: A Code LLMs Trustworthiness / Reliability Evaluation and Testing Framework
by: Gao, Shuzheng, et al.
Published: (2025)
RobuNFR: Evaluating the Robustness of Large Language Models on Non-Functional Requirements Aware Code Generation
by: Lin, Feng, et al.
Published: (2025)
Evaluating Large Language Models for Functional and Maintainable Code in Industrial Settings: A Case Study at ASML
by: Mundhra, Yash, et al.
Published: (2025)
CoCo-Bench: A Comprehensive Code Benchmark For Multi-task Large Language Model Evaluation
by: Yin, Wenjing, et al.
Published: (2025)
Revisiting the Reliability of Language Models in Instruction-Following
by: Dong, Jianshuo, et al.
Published: (2025)
Generating Energy-Efficient Code via Large-Language Models -- Where are we now?
by: Apsan, Radu, et al.
Published: (2025)