| Main Authors: | Gajjar, Pranshav, Ojo, Emmanuel, Shah, Vijay K |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.09929 |
Similar Items
TeleEmbedBench: A Multi-Corpus Embedding Benchmark for RAG in Telecommunications
by: Gajjar, Pranshav, et al.
Published: (2026)
ORAN-Bench-13K: An Open Source Benchmark for Assessing LLMs in Open Radio Access Networks
by: Gajjar, Pranshav, et al.
Published: (2024)
ORANSight-2.0: Foundational LLMs for O-RAN
by: Gajjar, Pranshav, et al.
Published: (2025)
MLCPD: A Unified Multi-Language Code Parsing Dataset with Universal AST Schema
by: Gajjar, Jugal, et al.
Published: (2025)
LLM-AUG: Robust Wireless Data Augmentation with In-Context Learning in Large Language Models
by: Gajjar, Pranshav, et al.
Published: (2026)
Tele-LLM-Hub: Building Context-Aware Multi-Agent LLM Systems for Telecom Networks
by: Gajjar, Pranshav, et al.
Published: (2025)
AI5GTest: AI-Driven Specification-Aware Automated Testing and Validation of 5G O-RAN Components
by: Ganiyu, Abiodun, et al.
Published: (2025)
Verify Before You Fix: Agentic Execution Grounding for Trustworthy Cross-Language Code Analysis
by: Gajjar, Jugal
Published: (2026)
Preserving Data Privacy for ML-driven Applications in Open Radio Access Networks
by: Gajjar, Pranshav, et al.
Published: (2024)
Enhancing Confidence Estimation in Telco LLMs via Twin-Pass CoT-Ensembling
by: Saenko, Anton, et al.
Published: (2026)
LogSieve: Task-Aware CI Log Reduction for Sustainable LLM-Based Analysis
by: Barnes, Marcus Emmanuel, et al.
Published: (2026)
Harnessing IoT and Generative AI for Weather-Adaptive Learning in Climate Resilience Education
by: Khan, Imran S. A., et al.
Published: (2025)
PostTrainBench: Can LLM Agents Automate LLM Post-Training?
by: Rank, Ben, et al.
Published: (2026)
Pruning the Unsurprising: Efficient LLM Reasoning via First-Token Surprisal
by: Zeng, Wenhao, et al.
Published: (2025)
Agint: Agentic Graph Compilation for Software Engineering Agents
by: Chivukula, Abhi, et al.
Published: (2025)
OSS-Bench: Benchmark Generator for Coding LLMs
by: Jiang, Yuancheng, et al.
Published: (2025)
CSR-Bench: Benchmarking LLM Agents in Deployment of Computer Science Research Repositories
by: Xiao, Yijia, et al.
Published: (2025)
EnvBench: A Benchmark for Automated Environment Setup
by: Eliseeva, Aleksandra, et al.
Published: (2025)
RepairBench: Leaderboard of Frontier Models for Program Repair
by: Silva, André, et al.
Published: (2024)
ThrowBench: Benchmarking LLMs by Predicting Runtime Exceptions
by: Prenner, Julian Aron, et al.
Published: (2025)
FairQuant: Certifying and Quantifying Fairness of Deep Neural Networks
by: Kim, Brian Hyeongseok, et al.
Published: (2024)
A Pragmatic Way to Measure Chain-of-Thought Monitorability
by: Emmons, Scott, et al.
Published: (2025)
AICD Bench: A Challenging Benchmark for AI-Generated Code Detection
by: Orel, Daniil, et al.
Published: (2026)
EsoLang-Bench: Evaluating Genuine Reasoning in Large Language Models via Esoteric Programming Languages
by: Sharma, Aman, et al.
Published: (2026)
MobileDev-Bench: A Benchmark for Issue Resolution in Mobile Application Development
by: Fakorede, Moshood A., et al.
Published: (2026)
InfiBench: Evaluating the Question-Answering Capabilities of Code Large Language Models
by: Li, Linyi, et al.
Published: (2024)
AssertionBench: A Benchmark to Evaluate Large-Language Models for Assertion Generation
by: Pulavarthi, Vaishnavi, et al.
Published: (2024)
CoDocBench: A Dataset for Code-Documentation Alignment in Software Maintenance
by: Pai, Kunal, et al.
Published: (2025)
CPP-UT-Bench: Can LLMs Write Complex Unit Tests in C++?
by: Bhargava, Vaishnavi, et al.
Published: (2024)
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering
by: Errica, Federico, et al.
Published: (2024)
Enhancing Resilience and Scalability in Travel Booking Systems: A Microservices Approach to Fault Tolerance, Load Balancing, and Service Discovery
by: Barua, Biman, et al.
Published: (2024)
When Data Quality Issues Collide: A Large-Scale Empirical Study of Co-Occurring Data Quality Issues in Software Defect Prediction
by: Dapaah, Emmanuel Charleson, et al.
Published: (2025)
To Err is Machine: Vulnerability Detection Challenges LLM Reasoning
by: Steenhoek, Benjamin, et al.
Published: (2024)
DeputyDev -- AI Powered Developer Assistant: Breaking the Code Review Logjam through Contextual AI to Boost Developer Productivity
by: Khare, Vishal, et al.
Published: (2025)
Insights Generator: Systematic Corpus-Level Trace Diagnostics for LLM Agents
by: Manglik, Akshay, et al.
Published: (2026)
IDE-Bench: Evaluating Large Language Models as IDE Agents on Real-World Software Engineering Tasks
by: Mateega, Spencer, et al.
Published: (2026)
Can LLMs Reason Like Automated Theorem Provers for Rust Verification? VCoT-Bench: Evaluating via Verification Chain of Thought
by: Xie, Zichen, et al.
Published: (2026)
LLM Critics Help Catch LLM Bugs
by: McAleese, Nat, et al.
Published: (2024)
Feature Noise Resilient for QoS Prediction with Probabilistic Deep Supervision
by: Wang, Ziliang, et al.
Published: (2023)
Towards Enhancing the Reproducibility of Deep Learning Bugs: An Empirical Study
by: Shah, Mehil B., et al.
Published: (2024)