| Main Authors: | Gioacchini, Luca, Siracusano, Giuseppe, Sanvito, Davide, Gashteovski, Kiril, Friede, David, Bifulco, Roberto, Lawrence, Carolin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.06411 |
Similar Items
AutoPenBench: Benchmarking Generative Agents for Penetration Testing
by: Gioacchini, Luca, et al.
Published: (2024)
MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis
by: Rose, Daniel, et al.
Published: (2025)
What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering
by: Errica, Federico, et al.
Published: (2024)
Compositional Steering of Large Language Models with Steering Tokens
by: Radevski, Gorjan, et al.
Published: (2026)
Clean Up the Mess: Addressing Data Pollution in Cryptocurrency Abuse Reporting Services
by: Gomez, Gibran, et al.
Published: (2024)
TextMineX: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action
by: Zhou, Chenyue, et al.
Published: (2025)
Leveraging Open Information Extraction for More Robust Domain Transfer of Event Trigger Detection
by: Dukić, David, et al.
Published: (2023)
Evaluating Language Models as Synthetic Data Generators
by: Kim, Seungone, et al.
Published: (2024)
LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization
by: Enomoto, Masafumi, et al.
Published: (2024)
Robust Text Classification: Analyzing Prototype-Based Networks
by: Sourati, Zhivar, et al.
Published: (2023)
Scaling Evaluation-time Compute with Reasoning Models as Evaluators
by: Kim, Seungone, et al.
Published: (2025)
Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
by: Liu, Emmy, et al.
Published: (2025)
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
by: Andriushchenko, Maksym, et al.
Published: (2024)
AgentSquare: Automatic LLM Agent Search in Modular Design Space
by: Shang, Yu, et al.
Published: (2024)
TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
by: Xu, Frank F., et al.
Published: (2024)
Position: LLM Unlearning Benchmarks are Weak Measures of Progress
by: Thaker, Pratiksha, et al.
Published: (2024)
On Synthesizing Data for Context Attribution in Question Answering
by: Radevski, Gorjan, et al.
Published: (2025)
ADRA-Bank: A Modular Benchmark for Academic Deep Research Agents
by: Guo, Zhihan, et al.
Published: (2025)
AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents
by: Liu, Xuannan, et al.
Published: (2026)
Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
by: Wang, Siyuan, et al.
Published: (2024)
When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
by: Qian, Lingfei, et al.
Published: (2025)
SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
by: Shen, Yujiong, et al.
Published: (2026)
Conversational Health Agents: A Personalized LLM-Powered Agent Framework
by: Abbasian, Mahyar, et al.
Published: (2023)
Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
by: Jang, Lawrence Keunho, et al.
Published: (2026)
Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)
PABU: Progress-Aware Belief Update for Efficient LLM Agents
by: Jiang, Haitao, et al.
Published: (2026)
Robotouille: An Asynchronous Planning Benchmark for LLM Agents
by: Gonzalez-Pumariega, Gonzalo, et al.
Published: (2025)
Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning
by: Cheng, Mingyue, et al.
Published: (2025)
Infusing Theory of Mind into Socially Intelligent LLM Agents
by: Hwang, EunJeong, et al.
Published: (2025)
SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
by: Hu, Tiansheng, et al.
Published: (2026)
FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)
AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
by: Xi, Zhiheng, et al.
Published: (2025)
Benchmark Test-Time Scaling of General LLM Agents
by: Li, Xiaochuan, et al.
Published: (2026)
Towards a Realistic Long-Term Benchmark for Open-Web Research Agents
by: Mühlbacher, Peter, et al.
Published: (2024)
SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution
by: Wang, Hanlin, et al.
Published: (2025)
Self-Improving LLM Agents at Test-Time
by: Acikgoz, Emre Can, et al.
Published: (2025)
ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents
by: Jeon, YoungHoon, et al.
Published: (2026)
Improved LLM Agents for Financial Document Question Answering
by: Tan, Nelvin, et al.
Published: (2025)
BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents
by: Feng, Yunhao, et al.
Published: (2026)
AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents
by: Tang, Jiabin, et al.
Published: (2025)