:: Library Catalog

Bibliographic Details
Main Authors:	Gioacchini, Luca; Siracusano, Giuseppe; Sanvito, Davide; Gashteovski, Kiril; Friede, David; Bifulco, Roberto; Lawrence, Carolin
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language
Online Access:	https://arxiv.org/abs/2404.06411

Similar Items

AutoPenBench: Benchmarking Generative Agents for Penetration Testing
by: Gioacchini, Luca, et al.
Published: (2024)

MEDDxAgent: A Unified Modular Agent Framework for Explainable Automatic Differential Diagnosis
by: Rose, Daniel, et al.
Published: (2025)

What Did I Do Wrong? Quantifying LLMs' Sensitivity and Consistency to Prompt Engineering
by: Errica, Federico, et al.
Published: (2024)

Compositional Steering of Large Language Models with Steering Tokens
by: Radevski, Gorjan, et al.
Published: (2026)

Clean Up the Mess: Addressing Data Pollution in Cryptocurrency Abuse Reporting Services
by: Gomez, Gibran, et al.
Published: (2024)

TextMineX: Data, Evaluation Framework and Ontology-guided LLM Pipeline for Humanitarian Mine Action
by: Zhou, Chenyue, et al.
Published: (2025)

Leveraging Open Information Extraction for More Robust Domain Transfer of Event Trigger Detection
by: Dukić, David, et al.
Published: (2023)

Evaluating Language Models as Synthetic Data Generators
by: Kim, Seungone, et al.
Published: (2024)

LightPAL: Lightweight Passage Retrieval for Open Domain Multi-Document Summarization
by: Enomoto, Masafumi, et al.
Published: (2024)

Robust Text Classification: Analyzing Prototype-Based Networks
by: Sourati, Zhivar, et al.
Published: (2023)

Scaling Evaluation-time Compute with Reasoning Models as Evaluators
by: Kim, Seungone, et al.
Published: (2025)

Not-Just-Scaling Laws: Towards a Better Understanding of the Downstream Impact of Language Model Design Decisions
by: Liu, Emmy, et al.
Published: (2025)

AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
by: Andriushchenko, Maksym, et al.
Published: (2024)

AgentSquare: Automatic LLM Agent Search in Modular Design Space
by: Shang, Yu, et al.
Published: (2024)

TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
by: Xu, Frank F., et al.
Published: (2024)

Position: LLM Unlearning Benchmarks are Weak Measures of Progress
by: Thaker, Pratiksha, et al.
Published: (2024)

On Synthesizing Data for Context Attribution in Question Answering
by: Radevski, Gorjan, et al.
Published: (2025)

ADRA-Bank: A Modular Benchmark for Academic Deep Research Agents
by: Guo, Zhihan, et al.
Published: (2025)

AgentHallu: Benchmarking Automated Hallucination Attribution of LLM-based Agents
by: Liu, Xuannan, et al.
Published: (2026)

Benchmark Self-Evolving: A Multi-Agent Framework for Dynamic LLM Evaluation
by: Wang, Siyuan, et al.
Published: (2024)

When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents
by: Qian, Lingfei, et al.
Published: (2025)

SciAgentGym: Benchmarking Multi-Step Scientific Tool-use in LLM Agents
by: Shen, Yujiong, et al.
Published: (2026)

Conversational Health Agents: A Personalized LLM-Powered Agent Framework
by: Abbasian, Mahyar, et al.
Published: (2023)

Odysseys: Benchmarking Web Agents on Realistic Long Horizon Tasks
by: Jang, Lawrence Keunho, et al.
Published: (2026)

Automatically Benchmarking LLM Code Agents through Agent-Driven Annotation and Evaluation
by: Fu, Lingyue, et al.
Published: (2025)

PABU: Progress-Aware Belief Update for Efficient LLM Agents
by: Jiang, Haitao, et al.
Published: (2026)

Robotouille: An Asynchronous Planning Benchmark for LLM Agents
by: Gonzalez-Pumariega, Gonzalo, et al.
Published: (2025)

Agent-R1: A Unified and Modular Framework for Agentic Reinforcement Learning
by: Cheng, Mingyue, et al.
Published: (2025)

Infusing Theory of Mind into Socially Intelligent LLM Agents
by: Hwang, EunJeong, et al.
Published: (2025)

SAGE: Benchmarking and Improving Retrieval for Deep Research Agents
by: Hu, Tiansheng, et al.
Published: (2026)

FHIR-AgentBench: Benchmarking LLM Agents for Realistic Interoperable EHR Question Answering
by: Lee, Gyubok, et al.
Published: (2025)

AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress
by: Xi, Zhiheng, et al.
Published: (2025)

Benchmark Test-Time Scaling of General LLM Agents
by: Li, Xiaochuan, et al.
Published: (2026)

Towards a Realistic Long-Term Benchmark for Open-Web Research Agents
by: Mühlbacher, Peter, et al.
Published: (2024)

SPA-RL: Reinforcing LLM Agents via Stepwise Progress Attribution
by: Wang, Hanlin, et al.
Published: (2025)

Self-Improving LLM Agents at Test-Time
by: Acikgoz, Emre Can, et al.
Published: (2025)

ISD-Agent-Bench: A Comprehensive Benchmark for Evaluating LLM-based Instructional Design Agents
by: Jeon, YoungHoon, et al.
Published: (2026)

Improved LLM Agents for Financial Document Question Answering
by: Tan, Nelvin, et al.
Published: (2025)

BackdoorAgent: A Unified Framework for Backdoor Attacks on LLM-based Agents
by: Feng, Yunhao, et al.
Published: (2026)

AutoAgent: A Fully-Automated and Zero-Code Framework for LLM Agents
by: Tang, Jiabin, et al.
Published: (2025)