| Main Authors: | Jin, Hexi, Liu, Stephen, Li, Yuheng, Malik, Simran, Zhang, Yiying |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.16942 |
Similar Items
ConvexBench: Can LLMs Recognize Convex Functions?
by: Liu, Yepeng, et al.
Published: (2026)
Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG
by: Li, Yubo, et al.
Published: (2026)
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
by: Xie, Sixiong, et al.
Published: (2026)
Code2Bench: Scaling Source and Rigor for Dynamic Benchmark Construction
by: Zhang, Zhe, et al.
Published: (2025)
KernelBench: Can LLMs Write Efficient GPU Kernels?
by: Ouyang, Anne, et al.
Published: (2025)
LifeBench: A Benchmark for Long-Horizon Multi-Source Memory
by: Cheng, Zihao, et al.
Published: (2026)
SC-Bench: A Large-Scale Dataset for Smart Contract Auditing
by: Xia, Shihao, et al.
Published: (2024)
EXP-Bench: Can AI Conduct AI Research Experiments?
by: Kon, Patrick Tser Jern, et al.
Published: (2025)
An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation
by: Li, Bingyu, et al.
Published: (2026)
Deep Research Bench: Evaluating AI Web Research Agents
by: FutureSearch, et al.
Published: (2025)
Multimodal Multihop Source Retrieval for Web Question Answering
by: Yarrabelly, Navya, et al.
Published: (2025)
ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities
by: Zanoli, Christopher, et al.
Published: (2026)
LiteWebAgent: The Open-Source Suite for VLM-Based Web-Agent Applications
by: Zhang, Danqing, et al.
Published: (2025)
Can AI Assistance Aid in the Grading of Handwritten Answer Sheets?
by: Sil, Pritam, et al.
Published: (2024)
InfoMosaic-Bench: Evaluating Multi-Source Information Seeking in Tool-Augmented Agents
by: Du, Yaxin, et al.
Published: (2025)
Measuring Google AI Overviews: Activation, Source Quality, Claim Fidelity, and Publisher Impact
by: Xu, Haofei, et al.
Published: (2026)
Dynamic Multi-Agent Orchestration and Retrieval for Multi-Source Question-Answer Systems using Large Language Models
by: Seabra, Antony, et al.
Published: (2024)
WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
by: Liu, Chenxu, et al.
Published: (2026)
Judge Before Answer: Can MLLM Discern the False Premise in Question?
by: Li, Jidong, et al.
Published: (2025)
Generating High-Quality Datasets for Code Editing via Open-Source Language Models
by: Zhang, Zekai, et al.
Published: (2025)
BuildBench: Benchmarking LLM Agents on Compiling Real-World Open-Source Software
by: Zhang, Zehua, et al.
Published: (2025)
CocoaBench: Evaluating Unified Digital Agents in the Wild
by: CocoaBench Team, et al.
Published: (2026)
ClawBench: Can AI Agents Complete Everyday Online Tasks?
by: Zhang, Yuxuan, et al.
Published: (2026)
TOPO-Bench: An Open-Source Topological Mapping Evaluation Framework with Quantifiable Perceptual Aliasing
by: Wang, Jiaming, et al.
Published: (2025)
VerifyBench: Benchmarking Reference-based Reward Systems for Large Language Models
by: Yan, Yuchen, et al.
Published: (2025)
InteractWeb-Bench: Can Multimodal Agent Escape Blind Execution in Interactive Website Generation?
by: Wang, Qiyao, et al.
Published: (2026)
Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model
by: Zhang, Zhenxing, et al.
Published: (2025)
LJ-Spoof: A Generatively Varied Corpus for Audio Anti-Spoofing and Synthesis Source Tracing
by: Subramani, Surya, et al.
Published: (2026)
Dr. Bench: A Multidimensional Evaluation for Deep Research Agents, from Answers to Reports
by: Yao, Yang, et al.
Published: (2025)
RedacBench: Can AI Erase Your Secrets?
by: Jeon, Hyunjun, et al.
Published: (2026)
HumanStudy-Bench: Towards AI Agent Design for Participant Simulation
by: Liu, Xuan, et al.
Published: (2026)
Cookie-Bench: Continuous On-screen Key Interaction Evaluation for Web Generation
by: Yang, Haoyue, et al.
Published: (2026)
Cognify: Supercharging Gen-AI Workflows With Hierarchical Autotuning
by: He, Zijian, et al.
Published: (2025)
Web-Bench: A LLM Code Benchmark Based on Web Standards and Frameworks
by: Xu, Kai, et al.
Published: (2025)
Can Multiple Responses from an LLM Reveal the Sources of Its Uncertainty?
by: Nan, Yang, et al.
Published: (2025)
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
by: Liu, Junpeng, et al.
Published: (2024)
Prompt Sensitivity and Answer Consistency of Small Open-Source Language Models for Clinical Question Answering in Low-Resource Healthcare
by: Hariprasad, Shravani
Published: (2026)
Can Agent Conquer Web? Exploring the Frontiers of ChatGPT Atlas Agent in Web Games
by: Zhang, Jingran, et al.
Published: (2025)
WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning
by: Lai, Peichao, et al.
Published: (2025)
Evaluating AI for Law: Bridging the Gap with Open-Source Solutions
by: Bhambhoria, Rohan, et al.
Published: (2024)