| Main Authors: | Xu, Kai, Mao, YiWei, Guan, XinYi, Feng, ZiLong |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2505.07473 |
Similar Items
WebApp1K: A Practical Code-Generation Benchmark for Web App Development
by: Cui, Yi
Published: (2024)
Insights from Benchmarking Frontier Language Models on Web App Code Generation
by: Cui, Yi
Published: (2024)
WARC-Bench: Web Archive Based Benchmark for GUI Subtask Executions
by: Srivastava, Sanjari, et al.
Published: (2025)
WebCoderBench: Benchmarking Web Application Generation with Comprehensive and Interpretable Evaluation Metrics
by: Liu, Chenxu, et al.
Published: (2026)
ST-WebAgentBench: A Benchmark for Evaluating Safety and Trustworthiness in Web Agents
by: Levy, Ido, et al.
Published: (2024)
WebNovelBench: Placing LLM Novelists on the Web Novel Distribution
by: Lin, Leon, et al.
Published: (2025)
WebGen-V Bench: Structured Representation for Enhancing Visual Design in LLM-based Web Generation and Evaluation
by: Wang, Kuang-Da, et al.
Published: (2025)
EnvSimBench: A Benchmark for Evaluating and Improving LLM-Based Environment Simulation
by: Liu, Yi, et al.
Published: (2026)
WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents
by: Liu, Yinuo, et al.
Published: (2025)
EcoGEO: Trajectory-Aware Evidence Ecosystems for Web-Enabled LLM Search Agents
by: Ye, Hengwei, et al.
Published: (2026)
A Case Study of Web App Coding with OpenAI Reasoning Models
by: Cui, Yi
Published: (2024)
WebVIA: A Web-based Vision-Language Agentic Framework for Interactive and Verifiable UI-to-Code Generation
by: Xu, Mingde, et al.
Published: (2025)
Needle in the Web: A Benchmark for Retrieving Targeted Web Pages in the Wild
by: Wang, Yumeng, et al.
Published: (2025)
VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning
by: Jia, Zi-Yi, et al.
Published: (2026)
Harnessing LLM for Noise-Robust Cognitive Diagnosis in Web-Based Intelligent Education Systems
by: Zhang, Guixian, et al.
Published: (2025)
AI Planning Framework for LLM-Based Web Agents
by: Shahnovsky, Orit, et al.
Published: (2026)
GI-Bench: A Panoramic Benchmark Revealing the Knowledge-Experience Dissociation of Multimodal Large Language Models in Gastrointestinal Endoscopy Against Clinical Standards
by: Zhu, Yan, et al.
Published: (2026)
WebRenderBench: Enhancing Web Interface Generation through Layout-Style Consistency and Reinforcement Learning
by: Lai, Peichao, et al.
Published: (2025)
WebWalker: Benchmarking LLMs in Web Traversal
by: Wu, Jialong, et al.
Published: (2025)
StressWeb: A Diagnostic Benchmark for Web Agent Robustness under Realistic Interaction Variability
by: Bai, Haoyue, et al.
Published: (2026)
Tur[k]ingBench: A Challenge Benchmark for Web Agents
by: Xu, Kevin, et al.
Published: (2024)
DMind Benchmark: Toward a Holistic Assessment of LLM Capabilities across the Web3 Domain
by: Huang, Enhao, et al.
Published: (2025)
Holos: A Web-Scale LLM-Based Multi-Agent System for the Agentic Web
by: Nie, Xiaohang, et al.
Published: (2026)
WebClipper: Efficient Evolution of Web Agents with Graph-based Trajectory Pruning
by: Wang, Junjie, et al.
Published: (2026)
WebCompass: Towards Multimodal Web Coding Evaluation for Code Language Models
by: Lei, Xinping, et al.
Published: (2026)
WebQuest: A Benchmark for Multimodal QA on Web Page Sequences
by: Wang, Maria, et al.
Published: (2024)
Tests as Prompt: A Test-Driven-Development Benchmark for LLM Code Generation
by: Cui, Yi
Published: (2025)
VeriTrip: A Verifiable Benchmark for Travel Planning Agents over Unstructured Web Corpora
by: Xu, Yuting, et al.
Published: (2026)
WIPI: A New Web Threat for LLM-Driven Web Agents
by: Wu, Fangzhou, et al.
Published: (2024)
SafeGenBench: A Benchmark Framework for Security Vulnerability Detection in LLM-Generated Code
by: Li, Xinghang, et al.
Published: (2025)
WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games
by: Zhang, Wenyu, et al.
Published: (2026)
Still Manual? Automated Linter Configuration via DSL-Based LLM Compilation of Coding Standards
by: Zhang, Zejun, et al.
Published: (2026)
WebDS: An End-to-End Benchmark for Web-based Data Science
by: Hsu, Ethan, et al.
Published: (2025)
Deep Research Bench: Evaluating AI Web Research Agents
by: FutureSearch, et al.
Published: (2025)
QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation
by: Slim, Ali, et al.
Published: (2026)
DeepWeb-Bench: A Deep Research Benchmark Demanding Massive Cross-Source Evidence and Long-Horizon Derivation
by: Xie, Sixiong, et al.
Published: (2026)
VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding?
by: Liu, Junpeng, et al.
Published: (2024)
SMH-Bench: Benchmarking LLM Agents for Environment-Grounded Reasoning and Action in Smart Homes
by: Li, Kuan, et al.
Published: (2026)
WebDART: Dynamic Decomposition and Re-planning for Complex Web Tasks
by: Yang, Jingbo, et al.
Published: (2025)
Towards Adaptive ML Benchmarks: Web-Agent-Driven Construction, Domain Expansion, and Metric Optimization
by: Jia, Hangyi, et al.
Published: (2025)