| Main Authors: | González, Sergio Gómez; Domingo, Miguel; Casacuberta, Francisco |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.19583 |
Similar Items
Comparative Evaluation of Machine Translation Systems on Images with Text
by: Puchol, Blai, et al.
Published: (2026)
Two Spelling Normalization Approaches Based on Large Language Models
by: Domingo, Miguel, et al.
Published: (2025)
Segment-Based Interactive Machine Translation for Pre-trained Models
by: Navarro, Angel, et al.
Published: (2024)
Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach
by: Carrión, Salvador, et al.
Published: (2025)
DEEP: Edge-based Dataflow Processing with Hybrid Docker Hub and Regional Registries
by: Mehran, Narges, et al.
Published: (2025)
Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments
by: Wu, Siwei, et al.
Published: (2026)
SWE-World: Building Software Engineering Agents in Docker-Free Environments
by: Sun, Shuang, et al.
Published: (2026)
OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System
by: Luo, Yujie, et al.
Published: (2024)
Evaluation of Oncotimia: An LLM based system for supporting tumour boards
by: Lorenzo, Luis, et al.
Published: (2026)
ExecRepoBench: Multi-level Executable Code Completion Evaluation
by: Yang, Jian, et al.
Published: (2024)
Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors
by: Chandler, Alex, et al.
Published: (2024)
L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution
by: Sun, Simeng, et al.
Published: (2025)
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
by: Chiang, Wei-Lin, et al.
Published: (2024)
From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
by: Ludwig, Nikolai, et al.
Published: (2026)
Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
by: Dekoninck, Jasper, et al.
Published: (2026)
The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research
by: Bai, Xiaoyan, et al.
Published: (2026)
CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
by: Yan, Weixiang, et al.
Published: (2023)
UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs
by: He, Chaoqun, et al.
Published: (2024)
EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
by: Hu, Xavier, et al.
Published: (2026)
Execution-Based Evaluation of Natural Language to Bash and PowerShell for Incident Remediation
by: Vo, Ngoc Phuoc An, et al.
Published: (2024)
VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
by: Lee, Sam Yu-Te, et al.
Published: (2025)
Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
by: de Curtò, J., et al.
Published: (2025)
Reinforcement Learning Problem Solving with Large Language Models
by: Gholamian, Sina, et al.
Published: (2024)
Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks
by: Frank, Kevin, et al.
Published: (2025)
MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation
by: Ji, Yuelyu, et al.
Published: (2025)
Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform
by: Cheng, Mingyue, et al.
Published: (2024)
Resource Management Schemes for Cloud-Native Platforms with Computing Containers of Docker and Kubernetes
by: Mao, Ying, et al.
Published: (2020)
ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution
by: Coca, Alexandru, et al.
Published: (2025)
Large Language Model Critics for Execution-Free Evaluation of Code Changes
by: Yadavally, Aashish, et al.
Published: (2025)
The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution
by: Ezra, Elon, et al.
Published: (2025)
QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies
by: Khoroshilov, Alexey, et al.
Published: (2026)
CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification
by: Tian, Yuchen, et al.
Published: (2024)
CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
by: Xie, Yiqing, et al.
Published: (2024)
MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
by: Wang, Xuehui, et al.
Published: (2025)
Query and Conquer: Execution-Guided SQL Generation
by: Borchmann, Łukasz, et al.
Published: (2025)
SPEED: Speculative Pipelined Execution for Efficient Decoding
by: Hooper, Coleman, et al.
Published: (2023)
SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
by: Bogin, Ben, et al.
Published: (2024)
Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
by: Buakhaw, Pasin, et al.
Published: (2025)
MGTEVAL: An Interactive Platform for Systemtic Evaluation of Machine-Generated Text Detectors
by: Li, Yuanfan, et al.
Published: (2026)
OpenCompass: A Universal Evaluation Platform for Large Language Models
by: Cao, Maosong, et al.
Published: (2026)