:: Library Catalog

Bibliographic Details
Main Authors:	González, Sergio Gómez; Domingo, Miguel; Casacuberta, Francisco
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2602.19583

Similar Items

Comparative Evaluation of Machine Translation Systems on Images with Text
by: Puchol, Blai, et al.
Published: (2026)

Two Spelling Normalization Approaches Based on Large Language Models
by: Domingo, Miguel, et al.
Published: (2025)

Segment-Based Interactive Machine Translation for Pre-trained Models
by: Navarro, Angel, et al.
Published: (2024)

Efficient Continual Learning in Neural Machine Translation: A Low-Rank Adaptation Approach
by: Carrión, Salvador, et al.
Published: (2025)

DEEP: Edge-based Dataflow Processing with Hybrid Docker Hub and Regional Registries
by: Mehran, Narges, et al.
Published: (2025)

Large-Scale Terminal Agentic Trajectory Generation from Dockerized Environments
by: Wu, Siwei, et al.
Published: (2026)

SWE-World: Building Software Engineering Agents in Docker-Free Environments
by: Sun, Shuang, et al.
Published: (2026)

OneKE: A Dockerized Schema-Guided LLM Agent-based Knowledge Extraction System
by: Luo, Yujie, et al.
Published: (2024)

Evaluation of Oncotimia: An LLM based system for supporting tumour boards
by: Lorenzo, Luis, et al.
Published: (2026)

ExecRepoBench: Multi-level Executable Code Completion Evaluation
by: Yang, Jian, et al.
Published: (2024)

Detecting Errors through Ensembling Prompts (DEEP): An End-to-End LLM Framework for Detecting Factual Errors
by: Chandler, Alex, et al.
Published: (2024)

L0-Reasoning Bench: Evaluating Procedural Correctness in Language Models via Simple Program Execution
by: Sun, Simeng, et al.
Published: (2025)

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
by: Chiang, Wei-Lin, et al.
Published: (2024)

From SWE-ZERO to SWE-HERO: Execution-free to Execution-based Fine-tuning for Software Engineering Agents
by: Ludwig, Nikolai, et al.
Published: (2026)

Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
by: Dekoninck, Jasper, et al.
Published: (2026)

The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research
by: Bai, Xiaoyan, et al.
Published: (2026)

CodeScope: An Execution-based Multilingual Multitask Multidimensional Benchmark for Evaluating LLMs on Code Understanding and Generation
by: Yan, Weixiang, et al.
Published: (2023)

UltraEval: A Lightweight Platform for Flexible and Comprehensive Evaluation for LLMs
by: He, Chaoqun, et al.
Published: (2024)

EcoGym: Evaluating LLMs for Long-Horizon Plan-and-Execute in Interactive Economies
by: Hu, Xavier, et al.
Published: (2026)

Execution-Based Evaluation of Natural Language to Bash and PowerShell for Incident Remediation
by: Vo, Ngoc Phuoc An, et al.
Published: (2024)

VIDEE: Visual and Interactive Decomposition, Execution, and Evaluation of Text Analytics with Intelligent Agents
by: Lee, Sam Yu-Te, et al.
Published: (2025)

Cross-Platform Evaluation of Reasoning Capabilities in Foundation Models
by: de Curtò, J., et al.
Published: (2025)

Reinforcement Learning Problem Solving with Large Language Models
by: Gholamian, Sina, et al.
Published: (2024)

Jackal: A Real-World Execution-Based Benchmark Evaluating Large Language Models on Text-to-JQL Tasks
by: Frank, Kevin, et al.
Published: (2025)

MRAG-Suite: A Diagnostic Evaluation Platform for Visual Retrieval-Augmented Generation
by: Ji, Yuelyu, et al.
Published: (2025)

Towards Personalized Evaluation of Large Language Models with An Anonymous Crowd-Sourcing Platform
by: Cheng, Mingyue, et al.
Published: (2024)

Resource Management Schemes for Cloud-Native Platforms with Computing Containers of Docker and Kubernetes
by: Mao, Ying, et al.
Published: (2020)

ASPERA: A Simulated Environment to Evaluate Planning for Complex Action Execution
by: Coca, Alexandru, et al.
Published: (2025)

Large Language Model Critics for Execution-Free Evaluation of Code Changes
by: Yadavally, Aashish, et al.
Published: (2025)

The Self-Execution Benchmark: Measuring LLMs' Attempts to Overcome Their Lack of Self-Execution
by: Ezra, Elon, et al.
Published: (2025)

QuantCode-Bench: A Benchmark for Evaluating the Ability of Large Language Models to Generate Executable Algorithmic Trading Strategies
by: Khoroshilov, Alexey, et al.
Published: (2026)

CodeHalu: Investigating Code Hallucinations in LLMs via Execution-based Verification
by: Tian, Yuchen, et al.
Published: (2024)

CodeBenchGen: Creating Scalable Execution-based Code Generation Benchmarks
by: Xie, Yiqing, et al.
Published: (2024)

MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents
by: Wang, Xuehui, et al.
Published: (2025)

Query and Conquer: Execution-Guided SQL Generation
by: Borchmann, Łukasz, et al.
Published: (2025)

SPEED: Speculative Pipelined Execution for Efficient Decoding
by: Hooper, Coleman, et al.
Published: (2023)

SUPER: Evaluating Agents on Setting Up and Executing Tasks from Research Repositories
by: Bogin, Ben, et al.
Published: (2024)

Deflanderization for Game Dialogue: Balancing Character Authenticity with Task Execution in LLM-based NPCs
by: Buakhaw, Pasin, et al.
Published: (2025)

MGTEVAL: An Interactive Platform for Systematic Evaluation of Machine-Generated Text Detectors
by: Li, Yuanfan, et al.
Published: (2026)

OpenCompass: A Universal Evaluation Platform for Large Language Models
by: Cao, Maosong, et al.
Published: (2026)