:: Library Catalog

Bibliographic Details
Main Authors:	Deng, Hexuan; Ke, Xiaopeng; Li, Yichen; Hu, Ruina; Huang, Dehao; Wong, Derek F.; Wang, Yue; Liu, Xuebo; Zhang, Min
Format:	Preprint
Published:	2026
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.07905

Similar Items

AQuilt: Weaving Logic and Self-Inspection into Low-Cost, High-Relevance Data Synthesis for Specialist LLMs
by: Ke, Xiaopeng, et al.
Published: (2025)

NewTerm: Benchmarking Real-Time New Terms for Large Language Models with Annual Updates
by: Deng, Hexuan, et al.
Published: (2024)

RouterKGQA: Specialized--General Model Routing for Constraint-Aware Knowledge Graph Question Answering
by: Yuan, Bo, et al.
Published: (2026)

REA-RL: Reflection-Aware Online Reinforcement Learning for Efficient Reasoning
by: Deng, Hexuan, et al.
Published: (2025)

DRPruning: Efficient Large Language Model Pruning through Distributionally Robust Optimization
by: Deng, Hexuan, et al.
Published: (2024)

Large Language Model for Multi-Domain Translation: Benchmarking and Domain CoT Fine-tuning
by: Hu, Tianxiang, et al.
Published: (2024)

Dynamic Sampling that Adapts: Self-Aware Iterative Data Persistent Optimization for Mathematical Reasoning
by: Rao, Jun, et al.
Published: (2025)

CoEvol: Constructing Better Responses for Instruction Finetuning through Multi-Agent Cooperation
by: Li, Renhao, et al.
Published: (2024)

Stop Rewarding Hallucinated Steps: Faithfulness-Aware Step-Level Reinforcement Learning for Small Reasoning Models
by: Nie, Shuo, et al.
Published: (2026)

SelectIT: Selective Instruction Tuning for LLMs via Uncertainty-Aware Self-Reflection
by: Liu, Liangxin, et al.
Published: (2024)

SGIC: A Self-Guided Iterative Calibration Framework for RAG
by: Chen, Guanhua, et al.
Published: (2025)

CDT: A Comprehensive Capability Framework for Large Language Models Across Cognition, Domain, and Task
by: Mo, Haosi, et al.
Published: (2025)

Who Wrote This? The Key to Zero-Shot LLM-Generated Text Detection Is GECScore
by: Wu, Junchao, et al.
Published: (2024)

Bench-2-CoP: Can We Trust Benchmarking for EU AI Compliance?
by: Prandi, Matteo, et al.
Published: (2025)

DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
by: Wang, Yutong, et al.
Published: (2024)

CoAct: Co-Active LLM Preference Learning with Human-AI Synergy
by: Xu, Ruiyao, et al.
Published: (2026)

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection
by: Wang, Yutong, et al.
Published: (2026)

APT: Improving Specialist LLM Performance with Weakness Case Acquisition and Iterative Preference Training
by: Rao, Jun, et al.
Published: (2025)

Investigating CoT Monitorability in Large Reasoning Models
by: Yang, Shu, et al.
Published: (2025)

CoCoNUTS: Concentrating on Content while Neglecting Uninformative Textual Styles for AI-Generated Peer Review Detection
by: Chen, Yihan, et al.
Published: (2025)

GuideBench: Benchmarking Domain-Oriented Guideline Following for LLM Agents
by: Diao, Lingxiao, et al.
Published: (2025)

RelevAI-Reviewer: A Benchmark on AI Reviewers for Survey Paper Relevance
by: Couto, Paulo Henrique, et al.
Published: (2024)

ExecRepoBench: Multi-level Executable Code Completion Evaluation
by: Yang, Jian, et al.
Published: (2024)

OmniScientist: Toward a Co-evolving Ecosystem of Human and AI Scientists
by: Shao, Chenyang, et al.
Published: (2025)

DentalBench: Benchmarking and Advancing LLMs Capability for Bilingual Dentistry Understanding
by: Zhu, Hengchuan, et al.
Published: (2025)

TOD-ProcBench: Benchmarking Complex Instruction-Following in Task-Oriented Dialogues
by: Ghazarian, Sarik, et al.
Published: (2025)

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing
by: Żurawicki, Krzysztof, et al.
Published: (2026)

OmniGenBench: A Modular Platform for Reproducible Genomic Foundation Models Benchmarking
by: Yang, Heng, et al.
Published: (2025)

CharacterBench: Benchmarking Character Customization of Large Language Models
by: Zhou, Jinfeng, et al.
Published: (2024)

DataSciBench: An LLM Agent Benchmark for Data Science
by: Zhang, Dan, et al.
Published: (2025)

Mem2ActBench: A Benchmark for Evaluating Long-Term Memory Utilization in Task-Oriented Autonomous Agents
by: Shen, Yiting, et al.
Published: (2026)

IF-RewardBench: Benchmarking Judge Models for Instruction-Following Evaluation
by: Wen, Bosi, et al.
Published: (2026)

BenchBench: Benchmarking Automated Benchmark Generation
by: Zheng, Yandan, et al.
Published: (2026)

MHRC-Bench: A Multilingual Hardware Repository-Level Code Completion benchmark
by: Zou, Qingyun, et al.
Published: (2026)

CoCoP: Enhancing Text Classification with LLM through Code Completion Prompt
by: Mohajeri, Mohammad Mahdi, et al.
Published: (2024)

CriticBench: Benchmarking LLMs for Critique-Correct Reasoning
by: Lin, Zicheng, et al.
Published: (2024)

From Explicit CoT to Implicit CoT: Learning to Internalize CoT Step by Step
by: Deng, Yuntian, et al.
Published: (2024)

PodBench: A Comprehensive Benchmark for Instruction-Aware Audio-Oriented Podcast Script Generation
by: Xu, Chenning, et al.
Published: (2026)

Is Your Paper Being Reviewed by an LLM? Benchmarking AI Text Detection in Peer Review
by: Yu, Sungduk, et al.
Published: (2025)

ClawBench: Can AI Agents Complete Everyday Online Tasks?
by: Zhang, Yuxuan, et al.
Published: (2026)