:: Library Catalog

Bibliographic Details
Main Authors:	Cook, Jonathan; Rocktäschel, Tim; Foerster, Jakob; Aumiller, Dennis; Wang, Alex
Format:	Preprint
Published:	2024
Subjects:	Artificial Intelligence; Computation and Language; Human-Computer Interaction; Machine Learning
Online Access:	https://arxiv.org/abs/2410.03608

Similar Items

Creative Beam Search: LLM-as-a-Judge For Improving Response Generation
by: Franceschelli, Giorgio, et al.
Published: (2024)

LLM Attributor: Interactive Visual Attribution for LLM Generation
by: Lee, Seongmin, et al.
Published: (2024)

PREF: Reference-Free Evaluation of Personalised Text Generation in LLMs
by: Fu, Xiao, et al.
Published: (2025)

DigiData: Training and Evaluating General-Purpose Mobile Control Agents
by: Sun, Yuxuan, et al.
Published: (2025)

Properties and Challenges of LLM-Generated Explanations
by: Kunz, Jenny, et al.
Published: (2024)

Large Language Models for Cancer Communication: Evaluating Linguistic Quality, Safety, and Accessibility in Generative AI
by: Saha, Agnik, et al.
Published: (2025)

Generative UI: LLMs are Effective UI Generators
by: Leviathan, Yaniv, et al.
Published: (2026)

Can Generative AI Support Patients' & Caregivers' Informational Needs? Towards Task-Centric Evaluation Of AI Systems
by: Rajagopal, Shreya, et al.
Published: (2024)

LLM Comparator: Visual Analytics for Side-by-Side Evaluation of Large Language Models
by: Kahng, Minsuk, et al.
Published: (2024)

Can LLM-Generated Misinformation Be Detected?
by: Chen, Canyu, et al.
Published: (2023)

Programming by Backprop: An Instruction is Worth 100 Examples When Finetuning LLMs
by: Cook, Jonathan, et al.
Published: (2025)

ABLEIST: Intersectional Disability Bias in LLM-Generated Hiring Scenarios
by: Phutane, Mahika, et al.
Published: (2025)

Building Trust in Mental Health Chatbots: Safety Metrics and LLM-Based Evaluation Tools
by: Park, Jung In, et al.
Published: (2024)

The Behavior Gap: Evaluating Zero-shot LLM Agents in Complex Task-Oriented Dialogs
by: Baidya, Avinash, et al.
Published: (2025)

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?
by: Kim, Jane Paik
Published: (2026)

"They are uncultured": Unveiling Covert Harms and Social Threats in LLM Generated Conversations
by: Dammu, Preetam Prabhu Srikar, et al.
Published: (2024)

The Ideation-Execution Gap: Execution Outcomes of LLM-Generated versus Human Research Ideas
by: Si, Chenglei, et al.
Published: (2025)

Transformer Explainer: Interactive Learning of Text-Generative Models
by: Cho, Aeree, et al.
Published: (2024)

Survey of User Interface Design and Interaction Techniques in Generative AI Applications
by: Luera, Reuben, et al.
Published: (2024)

Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results
by: Liu, Jonathan, et al.
Published: (2025)

The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind
by: Lupu, Andrei, et al.
Published: (2025)

Agent Laboratory: Using LLM Agents as Research Assistants
by: Schmidgall, Samuel, et al.
Published: (2025)

DiscoverLLM: From Executing Intents to Discovering Them
by: Kim, Tae Soo, et al.
Published: (2026)

Policy Maps: Tools for Guiding the Unbounded Space of LLM Behaviors
by: Lam, Michelle S., et al.
Published: (2024)

UniAutoML: A Human-Centered Framework for Unified Discriminative and Generative AutoML with Large Language Models
by: Guo, Jiayi, et al.
Published: (2024)

SPRIG: Improving Large Language Model Performance by System Prompt Optimization
by: Zhang, Lechen, et al.
Published: (2024)

Multimodal Behavioral Patterns Analysis with Eye-Tracking and LLM-Based Reasoning
by: Guo, Dongyang, et al.
Published: (2025)

Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
by: Wu, Xiaoyuan, et al.
Published: (2025)

Language Models as Zero-Shot Trajectory Generators
by: Kwon, Teyun, et al.
Published: (2023)

Talking with Oompa Loompas: A novel framework for evaluating linguistic acquisition of LLM agents
by: Swain, Sankalp Tattwadarshi, et al.
Published: (2025)

Cross-Lingual Prompt Steerability: Towards Accurate and Robust LLM Behavior across Languages
by: Zhang, Lechen, et al.
Published: (2025)

Call2Instruct: Automated Pipeline for Generating Q&A Datasets from Call Center Recordings for LLM Fine-Tuning
by: Echeverria, Alex, et al.
Published: (2025)

Improving Dialogue Agents by Decomposing One Global Explicit Annotation with Local Implicit Multimodal Feedback
by: Lee, Dong Won, et al.
Published: (2024)

Heterogeneous Value Alignment Evaluation for Large Language Models
by: Zhang, Zhaowei, et al.
Published: (2023)

Never Start from Scratch: Expediting On-Device LLM Personalization via Explainable Model Selection
by: Wang, Haoming, et al.
Published: (2025)

Evaluating Large Language Models for Health-related Queries with Presuppositions
by: Kaur, Navreet, et al.
Published: (2023)

PRECISE Framework: GPT-based Text For Improved Readability, Reliability, and Understandability of Radiology Reports For Patient-Centered Care
by: Tripathi, Satvik, et al.
Published: (2024)

Can LLM feedback enhance review quality? A randomized study of 20K reviews at ICLR 2025
by: Thakkar, Nitya, et al.
Published: (2025)

AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science
by: Zeng, Qiuhai, et al.
Published: (2025)

EduAgent: Generative Student Agents in Learning
by: Xu, Songlin, et al.
Published: (2024)