Library Catalog

Bibliographic Details
Main Authors:	Hakimov, Sherzod; Bernard, Roland; Leiber, Tim; Osswald, Karl; Richert, Kristina; Yang, Ruilin; Bernardi, Raffaella; Schlangen, David
Format:	Preprint
Published:	2025
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2510.08098

Similar Items

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely
by: Kranti, Chalamalasetti, et al.
Published: (2026)

From Templates to Natural Language: Generalization Challenges in Instruction-Tuned LLMs for Spatial Reasoning
by: Kranti, Chalamalasetti, et al.
Published: (2025)

Ad-hoc Concept Forming in the Game Codenames as a Means for Evaluating Large Language Models
by: Hakimov, Sherzod, et al.
Published: (2025)

Sharing the Cost of Success: A Game for Evaluating and Learning Collaborative Multi-Agent Instruction Giving and Following Policies
by: Sadler, Philipp, et al.
Published: (2024)

Towards No-Code Programming of Cobots: Experiments with Code Synthesis by Large Code Models for Conversational Programming
by: Kranti, Chalamalasetti, et al.
Published: (2024)

Plant in Cupboard, Orange on Rably, Inat Aphone. Benchmarking Incremental Learning of Situation and Language Model using a Text-Simulated Situated Environment
by: Jordan, Jonathan, et al.
Published: (2025)

Retrieval-Augmented Code Generation for Situated Action Generation: A Case Study on Minecraft
by: Kranti, Chalamalasetti, et al.
Published: (2024)

clem:todd: A Framework for the Systematic Benchmarking of LLM-Based Task-Oriented Dialogue System Realisations
by: Kranti, Chalamalasetti, et al.
Published: (2025)

Learning Communication Policies for Different Follower Behaviors in a Collaborative Reference Game
by: Sadler, Philipp, et al.
Published: (2024)

How Many Parameters Does it Take to Change a Light Bulb? Evaluating Performance in Self-Play of Conversational Games as a Function of Model Characteristics
by: Bhavsar, Nidhir, et al.
Published: (2024)

TurkicNLP: An NLP Toolkit for Turkic Languages
by: Hakimov, Sherzod
Published: (2026)

clembench-2024: A Challenging, Dynamic, Complementary, Multilingual Benchmark and Underlying Flexible Framework for LLMs as Multi-Action Agents
by: Beyer, Anne, et al.
Published: (2024)

The Image Reconstruction Game: Drawing Common Ground Through Iterative Multimodal Dialogue
by: Hakimov, Sherzod, et al.
Published: (2026)

Unveiling Global Narratives: A Multilingual Twitter Dataset of News Media on the Russo-Ukrainian Conflict
by: Hakimov, Sherzod, et al.
Published: (2023)

M2SA: Multimodal and Multilingual Model for Sentiment Analysis of Tweets
by: Thakkar, Gaurish, et al.
Published: (2024)

A Third Paradigm for LLM Evaluation: Dialogue Game-Based Evaluation using clembench
by: Schlangen, David, et al.
Published: (2025)

Using Game Play to Investigate Multimodal and Conversational Grounding in Large Multimodal Models
by: Hakimov, Sherzod, et al.
Published: (2024)

A Systematic Analysis of Large Language Models as Soft Reasoners: The Case of Syllogistic Inferences
by: Bertolazzi, Leonardo, et al.
Published: (2024)

Free-text Rationale Generation under Readability Level Control
by: Hsu, Yi-Sheng, et al.
Published: (2024)

Playpen: An Environment for Exploring Learning Through Conversational Interaction
by: Horst, Nicola, et al.
Published: (2025)

What Are We Measuring in NLG? A Meta-Analysis of Evaluation Trends 2020-2025
by: Yang, Jing, et al.
Published: (2026)

How Language Models Conflate Logical Validity with Plausibility: A Representational Analysis of Content Effects
by: Bertolazzi, Leonardo, et al.
Published: (2025)

Strategic Responses to Personalized Pricing and Demand for Privacy: An Experiment
by: Bó, Inácio, et al.
Published: (2023)

Representations of Fact, Fiction and Forecast in Large Language Models: Epistemics and Attitudes
by: Li, Meng, et al.
Published: (2025)

The Validation Gap: A Mechanistic Analysis of How Language Models Compute Arithmetic but Fail to Validate It
by: Bertolazzi, Leonardo, et al.
Published: (2025)

Triangulating LLM Progress through Benchmarks, Games, and Cognitive Tests
by: Momentè, Filippo, et al.
Published: (2025)

Taking Action Towards Graceful Interaction: The Effects of Performing Actions on Modelling Policies for Instruction Clarification Requests
by: Madureira, Brielen, et al.
Published: (2024)

LLMs as Function Approximators: Terminology, Taxonomy, and Questions for Evaluation
by: Schlangen, David
Published: (2024)

SoT: Structured-of-Thought Prompting Guides Multilingual Reasoning in Large Language Models
by: Qi, Rui, et al.
Published: (2025)

The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models
by: Fan, Siqi, et al.
Published: (2025)

Diastereoselective Synthesis of Pyridone ribo‐C‐Nucleosides via Heck Reaction and Oxidation
by: Tim Gniech, et al.
Published: (2024)

Front Cover: Diastereoselective Synthesis of Pyridone ribo‐C‐Nucleosides via Heck Reaction and Oxidation (Eur. J. Org. Chem. 28/2024)
by: Tim Gniech, et al.
Published: (2024)

Medicina e arte uma ressonância
by: Walter Osswald
Published: (2002)

Aspectos de autoridad y poder en las ceremonias de canonización de Ignacio de Loyola y Francisco Javier en Portugal
by: Cristina Osswald
Published: (2013)

On Christian Martyrdom in Japan (1597-1658)
by: Cristina Osswald
Published: (2021)

EL OLVIDO EN LA FENOMENOLOGÍA DE HUSSERL. DOS FENÓMENOS LÍMITE
by: Andrés Osswald
Published: (2017)

A perceção dos jesuítas no mundo português: entre o trato de e o gosto por orientalia (sécs. XVI-XVII)
by: Cristina Osswald
Published: (2018)

Sete notas sobre a cura pelo nada
by: Walter Osswald
Published: (2003)

Las narrativas de "las pioneras". Cuestiones de género y moralidades en el desarrollo de la danza moderna en la Argentina (1940-1960)
by: Denise Osswald
Published: (2010)

SR-FoT: A Syllogistic-Reasoning Framework of Thought for Large Language Models Tackling Knowledge-based Reasoning Tasks
by: Wan, Wentao, et al.
Published: (2025)