:: Library Catalog

Bibliographic Details
Main Authors:	Favero, Nicole; Salute, Francesca; Hardt, Daniel
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2512.00991

Similar Items

KatzBot: Revolutionizing Academic Chatbot for Enhanced Communication
by: Kumar, Sahil, et al.
Published: (2024)

Evaluating Commercial AI Chatbots as News Intermediaries
by: Suzgun, Mirac, et al.
Published: (2026)

Evaluating language models as risk scores
by: Cruz, André F., et al.
Published: (2024)

ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
by: Contro, Jack, et al.
Published: (2025)

Test-Time Training on Nearest Neighbors for Large Language Models
by: Hardt, Moritz, et al.
Published: (2023)

Training on the Test Task Confounds Evaluation and Emergence
by: Dominguez-Olmedo, Ricardo, et al.
Published: (2024)

Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance
by: Hillebrand, Lars, et al.
Published: (2025)

LMStyle Benchmark: Evaluating Text Style Transfer for Chatbots
by: Chen, Jianlin
Published: (2024)

TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots
by: Huang, Fangrui, et al.
Published: (2026)

End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering
by: Dang, Nhi, et al.
Published: (2026)

Answer Matching Outperforms Multiple Choice for Language Model Evaluation
by: Chandak, Nikhil, et al.
Published: (2025)

Questioning the Survey Responses of Large Language Models
by: Dominguez-Olmedo, Ricardo, et al.
Published: (2023)

Limits to Predicting Online Speech Using Large Language Models
by: Remeli, Mina, et al.
Published: (2024)

SESGO: Spanish Evaluation of Stereotypical Generative Outputs
by: Robles, Melissa, et al.
Published: (2025)

LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
by: Long, Do Xuan, et al.
Published: (2024)

ARAGOG: Advanced RAG Output Grading
by: Eibich, Matouš, et al.
Published: (2024)

Advancing Fairness in Natural Language Processing: From Traditional Methods to Explainability
by: Jourdan, Fanny
Published: (2024)

Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
by: Chiang, Wei-Lin, et al.
Published: (2024)

An Improved Traditional Chinese Evaluation Suite for Foundation Model
by: Tam, Zhi-Rui, et al.
Published: (2024)

LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation
by: Chen, Yi-Pei, et al.
Published: (2024)

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)

Domain-Specific Improvement on Psychotherapy Chatbot Using Assistant
by: Kang, Cheng, et al.
Published: (2024)

Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings
by: Hamna, Hamna, et al.
Published: (2025)

Arabic Chatbot Technologies in Education: An Overview
by: Bourhil, Hicham, et al.
Published: (2025)

Evaluation of LLM Chatbots for OSINT-based Cyber Threat Awareness
by: Shafee, Samaneh, et al.
Published: (2024)

On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
by: Briakou, Eleftheria, et al.
Published: (2024)

A Course Shared Task on Evaluating LLM Output for Clinical Questions
by: Hou, Yufang, et al.
Published: (2024)

First-Person Fairness in Chatbots
by: Eloundou, Tyna, et al.
Published: (2024)

How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO
by: Ng, Man Tik, et al.
Published: (2024)

The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
by: Singh, Abhinav Kumar, et al.
Published: (2026)

Through the Prism of Culture: Evaluating LLMs' Understanding of Indian Subcultures and Traditions
by: Chhikara, Garima, et al.
Published: (2025)

ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
by: Favero, Lucile, et al.
Published: (2025)

Distinguishing Chatbot from Human
by: Godghase, Gauri Anil, et al.
Published: (2024)

Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework
by: Jain, Shomik, et al.
Published: (2025)

Mixed Chain-of-Psychotherapies for Emotional Support Chatbot
by: Chen, Siyuan, et al.
Published: (2024)

LLM Roleplay: Simulating Human-Chatbot Interaction
by: Tamoyan, Hovhannes, et al.
Published: (2024)

Sólo Escúchame: Spanish Emotional Accompaniment Chatbot
by: Ramírez, Bruno Gil, et al.
Published: (2024)

Empirical Study of Symmetrical Reasoning in Conversational Chatbots
by: Rim, Daniela N., et al.
Published: (2024)

A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case
by: Meyer, Sonia, et al.
Published: (2024)

Scaling Open-Ended Reasoning to Predict the Future
by: Chandak, Nikhil, et al.
Published: (2025)