| Main Authors: | Favero, Nicole, Salute, Francesca, Hardt, Daniel |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2512.00991 |
Similar Items
KatzBot: Revolutionizing Academic Chatbot for Enhanced Communication
by: Kumar, Sahil, et al.
Published: (2024)
Evaluating Commercial AI Chatbots as News Intermediaries
by: Suzgun, Mirac, et al.
Published: (2026)
Evaluating language models as risk scores
by: Cruz, André F., et al.
Published: (2024)
ChatbotManip: A Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
by: Contro, Jack, et al.
Published: (2025)
Test-Time Training on Nearest Neighbors for Large Language Models
by: Hardt, Moritz, et al.
Published: (2023)
Training on the Test Task Confounds Evaluation and Emergence
by: Dominguez-Olmedo, Ricardo, et al.
Published: (2024)
Advancing Risk and Quality Assurance: A RAG Chatbot for Improved Regulatory Compliance
by: Hillebrand, Lars, et al.
Published: (2025)
LMStyle Benchmark: Evaluating Text Style Transfer for Chatbots
by: Chen, Jianlin
Published: (2024)
TherapyGym: Evaluating and Aligning Clinical Fidelity and Safety in Therapy Chatbots
by: Huang, Fangrui, et al.
Published: (2026)
End-to-End Chatbot Evaluation with Adaptive Reasoning and Uncertainty Filtering
by: Dang, Nhi, et al.
Published: (2026)
Answer Matching Outperforms Multiple Choice for Language Model Evaluation
by: Chandak, Nikhil, et al.
Published: (2025)
Questioning the Survey Responses of Large Language Models
by: Dominguez-Olmedo, Ricardo, et al.
Published: (2023)
Limits to Predicting Online Speech Using Large Language Models
by: Remeli, Mina, et al.
Published: (2024)
SESGO: Spanish Evaluation of Stereotypical Generative Outputs
by: Robles, Melissa, et al.
Published: (2025)
LLMs Are Biased Towards Output Formats! Systematically Evaluating and Mitigating Output Format Bias of LLMs
by: Long, Do Xuan, et al.
Published: (2024)
ARAGOG: Advanced RAG Output Grading
by: Eibich, Matouš, et al.
Published: (2024)
Advancing Fairness in Natural Language Processing: From Traditional Methods to Explainability
by: Jourdan, Fanny
Published: (2024)
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
by: Chiang, Wei-Lin, et al.
Published: (2024)
An Improved Traditional Chinese Evaluation Suite for Foundation Model
by: Tam, Zhi-Rui, et al.
Published: (2024)
LLM as a Scorer: The Impact of Output Order on Dialogue Evaluation
by: Chen, Yi-Pei, et al.
Published: (2024)
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)
Domain-Specific Improvement on Psychotherapy Chatbot Using Assistant
by: Kang, Cheng, et al.
Published: (2024)
Building Benchmarks from the Ground Up: Community-Centered Evaluation of LLMs in Healthcare Chatbot Settings
by: Hamna, Hamna, et al.
Published: (2025)
Arabic Chatbot Technologies in Education: An Overview
by: Bourhil, Hicham, et al.
Published: (2025)
Evaluation of LLM Chatbots for OSINT-based Cyber Threat Awareness
by: Shafee, Samaneh, et al.
Published: (2024)
On the Implications of Verbose LLM Outputs: A Case Study in Translation Evaluation
by: Briakou, Eleftheria, et al.
Published: (2024)
A Course Shared Task on Evaluating LLM Output for Clinical Questions
by: Hou, Yufang, et al.
Published: (2024)
First-Person Fairness in Chatbots
by: Eloundou, Tyna, et al.
Published: (2024)
How Well Can LLMs Echo Us? Evaluating AI Chatbots' Role-Play Ability with ECHO
by: Ng, Man Tik, et al.
Published: (2024)
The Structured Output Benchmark: A Multi-Source Benchmark for Evaluating Structured Output Quality in Large Language Models
by: Singh, Abhinav Kumar, et al.
Published: (2026)
Through the Prism of Culture: Evaluating LLMs' Understanding of Indian Subcultures and Traditions
by: Chhikara, Garima, et al.
Published: (2025)
ELLIS Alicante at CQs-Gen 2025: Winning the critical thinking questions shared task: LLM-based question generation and selection
by: Favero, Lucile, et al.
Published: (2025)
Distinguishing Chatbot from Human
by: Godghase, Gauri Anil, et al.
Published: (2024)
Task-Dependent Evaluation of LLM Output Homogenization: A Taxonomy-Guided Framework
by: Jain, Shomik, et al.
Published: (2025)
Mixed Chain-of-Psychotherapies for Emotional Support Chatbot
by: Chen, Siyuan, et al.
Published: (2024)
LLM Roleplay: Simulating Human-Chatbot Interaction
by: Tamoyan, Hovhannes, et al.
Published: (2024)
Sólo Escúchame: Spanish Emotional Accompaniment Chatbot
by: Ramírez, Bruno Gil, et al.
Published: (2024)
Empirical Study of Symmetrical Reasoning in Conversational Chatbots
by: Rim, Daniela N., et al.
Published: (2024)
A Comparison of LLM Finetuning Methods & Evaluation Metrics with Travel Chatbot Use Case
by: Meyer, Sonia, et al.
Published: (2024)
Scaling Open-Ended Reasoning to Predict the Future
by: Chandak, Nikhil, et al.
Published: (2025)