:: Library Catalog

Bibliographic Details
Main Authors:	Gugg, Regina; Niederländer, Selina; Stöckl, Andreas; Flechl, Martin
Format:	Preprint
Published:	2026
Subjects:	Artificial Intelligence
Online Access:	https://arxiv.org/abs/2605.10639

Similar Items

Exploring the Impact of Personality Traits on LLM Bias and Toxicity
by: Wang, Shuo, et al.
Published: (2025)

Generative Exaggeration in LLM Social Agents: Consistency, Bias, and Toxicity
by: Nudo, Jacopo, et al.
Published: (2025)

CyBiasBench: Benchmarking Bias in LLM Agents for Cyber-Attack Scenarios
by: Lim, Taein, et al.
Published: (2026)

Assessing Bias in Metric Models for LLM Open-Ended Generation Bias Benchmarks
by: Demchak, Nathaniel, et al.
Published: (2024)

Polyrating: A Cost-Effective and Bias-Aware Rating System for LLM Evaluation
by: Dekoninck, Jasper, et al.
Published: (2024)

The influence of persona and conversational task on social interactions with a LLM-controlled embodied conversational agent
by: Kroczek, Leon O. H., et al.
Published: (2024)

Persuasiveness and Bias in LLM: Investigating the Impact of Persuasiveness and Reinforcement of Bias in Language Models
by: Roy, Saumya
Published: (2025)

Beyond Consensus: Mitigating the Agreeableness Bias in LLM Judge Evaluations
by: Jain, Suryaansh, et al.
Published: (2025)

Fairness or Fluency? An Investigation into Language Bias of Pairwise LLM-as-a-Judge
by: Zhou, Xiaolin, et al.
Published: (2026)

Investigating Gender Bias in LLM-Generated Stories via Psychological Stereotypes
by: Masoudian, Shahed, et al.
Published: (2025)

Investigating the Impact of LLM Personality on Cognitive Bias Manifestation in Automated Decision-Making Tasks
by: He, Jiangen, et al.
Published: (2025)

Can LLMs Recognize Toxicity? A Structured Investigation Framework and Toxicity Metric
by: Koh, Hyukhun, et al.
Published: (2024)

Detection and Measurement of Hailstones with Multimodal Large Language Models
by: Alker, Moritz, et al.
Published: (2025)

Towards Provably Unbiased LLM Judges via Bias-Bounded Evaluation
by: Feuer, Benjamin, et al.
Published: (2026)

Social Bias in LLM-Generated Code: Benchmark and Mitigation
by: Rabbi, Fazle, et al.
Published: (2026)

IndoBias: A Dual Track Culturally Grounded Benchmark for LLMs Bias Evaluation in Indonesian Languages
by: Hanif, Ikhlasul Akmal, et al.
Published: (2026)

No LLM is Free From Bias: A Comprehensive Study of Bias Evaluation in Large Language Models
by: Kumar, Charaka Vinayak, et al.
Published: (2025)

Does Reasoning Introduce Bias? A Study of Social Bias Evaluation and Mitigation in LLM Reasoning
by: Wu, Xuyang, et al.
Published: (2025)

BiasScope: Towards Automated Detection of Bias in LLM-as-a-Judge Evaluation
by: Lai, Peng, et al.
Published: (2026)

Enhancing LLM-based Hatred and Toxicity Detection with Meta-Toxic Knowledge Graph
by: Zhao, Yibo, et al.
Published: (2024)

When Wording Steers the Evaluation: Framing Bias in LLM judges
by: Hwang, Yerin, et al.
Published: (2026)

LLM Bias Evaluation: Gender, Racial, and Age Disparities in Occupational and Crime Scenarios
by: Mirza, Vishal, et al.
Published: (2024)

Evaluating and Mitigating LLM-as-a-judge Bias in Communication Systems
by: Gao, Jiaxin, et al.
Published: (2025)

When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation
by: Xu, Wenda, et al.
Published: (2025)

Who Gets the Mic? Investigating Gender Bias in the Speaker Assignment of a Speech-LLM
by: Puhach, Dariia, et al.
Published: (2025)

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks
by: Chandrasekar, Ashok, et al.
Published: (2026)

Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
by: Jin, Jiho, et al.
Published: (2025)

LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation
by: Ghosh, Himel, et al.
Published: (2026)

The Evaluation Game: Beyond Static LLM Benchmarking
by: Wang, Paul, et al.
Published: (2026)

Evaluation and Benchmarking of LLM Agents: A Survey
by: Mohammadi, Mahmoud, et al.
Published: (2025)

Who Benchmarks the Benchmarks? A Case Study of LLM Evaluation in Icelandic
by: Ingimundarson, Finnur Ágúst, et al.
Published: (2026)

Judging the Judges: A Systematic Evaluation of Bias Mitigation Strategies in LLM-as-a-Judge Pipelines
by: Soumik, Sadman Kabir
Published: (2026)

AI Benchmarks and Datasets for LLM Evaluation
by: Ivanov, Todor, et al.
Published: (2024)

Reasoning Language Models for complex assessments tasks: Evaluating parental cooperation from child protection case reports
by: Stoll, Dragan, et al.
Published: (2026)

Social Evolution of Published Text and The Emergence of Artificial Intelligence Through Large Language Models and The Problem of Toxicity and Bias
by: Khan, Arifa, et al.
Published: (2024)

Mobile-Env: Building Qualified Evaluation Benchmarks for LLM-GUI Interaction
by: Zhang, Danyang, et al.
Published: (2023)

On the Role of Speech Data in Reducing Toxicity Detection Bias
by: Bell, Samuel J., et al.
Published: (2024)

NC-Bench: An LLM Benchmark for Evaluating Conversational Competence
by: Moore, Robert J., et al.
Published: (2026)

Realistic Evaluation of Toxicity in Large Language Models
by: Luong, Tinh Son, et al.
Published: (2024)

BrowserArena: Evaluating LLM Agents on Real-World Web Navigation Tasks
by: Anupam, Sagnik, et al.
Published: (2025)