| Main Authors: | Myung, Junho, Park, Yeon Su, Kim, Sunwoo, Yoo, Shin, Oh, Alice |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2506.21961 |
Similar Items
Social Bias Benchmark for Generation: A Comparison of Generation and QA-Based Evaluations
by: Jin, Jiho, et al.
Published: (2025)
ChEDDAR: Student-ChatGPT Dialogue in EFL Writing Education
by: Han, Jieun, et al.
Published: (2023)
RECIPE4U: Student-ChatGPT Interaction Dataset in EFL Writing Education
by: Han, Jieun, et al.
Published: (2024)
On the Effect of Uncertainty on Layer-wise Inference Dynamics
by: Kim, Sunwoo, et al.
Published: (2025)
LLM-as-a-tutor in EFL Writing Education: Focusing on Evaluation of Student-LLM Interaction
by: Han, Jieun, et al.
Published: (2023)
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
by: Jin, Jiho, et al.
Published: (2026)
Entangled in Representations: Mechanistic Investigation of Cultural Biases in Large Language Models
by: Yu, Haeun, et al.
Published: (2025)
MentalBench: A DSM-Grounded Benchmark for Evaluating Psychiatric Diagnostic Capability of Large Language Models
by: Song, Hoyun, et al.
Published: (2026)
Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models
by: Jung, Chani, et al.
Published: (2024)
FINEST: Improving LLM Responses to Sensitive Topics Through Fine-Grained Evaluation
by: Oh, Juhyun, et al.
Published: (2026)
Benchmarking Cognitive Biases in Large Language Models as Evaluators
by: Koo, Ryan, et al.
Published: (2023)
Flex-TravelPlanner: A Benchmark for Flexible Planning with Language Agents
by: Oh, Juhyun, et al.
Published: (2025)
Code-Switching In-Context Learning for Cross-Lingual Transfer of Large Language Models
by: Yoo, Haneul, et al.
Published: (2025)
Survey of Cultural Awareness in Language Models: Text and Beyond
by: Pawar, Siddhesh, et al.
Published: (2024)
What if...?: Thinking Counterfactual Keywords Helps to Mitigate Hallucination in Large Multi-modal Models
by: Kim, Junho, et al.
Published: (2024)
Exploring Cross-Cultural Differences in English Hate Speech Annotations: From Dataset Construction to Analysis
by: Lee, Nayeon, et al.
Published: (2023)
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean
by: Kim, Eunsu, et al.
Published: (2024)
DREsS: Dataset for Rubric-based Essay Scoring on EFL Writing
by: Yoo, Haneul, et al.
Published: (2024)
BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation
by: Kim, Eunsu, et al.
Published: (2025)
MELT: Materials-aware Continued Pre-training for Language Model Adaptation to Materials Science
by: Kim, Junho, et al.
Published: (2024)
English Please: Evaluating Machine Translation with Large Language Models for Multilingual Bug Reports
by: Patil, Avinash, et al.
Published: (2025)
RoleConflictBench: A Benchmark of Role Conflict Scenarios for Evaluating LLMs' Contextual Sensitivity
by: Shin, Jisu, et al.
Published: (2025)
OLA: Output Language Alignment in Code-Switched LLM Interactions
by: Oh, Juhyun, et al.
Published: (2026)
KITE: A Benchmark for Evaluating Korean Instruction-Following Abilities in Large Language Models
by: Kim, Dongjun, et al.
Published: (2025)
Benchmarking Motivational Interviewing Competence of Large Language Models
by: Jha, Aishwariya, et al.
Published: (2026)
KoBBQ: Korean Bias Benchmark for Question Answering
by: Jin, Jiho, et al.
Published: (2023)
When Tom Eats Kimchi: Evaluating Cultural Bias of Multimodal Large Language Models in Cultural Mixture Contexts
by: Kim, Jun Seong, et al.
Published: (2025)
One-Topic-Doesn't-Fit-All: Transcreating Reading Comprehension Test for Personalized Learning
by: Han, Jieun, et al.
Published: (2025)
Spotting Out-of-Character Behavior: Atomic-Level Evaluation of Persona Fidelity in Open-Ended Generation
by: Shin, Jisu, et al.
Published: (2025)
MUG-Eval: A Proxy Evaluation Framework for Multilingual Generation Capabilities in Any Language
by: Song, Seyoung, et al.
Published: (2025)
Shortened LLaMA: Depth Pruning for Large Language Models with Comparison of Retraining Methods
by: Kim, Bo-Kyeong, et al.
Published: (2024)
VALUEFLOW: Toward Pluralistic and Steerable Value-based Alignment in Large Language Models
by: Kim, Woojin, et al.
Published: (2026)
GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
by: Sung, Yoo Yeon, et al.
Published: (2025)
"I'd Like to Have an Argument, Please": Argumentative Reasoning in Large Language Models
by: de Wynter, Adrian, et al.
Published: (2023)
Are they lovers or friends? Evaluating LLMs' Social Reasoning in English and Korean Dialogues
by: Kim, Eunsu, et al.
Published: (2025)
Quantifying Conversational Reliability of Large Language Models under Multi-Turn Interaction
by: Myung, Jiyoon
Published: (2026)
How the Advent of Ubiquitous Large Language Models both Stymie and Turbocharge Dynamic Adversarial Question Generation
by: Sung, Yoo Yeon, et al.
Published: (2024)
Toward Evaluative Thinking: Meta Policy Optimization with Evolving Reward Models
by: Kim, Zae Myung, et al.
Published: (2025)
Soft Inductive Bias Approach via Explicit Reasoning Perspectives in Inappropriate Utterance Detection Using Large Language Models
by: Kim, Ju-Young, et al.
Published: (2025)
Aligning Large Language Models for Enhancing Psychiatric Interviews Through Symptom Delineation and Summarization: Pilot Study
by: So, Jae-hee, et al.
Published: (2024)