:: Library Catalog

Bibliographic Details
Main Authors:	Son, Guijin; Yoon, Dongkeun; Suk, Juyoung; Aula-Blasco, Javier; Aslan, Mano; Kim, Vu Trong; Islam, Shayekh Bin; Prats-Cristià, Jaume; Tormo-Bañuelos, Lucía; Kim, Seungone
Format:	Preprint
Published:	2024
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2410.17578

Similar Items

LangBridge: Multilingual Reasoning Without Multilingual Supervision
by: Yoon, Dongkeun, et al.
Published: (2024)

M-Prometheus: A Suite of Open Multilingual LLM Judges
by: Pombal, José, et al.
Published: (2025)

LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
by: Son, Guijin, et al.
Published: (2024)

LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
by: Kim, Eunsu, et al.
Published: (2024)

Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?
by: Son, Guijin, et al.
Published: (2024)

K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
by: Lee, Nahyun, et al.
Published: (2026)

Reasoning Models Better Express Their Confidence
by: Yoon, Dongkeun, et al.
Published: (2025)

Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap
by: Ko, Hyunwoo, et al.
Published: (2025)

Revisiting the Uniform Information Density Hypothesis in LLM Reasoning
by: Gwak, Minju, et al.
Published: (2025)

Revisiting the UID Hypothesis in LLM Reasoning Traces
by: Gwak, Minju, et al.
Published: (2025)

Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context
by: Choi, Dasol, et al.
Published: (2025)

Controlling Language Confusion in Multilingual LLMs
by: Lee, Nahyun, et al.
Published: (2025)

KMMLU: Measuring Massive Multitask Language Understanding in Korean
by: Son, Guijin, et al.
Published: (2024)

VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
by: Waheed, Abdul, et al.
Published: (2025)

Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education
by: Kim, Seungyoon, et al.
Published: (2024)

FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS
by: Kim, Chaeeun, et al.
Published: (2025)

Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation
by: Lee, Seongyun, et al.
Published: (2024)

An Analysis of Multilingual FActScore
by: Vu, Kim Trong, et al.
Published: (2024)

Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards
by: Hwang, Hyeonbin, et al.
Published: (2024)

84‐4: Achieving Low Chroma Edges in Curved Cover Glass with Anti‐Reflection and Anti‐Scratch Properties
by: Yoon, Juyoung, et al.
Published: (2025)

M-RewardBench: Evaluating Reward Models in Multilingual Settings
by: Gureja, Srishti, et al.
Published: (2024)

On the Robustness of Reward Models for Language Model Alignment
by: Hong, Jiwoo, et al.
Published: (2025)

Evaluating Language Models as Synthetic Data Generators
by: Kim, Seungone, et al.
Published: (2024)

Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
by: Lee, Nahyun, et al.
Published: (2026)

Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
by: Kim, Seungone, et al.
Published: (2024)

ESG Classification by Implicit Rule Learning via GPT-4
by: Yun, Hyo Jeong, et al.
Published: (2024)

MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
by: Tamayo, Daniel, et al.
Published: (2026)

Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection
by: Kim, Dongkeun, et al.
Published: (2025)

ADOLESCENCIA Y DROGAS
by: Aula, Yalltza
Published: (2011)

La diligencia debida como herramienta de prevención del conflicto en la República Democrática del Congo
by: Aula, Ilari
Published: (2020)

From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
by: Hong, Seokhee, et al.
Published: (2025)

Improving Fine-grained Visual Understanding in VLMs through Text-Only Training
by: Choi, Dasol, et al.
Published: (2024)

Multi-Step Reasoning in Korean and the Emergent Mirage
by: Son, Guijin, et al.
Published: (2025)

MOMEMTO: Patch-based Memory Gate Model in Time Series Foundation Model
by: Yoon, Samuel, et al.
Published: (2025)

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
by: Kim, Seungone, et al.
Published: (2024)

Online Temporal Action Localization with Memory-Augmented Transformer
by: Song, Youngkil, et al.
Published: (2024)

Towards More Practical Group Activity Detection: A New Benchmark and Model
by: Kim, Dongkeun, et al.
Published: (2023)

CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
by: Lee, Yukyung, et al.
Published: (2024)

CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean
by: Kim, Eunsu, et al.
Published: (2024)

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
by: Gwak, Minju, et al.
Published: (2026)