| Main Authors: | Son, Guijin, Yoon, Dongkeun, Suk, Juyoung, Aula-Blasco, Javier, Aslan, Mano, Kim, Vu Trong, Islam, Shayekh Bin, Prats-Cristià, Jaume, Tormo-Bañuelos, Lucía, Kim, Seungone |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2410.17578 |
Similar Items
LangBridge: Multilingual Reasoning Without Multilingual Supervision
by: Yoon, Dongkeun, et al.
Published: (2024)
M-Prometheus: A Suite of Open Multilingual LLM Judges
by: Pombal, José, et al.
Published: (2025)
LLM-as-a-Judge & Reward Model: What They Can and Cannot Do
by: Son, Guijin, et al.
Published: (2024)
LLM-as-an-Interviewer: Beyond Static Testing Through Dynamic LLM Evaluation
by: Kim, Eunsu, et al.
Published: (2024)
Multi-Task Inference: Can Large Language Models Follow Multiple Instructions at Once?
by: Son, Guijin, et al.
Published: (2024)
K-BrowseComp: A Web Browsing Agent Benchmark Grounded in Korean Contexts
by: Lee, Nahyun, et al.
Published: (2026)
Reasoning Models Better Express Their Confidence
by: Yoon, Dongkeun, et al.
Published: (2025)
Understand, Solve and Translate: Bridging the Multilingual Mathematical Reasoning Gap
by: Ko, Hyunwoo, et al.
Published: (2025)
Revisiting the Uniform Information Density Hypothesis in LLM Reasoning
by: Gwak, Minju, et al.
Published: (2025)
Revisiting the UID Hypothesis in LLM Reasoning Traces
by: Gwak, Minju, et al.
Published: (2025)
Ko-PIQA: A Korean Physical Commonsense Reasoning Dataset with Cultural Context
by: Choi, Dasol, et al.
Published: (2025)
Controlling Language Confusion in Multilingual LLMs
by: Lee, Nahyun, et al.
Published: (2025)
KMMLU: Measuring Massive Multitask Language Understanding in Korean
by: Son, Guijin, et al.
Published: (2024)
VideoJudge: Bootstrapping Enables Scalable Supervision of MLLM-as-a-Judge for Video Understanding
by: Waheed, Abdul, et al.
Published: (2025)
Can Language Models Evaluate Human Written Text? Case Study on Korean Student Writing for Education
by: Kim, Seungyoon, et al.
Published: (2024)
FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS
by: Kim, Chaeeun, et al.
Published: (2025)
Prometheus-Vision: Vision-Language Model as a Judge for Fine-Grained Evaluation
by: Lee, Seongyun, et al.
Published: (2024)
An Analysis of Multilingual FActScore
by: Vu, Kim Trong, et al.
Published: (2024)
Self-Explore: Enhancing Mathematical Reasoning in Language Models with Fine-grained Rewards
by: Hwang, Hyeonbin, et al.
Published: (2024)
84‐4: Achieving Low Chroma Edges in Curved Cover Glass with Anti‐Reflection and Anti‐Scratch Properties
by: Juyoung Yoon, et al.
Published: (2025)
M-RewardBench: Evaluating Reward Models in Multilingual Settings
by: Gureja, Srishti, et al.
Published: (2024)
On the Robustness of Reward Models for Language Model Alignment
by: Hong, Jiwoo, et al.
Published: (2025)
Evaluating Language Models as Synthetic Data Generators
by: Kim, Seungone, et al.
Published: (2024)
Pushing the Boundaries of Multiple Choice Evaluation to One Hundred Options
by: Lee, Nahyun, et al.
Published: (2026)
Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models
by: Kim, Seungone, et al.
Published: (2024)
ESG Classification by Implicit Rule Learning via GPT-4
by: Yun, Hyo Jeong, et al.
Published: (2024)
MrBERT: Modern Multilingual Encoders via Vocabulary, Domain, and Dimensional Adaptation
by: Tamayo, Daniel, et al.
Published: (2026)
Part-Aware Bottom-Up Group Reasoning for Fine-Grained Social Interaction Detection
by: Kim, Dongkeun, et al.
Published: (2025)
ADOLESCENCIA Y DROGAS
by: Yalltza Aula
Published: (2011)
La diligencia debida como herramienta de prevención del conflicto en la República Democrática del Congo
by: Ilari Aula
Published: (2020)
From KMMLU-Redux to KMMLU-Pro: A Professional Korean Benchmark Suite for LLM Evaluation
by: Hong, Seokhee, et al.
Published: (2025)
Improving Fine-grained Visual Understanding in VLMs through Text-Only Training
by: Choi, Dasol, et al.
Published: (2024)
Multi-Step Reasoning in Korean and the Emergent Mirage
by: Son, Guijin, et al.
Published: (2025)
MOMEMTO: Patch-based Memory Gate Model in Time Series Foundation Model
by: Yoon, Samuel, et al.
Published: (2025)
The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models
by: Kim, Seungone, et al.
Published: (2024)
Online Temporal Action Localization with Memory-Augmented Transformer
by: Song, Youngkil, et al.
Published: (2024)
Towards More Practical Group Activity Detection: A New Benchmark and Model
by: Kim, Dongkeun, et al.
Published: (2023)
CheckEval: A reliable LLM-as-a-Judge framework for evaluating text generation using checklists
by: Lee, Yukyung, et al.
Published: (2024)
CLIcK: A Benchmark Dataset of Cultural and Linguistic Intelligence in Korean
by: Kim, Eunsu, et al.
Published: (2024)
LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training
by: Gwak, Minju, et al.
Published: (2026)