Bibliographic Details
Main Authors:	Imajo, Kentaro; Hirano, Masanori; Suzuki, Shuji; Mikami, Hiroaki
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2502.09316

Similar Items

Construction of Domain-specified Japanese Large Language Model for Finance through Continual Pre-training
by: Hirano, Masanori, et al.
Published: (2024)

The Construction of Instruction-tuned LLMs for Finance without Instruction Data Using Continual Pretraining and Model Merging
by: Hirano, Masanori, et al.
Published: (2024)

Financial Fine-tuning a Large Time Series Model
by: Fu, Xinghong, et al.
Published: (2024)

Enhancing Financial Domain Adaptation of Language Models via Model Augmentation
by: Tanabe, Kota, et al.
Published: (2024)

Construction of a Japanese Financial Benchmark for Large Language Models
by: Hirano, Masanori
Published: (2024)

Uncovering Residual Factors in Financial Time Series via PCA and MTP2-constrained Gaussian Graphical Models
by: Watanabe, Koshi, et al.
Published: (2026)

PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency
by: Preferred Elements, et al.
Published: (2024)

Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)

How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study
by: Chevi, Rendi, et al.
Published: (2025)

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
by: Tang, Zhenwei, et al.
Published: (2026)

JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
by: Bellibatlu, Rohith Reddy, et al.
Published: (2026)

VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
by: Matsuda, Kazuki, et al.
Published: (2025)

JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)

LCTG Bench: LLM Controlled Text Generation Benchmark
by: Kurihara, Kentaro, et al.
Published: (2025)

Attribution Quality in AI-Generated Content: Benchmarking Style Embeddings and LLM Judges
by: Abbas, Misam
Published: (2025)

Retcon -- a Prompt-Based Technique for Precise Control of LLMs in Conversations
by: Kogan, David, et al.
Published: (2026)

MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
by: Macina, Jakub, et al.
Published: (2025)

M-Prometheus: A Suite of Open Multilingual LLM Judges
by: Pombal, José, et al.
Published: (2025)

Improving LLM-as-a-Judge Inference with the Judgment Distribution
by: Wang, Victor, et al.
Published: (2025)

CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)

Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)

PLaMo 2 Technical Report
by: Preferred Networks, et al.
Published: (2025)

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)

Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
by: Sternlicht, Noy, et al.
Published: (2025)

SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?
by: Chen, Jiamin, et al.
Published: (2026)

MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
by: Chen, Dongping, et al.
Published: (2024)

R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
by: Yuan, Tongxin, et al.
Published: (2024)

CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
by: Yang, Jingbo, et al.
Published: (2026)

The African Woman is Rhythmic and Soulful: An Investigation of Implicit Biases in LLM Open-ended Text Generation
by: Lim, Serene, et al.
Published: (2024)

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
by: Jin, Jiho, et al.
Published: (2026)

MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
by: Son, Guijin, et al.
Published: (2024)

An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)

Learning Personalized Alignment for Evaluating Open-ended Text Generation
by: Wang, Danqing, et al.
Published: (2023)

Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)

HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
by: Yang, Langqi, et al.
Published: (2025)

Reference-free Evaluation Metrics for Text Generation: A Survey
by: Ito, Takumi, et al.
Published: (2025)

JuStRank: Benchmarking LLM Judges for System Ranking
by: Gera, Ariel, et al.
Published: (2024)

The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
by: Marioriyad, Arash, et al.
Published: (2025)

FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge
by: Yang, Bo, et al.
Published: (2026)

Improve LLM-as-a-Judge Ability as a General Ability
by: Yu, Jiachen, et al.
Published: (2025)