| Main Authors: | Imajo, Kentaro, Hirano, Masanori, Suzuki, Shuji, Mikami, Hiroaki |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.09316 |
Similar Items
Construction of Domain-specified Japanese Large Language Model for Finance through Continual Pre-training
by: Hirano, Masanori, et al.
Published: (2024)
The Construction of Instruction-tuned LLMs for Finance without Instruction Data Using Continual Pretraining and Model Merging
by: Hirano, Masanori, et al.
Published: (2024)
Financial Fine-tuning a Large Time Series Model
by: Fu, Xinghong, et al.
Published: (2024)
Enhancing Financial Domain Adaptation of Language Models via Model Augmentation
by: Tanabe, Kota, et al.
Published: (2024)
Construction of a Japanese Financial Benchmark for Large Language Models
by: Hirano, Masanori
Published: (2024)
Uncovering Residual Factors in Financial Time Series via PCA and MTP2-constrained Gaussian Graphical Models
by: Watanabe, Koshi, et al.
Published: (2026)
PLaMo-100B: A Ground-Up Language Model Designed for Japanese Proficiency
by: Preferred Elements, et al.
Published: (2024)
Who Judges the Judge? Evaluating LLM-as-a-Judge for French Medical open-ended QA
by: Belmadani, Ikram, et al.
Published: (2026)
How Individual Traits and Language Styles Shape Preferences In Open-ended User-LLM Interaction: A Preliminary Study
by: Chevi, Rendi, et al.
Published: (2025)
RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator
by: Tang, Zhenwei, et al.
Published: (2026)
JudgeSense: A Benchmark for Prompt Sensitivity in LLM-as-a-Judge Systems
by: Bellibatlu, Rohith Reddy, et al.
Published: (2026)
VELA: An LLM-Hybrid-as-a-Judge Approach for Evaluating Long Image Captions
by: Matsuda, Kazuki, et al.
Published: (2025)
JudgeBench: A Benchmark for Evaluating LLM-based Judges
by: Tan, Sijun, et al.
Published: (2024)
LCTG Bench: LLM Controlled Text Generation Benchmark
by: Kurihara, Kentaro, et al.
Published: (2025)
Attribution Quality in AI-Generated Content: Benchmarking Style Embeddings and LLM Judges
by: Abbas, Misam
Published: (2025)
Retcon -- a Prompt-Based Technique for Precise Control of LLMs in Conversations
by: Kogan, David, et al.
Published: (2026)
MathTutorBench: A Benchmark for Measuring Open-ended Pedagogical Capabilities of LLM Tutors
by: Macina, Jakub, et al.
Published: (2025)
M-Prometheus: A Suite of Open Multilingual LLM Judges
by: Pombal, José, et al.
Published: (2025)
Improving LLM-as-a-Judge Inference with the Judgment Distribution
by: Wang, Victor, et al.
Published: (2025)
CodeJudgeBench: Benchmarking LLM-as-a-Judge for Coding Tasks
by: Jiang, Hongchao, et al.
Published: (2025)
Benchmarking LLM-as-a-Judge for Long-Form Output Evaluation
by: Chen, Junjie, et al.
Published: (2026)
PLaMo 2 Technical Report
by: Preferred Networks, et al.
Published: (2025)
Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators
by: Zhou, Yilun, et al.
Published: (2025)
Debatable Intelligence: Benchmarking LLM Judges via Debate Speech Evaluation
by: Sternlicht, Noy, et al.
Published: (2025)
SEAL: Can Saturated Benchmarks Be Revived by LLM-as-a-Meta-Judge?
by: Chen, Jiamin, et al.
Published: (2026)
MLLM-as-a-Judge: Assessing Multimodal LLM-as-a-Judge with Vision-Language Benchmark
by: Chen, Dongping, et al.
Published: (2024)
R-Judge: Benchmarking Safety Risk Awareness for LLM Agents
by: Yuan, Tongxin, et al.
Published: (2024)
CompliBench: Benchmarking LLM Judges for Compliance Violation Detection in Dialogue Systems
by: Yang, Jingbo, et al.
Published: (2026)
The African Woman is Rhythmic and Soulful: An Investigation of Implicit Biases in LLM Open-ended Text Generation
by: Lim, Serene, et al.
Published: (2024)
JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors
by: Jin, Jiho, et al.
Published: (2026)
MM-Eval: A Multilingual Meta-Evaluation Benchmark for LLM-as-a-Judge and Reward Models
by: Son, Guijin, et al.
Published: (2024)
An Empirical Study of LLM-as-a-Judge for LLM Evaluation: Fine-tuned Judge Model is not a General Substitute for GPT-4
by: Huang, Hui, et al.
Published: (2024)
Learning Personalized Alignment for Evaluating Open-ended Text Generation
by: Wang, Danqing, et al.
Published: (2023)
Judging the Judges: A Systematic Study of Position Bias in LLM-as-a-Judge
by: Shi, Lin, et al.
Published: (2024)
HarmMetric Eval: Benchmarking Metrics and Judges for LLM Harmfulness Assessment
by: Yang, Langqi, et al.
Published: (2025)
Reference-free Evaluation Metrics for Text Generation: A Survey
by: Ito, Takumi, et al.
Published: (2025)
JuStRank: Benchmarking LLM Judges for System Ranking
by: Gera, Ariel, et al.
Published: (2024)
The Silent Judge: Unacknowledged Shortcut Bias in LLM-as-a-Judge
by: Marioriyad, Arash, et al.
Published: (2025)
FairJudge: An Adaptive, Debiased, and Consistent LLM-as-a-Judge
by: Yang, Bo, et al.
Published: (2026)
Improve LLM-as-a-Judge Ability as a General Ability
by: Yu, Jiachen, et al.
Published: (2025)