| Main Authors: | Canaverde, Beatriz; Alves, Duarte M.; Pombal, José; Attanasio, Giuseppe; Martins, André F. T. |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2605.06353 |
Similar Items
LegalBench.PT: A Benchmark for Portuguese Law
by: Canaverde, Beatriz, et al.
Published: (2025)
Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
by: Pombal, José, et al.
Published: (2025)
Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
by: Pombal, José, et al.
Published: (2026)
Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
by: Zaranis, Emmanouil, et al.
Published: (2024)
Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning
by: Attanasio, Giuseppe, et al.
Published: (2025)
Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
by: Pombal, José, et al.
Published: (2025)
A Context-aware Framework for Translation-mediated Conversations
by: Pombal, José, et al.
Published: (2024)
Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
by: Rei, Ricardo, et al.
Published: (2025)
MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
by: Pombal, José, et al.
Published: (2025)
MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
by: Teixeira, Tiago, et al.
Published: (2026)
Different Speech Translation Models Encode and Translate Speaker Gender Differently
by: Fucci, Dennis, et al.
Published: (2025)
Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
by: He, Yun, et al.
Published: (2024)
Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
by: Alves, Duarte M., et al.
Published: (2024)
Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
by: Ramos, Miguel Moura, et al.
Published: (2026)
MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
by: Sirdeshmukh, Ved, et al.
Published: (2025)
xTower: A Multilingual LLM for Explaining and Correcting Translation Errors
by: Treviso, Marcos, et al.
Published: (2024)
Building Bridges: A Dataset for Evaluating Gender-Fair Machine Translation into German
by: Lardelli, Manuel, et al.
Published: (2024)
EuroLLM: Multilingual Language Models for Europe
by: Martins, Pedro Henrique, et al.
Published: (2024)
M-Prometheus: A Suite of Open Multilingual LLM Judges
by: Pombal, José, et al.
Published: (2025)
AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese
by: Simplício, Afonso, et al.
Published: (2026)
Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
by: Bianchi, Federico, et al.
Published: (2023)
MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
by: Epstein, Elliot L., et al.
Published: (2024)
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
by: Jiang, Yuxin, et al.
Published: (2023)
Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps
by: Attanasio, Giuseppe, et al.
Published: (2024)
MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
by: Lee, Jaeyun, et al.
Published: (2026)
DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
by: Zhu, Ziyi, et al.
Published: (2025)
CFBench: A Comprehensive Constraints-Following Benchmark for LLMs
by: Zhang, Tao, et al.
Published: (2024)
Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
by: Zaranis, Emmanouil, et al.
Published: (2025)
EuroLLM-9B: Technical Report
by: Martins, Pedro Henrique, et al.
Published: (2025)
Classist Tools: Social Class Correlates with Performance in NLP
by: Curry, Amanda Cercas, et al.
Published: (2024)
WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints
by: Wang, Zexuan, et al.
Published: (2026)
LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
by: Zhu, Yu, et al.
Published: (2026)
TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models
by: Zhang, Yiran, et al.
Published: (2025)
DRBench: A Realistic Benchmark for Enterprise Deep Research
by: Abaskohi, Amirhossein, et al.
Published: (2025)
GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics
by: Filandrianos, Giorgos, et al.
Published: (2025)
One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
by: Jia, Qi, et al.
Published: (2025)
Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models
by: Sun, Yuchong, et al.
Published: (2023)
Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
by: Bisconti, Piercosma, et al.
Published: (2026)
EuroLLM-22B: Technical Report
by: Ramos, Miguel Moura, et al.
Published: (2026)
Benchmarking Complex Instruction-Following with Multiple Constraints Composition
by: Wen, Bosi, et al.
Published: (2024)