:: Library Catalog

Bibliographic Details
Main Authors:	Canaverde, Beatriz; Alves, Duarte M.; Pombal, José; Attanasio, Giuseppe; Martins, André F. T.
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2605.06353

Similar Items

LegalBench.PT: A Benchmark for Portuguese Law
by: Canaverde, Beatriz, et al.
Published: (2025)

Zero-shot Benchmarking: A Framework for Flexible and Scalable Automatic Evaluation of Language Models
by: Pombal, José, et al.
Published: (2025)

Self-Preference Bias in Rubric-Based Evaluation of Large Language Models
by: Pombal, José, et al.
Published: (2026)

Watching the Watchers: Exposing Gender Disparities in Machine Translation Quality Estimation
by: Zaranis, Emmanouil, et al.
Published: (2024)

Instituto de Telecomunicações at IWSLT 2025: Aligning Small-Scale Speech and Language Models for Speech-to-Text Learning
by: Attanasio, Giuseppe, et al.
Published: (2025)

Adding Chocolate to Mint: Mitigating Metric Interference in Machine Translation
by: Pombal, José, et al.
Published: (2025)

A Context-aware Framework for Translation-mediated Conversations
by: Pombal, José, et al.
Published: (2024)

Tower+: Bridging Generality and Translation Specialization in Multilingual LLMs
by: Rei, Ricardo, et al.
Published: (2025)

MindEval: Benchmarking Language Models on Multi-turn Mental Health Support
by: Pombal, José, et al.
Published: (2025)

MATH-PT: A Math Reasoning Benchmark for European and Brazilian Portuguese
by: Teixeira, Tiago, et al.
Published: (2026)

Different Speech Translation Models Encode and Translate Speaker Gender Differently
by: Fucci, Dennis, et al.
Published: (2025)

Multi-IF: Benchmarking LLMs on Multi-Turn and Multilingual Instructions Following
by: He, Yun, et al.
Published: (2024)

Tower: An Open Multilingual Large Language Model for Translation-Related Tasks
by: Alves, Duarte M., et al.
Published: (2024)

Combining On-Policy Optimization and Distillation for Long-Context Reasoning in Large Language Models
by: Ramos, Miguel Moura, et al.
Published: (2026)

MultiChallenge: A Realistic Multi-Turn Conversation Evaluation Benchmark Challenging to Frontier LLMs
by: Sirdeshmukh, Ved, et al.
Published: (2025)

xTower: A Multilingual LLM for Explaining and Correcting Translation Errors
by: Treviso, Marcos, et al.
Published: (2024)

Building Bridges: A Dataset for Evaluating Gender-Fair Machine Translation into German
by: Lardelli, Manuel, et al.
Published: (2024)

EuroLLM: Multilingual Language Models for Europe
by: Martins, Pedro Henrique, et al.
Published: (2024)

M-Prometheus: A Suite of Open Multilingual LLM Judges
by: Pombal, José, et al.
Published: (2025)

AMALIA Technical Report: A Fully Open Source Large Language Model for European Portuguese
by: Simplício, Afonso, et al.
Published: (2026)

Safety-Tuned LLaMAs: Lessons From Improving the Safety of Large Language Models that Follow Instructions
by: Bianchi, Federico, et al.
Published: (2023)

MMMT-IF: A Challenging Multimodal Multi-Turn Instruction Following Benchmark
by: Epstein, Elliot L., et al.
Published: (2024)

FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
by: Jiang, Yuxin, et al.
Published: (2023)

Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps
by: Attanasio, Giuseppe, et al.
Published: (2024)

MCJudgeBench: A Benchmark for Constraint-Level Judge Evaluation in Multi-Constraint Instruction Following
by: Lee, Jaeyun, et al.
Published: (2026)

DIAL: Direct Iterative Adversarial Learning for Realistic Multi-Turn Dialogue Simulation
by: Zhu, Ziyi, et al.
Published: (2025)

CFBench: A Comprehensive Constraints-Following Benchmark for LLMs
by: Zhang, Tao, et al.
Published: (2024)

Movie Facts and Fibs (MF$^2$): A Benchmark for Long Movie Understanding
by: Zaranis, Emmanouil, et al.
Published: (2025)

EuroLLM-9B: Technical Report
by: Martins, Pedro Henrique, et al.
Published: (2025)

Classist Tools: Social Class Correlates with Performance in NLP
by: Curry, Amanda Cercas, et al.
Published: (2024)

WorldTravel: A Realistic Multimodal Travel-Planning Benchmark with Tightly Coupled Constraints
by: Wang, Zexuan, et al.
Published: (2026)

LLM-Driven Multi-Turn Task-Oriented Dialogue Synthesis for Realistic Reasoning
by: Zhu, Yu, et al.
Published: (2026)

TurnBench-MS: A Benchmark for Evaluating Multi-Turn, Multi-Step Reasoning in Large Language Models
by: Zhang, Yiran, et al.
Published: (2025)

DRBench: A Realistic Benchmark for Enterprise Deep Research
by: Abaskohi, Amirhossein, et al.
Published: (2025)

GAMBIT+: A Challenge Set for Evaluating Gender Bias in Machine Translation Quality Estimation Metrics
by: Filandrianos, Giorgos, et al.
Published: (2025)

One Battle After Another: Probing LLMs' Limits on Multi-Turn Instruction Following with a Benchmark Evolving Framework
by: Jia, Qi, et al.
Published: (2025)

Parrot: Enhancing Multi-Turn Instruction Following for Large Language Models
by: Sun, Yuchong, et al.
Published: (2023)

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety
by: Bisconti, Piercosma, et al.
Published: (2026)

EuroLLM-22B: Technical Report
by: Ramos, Miguel Moura, et al.
Published: (2026)

Benchmarking Complex Instruction-Following with Multiple Constraints Composition
by: Wen, Bosi, et al.
Published: (2024)