Staff View: MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference :: Library Catalog

Bibliographic Details
Main Authors:	Zgreabăn, Mădălina, Deoskar, Tejaswini, Abzianidze, Lasha
Format:	Preprint
Published:	2025
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2510.24295

_version_	1866912673966325760
author	Zgreabăn, Mădălina Deoskar, Tejaswini Abzianidze, Lasha
author_facet	Zgreabăn, Mădălina Deoskar, Tejaswini Abzianidze, Lasha
contents	In recent years, many generalization benchmarks have shown language models' lack of robustness in natural language inference (NLI). However, manually creating new benchmarks is costly, while automatically generating high-quality ones, even by modifying existing benchmarks, is extremely difficult. In this paper, we propose a methodology for automatically generating high-quality variants of original NLI problems by replacing open-class words, while crucially preserving their underlying reasoning. We dub our generalization test MERGE (Minimal Expression-Replacement GEneralization), which evaluates the correctness of models' predictions across reasoning-preserving variants of the original problem. Our results show that NLI models perform 4-20% worse on variants, suggesting low generalizability even on such minimally altered problems. We also analyse how the word class of the replacements, word probability, and plausibility influence NLI models' performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_24295
institution	arXiv
publishDate	2025
record_format	arxiv
title	MERGE: Minimal Expression-Replacement GEneralization Test for Natural Language Inference
topic	Computation and Language
url	https://arxiv.org/abs/2510.24295

Similar Items