| Main Authors: | Bila, Natalia; Naszádi, Kata; Mayn, Alexandra; Monz, Christof |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computation and Language |
| Online Access: | https://arxiv.org/abs/2603.19997 |
| _version_ | 1866917354824269824 |
|---|---|
| author | Bila, Natalia; Naszádi, Kata; Mayn, Alexandra; Monz, Christof |
| author_facet | Bila, Natalia; Naszádi, Kata; Mayn, Alexandra; Monz, Christof |
| contents | We investigate the separation of literal interpretation from contextual inference in a collaborative block-building task where a builder must resolve underspecified instructions using contextual inferences. Building on an existing two-speaker psycholinguistic paradigm -- which contrasts a pragmatically cooperative speaker with one who is only literally reliable -- we introduce Build What I Mean (BWIM), an interactive benchmark for contextual meaning construction. In BWIM, models must resolve ambiguity by either performing a contextual inference or requesting clarification at a small communication cost. Evaluating several state-of-the-art LLMs, we find a dissociation between judgment and action: while models detect speaker unreliability in explicit confidence ratings, they fail to exploit this information to guide efficient clarification behavior. Instead, we observe suboptimal strategies, such as partner-blind over-clarification and question-averse guessing under uncertainty. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_19997 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | When Contextual Inference Fails: Cancelability in Interactive Instruction Following |
| topic | Computation and Language |
| url | https://arxiv.org/abs/2603.19997 |