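| Main Authors: | Vodrahalli, Kiran; Ontanon, Santiago; Tripuraneni, Nilesh; Xu, Kelvin; Jain, Sanil; Shivanna, Rakesh; Hui, Jeffrey; Dikkala, Nishanth; Kazemi, Mehran; Fatemi, Bahare; Anil, Rohan; Dyer, Ethan; Shakeri, Siamak; Vij, Roopali; Mehta, Harsh; Ramasesh, Vinay; Le, Quoc; Chi, Ed; Lu, Yifeng; Firat, Orhan; Lazaridou, Angeliki; Lespiau, Jean-Baptiste; Attaluri, Nithya; Olszewska, Kate |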
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computation and Language; Machine Learning |
| Online Access: | https://arxiv.org/abs/2409.12640 |
| _version_ | 1866912036231839744 |
|---|---|
| author | Vodrahalli, Kiran; Ontanon, Santiago; Tripuraneni, Nilesh; Xu, Kelvin; Jain, Sanil; Shivanna, Rakesh; Hui, Jeffrey; Dikkala, Nishanth; Kazemi, Mehran; Fatemi, Bahare; Anil, Rohan; Dyer, Ethan; Shakeri, Siamak; Vij, Roopali; Mehta, Harsh; Ramasesh, Vinay; Le, Quoc; Chi, Ed; Lu, Yifeng; Firat, Orhan; Lazaridou, Angeliki; Lespiau, Jean-Baptiste; Attaluri, Nithya; Olszewska, Kate |
| contents | We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models that is also easy to score automatically. This evaluation is derived via a novel, unifying framework for evaluations over arbitrarily long contexts which measure the model's ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries (LSQ) framework is to construct tasks that require a model to "chisel away" the irrelevant information in the context, revealing a latent structure in the context. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains intended to provide a stronger signal of long-context language model capabilities. We perform evaluations on several state-of-the-art models and demonstrate that a) the proposed evaluations are high-signal and b) there is significant room for improvement in synthesizing long-context information. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2409_12640 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries |
| topic | Computation and Language; Machine Learning |
| url | https://arxiv.org/abs/2409.12640 |