Bibliographic Details
Main Authors: Vodrahalli, Kiran, Ontanon, Santiago, Tripuraneni, Nilesh, Xu, Kelvin, Jain, Sanil, Shivanna, Rakesh, Hui, Jeffrey, Dikkala, Nishanth, Kazemi, Mehran, Fatemi, Bahare, Anil, Rohan, Dyer, Ethan, Shakeri, Siamak, Vij, Roopali, Mehta, Harsh, Ramasesh, Vinay, Le, Quoc, Chi, Ed, Lu, Yifeng, Firat, Orhan, Lazaridou, Angeliki, Lespiau, Jean-Baptiste, Attaluri, Nithya, Olszewska, Kate
Format: Preprint
Published: 2024
Subjects: Computation and Language, Machine Learning
Online Access: https://arxiv.org/abs/2409.12640
author Vodrahalli, Kiran
Ontanon, Santiago
Tripuraneni, Nilesh
Xu, Kelvin
Jain, Sanil
Shivanna, Rakesh
Hui, Jeffrey
Dikkala, Nishanth
Kazemi, Mehran
Fatemi, Bahare
Anil, Rohan
Dyer, Ethan
Shakeri, Siamak
Vij, Roopali
Mehta, Harsh
Ramasesh, Vinay
Le, Quoc
Chi, Ed
Lu, Yifeng
Firat, Orhan
Lazaridou, Angeliki
Lespiau, Jean-Baptiste
Attaluri, Nithya
Olszewska, Kate
contents We introduce Michelangelo: a minimal, synthetic, and unleaked long-context reasoning evaluation for large language models that is also easy to score automatically. This evaluation is derived from a novel, unifying framework for evaluations over arbitrarily long contexts that measure a model's ability to do more than retrieve a single piece of information from its context. The central idea of the Latent Structure Queries (LSQ) framework is to construct tasks that require a model to "chisel away" the irrelevant information in the context, revealing a latent structure. To verify a model's understanding of this latent structure, we query the model for details of the structure. Using LSQ, we produce three diagnostic long-context evaluations across code and natural-language domains, intended to provide a stronger signal of long-context language model capabilities. We evaluate several state-of-the-art models and demonstrate that (a) the proposed evaluations are high-signal and (b) there is significant room for improvement in synthesizing long-context information.
format Preprint
id arxiv_https___arxiv_org_abs_2409_12640
institution arXiv
publishDate 2024
record_format arxiv
title Michelangelo: Long Context Evaluations Beyond Haystacks via Latent Structure Queries
topic Computation and Language
Machine Learning
url https://arxiv.org/abs/2409.12640
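
The abstract describes the Latent Structure Queries idea only at a high level: bury a latent structure among irrelevant context, then query the model about that structure. The following is a minimal sketch of that idea, not the paper's actual tasks or code. The function names (`make_latent_list_task`, `score`), the choice of a Python list as the latent structure, and the use of dummy assignments as irrelevant context are all assumptions made for illustration.

```python
import random


def make_latent_list_task(num_relevant=5, num_distractors=20, seed=0):
    """Build a hypothetical LSQ-style task (illustrative only).

    The latent structure is a Python list named `items`. Lines that modify
    it are scattered among distractor lines that never touch it, so a model
    must ignore the distractors to track the final state of the list.
    """
    rng = random.Random(seed)

    latent = []    # ground-truth state of the latent list
    relevant = []  # source lines that change the latent list
    for _ in range(num_relevant):
        if latent and rng.random() < 0.3:
            relevant.append("items.pop()")
            latent.pop()
        else:
            value = rng.randint(0, 99)
            relevant.append(f"items.append({value})")
            latent.append(value)

    # Distractors: assignments to unrelated names (assumed filler format).
    distractors = [f"unused_{i} = {i}" for i in range(num_distractors)]

    # Choose slots for the relevant lines while preserving their order.
    total = num_relevant + num_distractors
    slots = set(rng.sample(range(total), num_relevant))
    rel_iter, dis_iter = iter(relevant), iter(distractors)
    body = [next(rel_iter) if i in slots else next(dis_iter) for i in range(total)]

    context = "\n".join(["items = []"] + body)
    query = "What is len(items) after running the code above?"
    return context, query, len(latent)


def score(model_answer, expected):
    """Automatic scoring: exact match on the integer answer."""
    try:
        return int(model_answer.strip()) == expected
    except ValueError:
        return False


if __name__ == "__main__":
    context, query, expected = make_latent_list_task()
    print(context)
    print(query)
    print("expected answer:", expected)
```

Raising `num_distractors` lengthens the context without changing the latent structure, which is one way such a task could be scaled to arbitrary context lengths while remaining trivially checkable.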