Staff View: :: Library Catalog


Bibliographic Details
Main Authors:	Jin, Emily; Huang, Zhuoyi; Fränken, Jan-Philipp; Liu, Weiyu; Cha, Hannah; Brockbank, Erik; Wu, Sarah; Zhang, Ruohan; Wu, Jiajun; Gerstenberg, Tobias
Format:	Preprint
Published:	2024
Subjects:	Machine Learning
Online Access:	https://arxiv.org/abs/2410.01926

_version_	1866912055883202560
author	Jin, Emily; Huang, Zhuoyi; Fränken, Jan-Philipp; Liu, Weiyu; Cha, Hannah; Brockbank, Erik; Wu, Sarah; Zhang, Ruohan; Wu, Jiajun; Gerstenberg, Tobias
author_facet	Jin, Emily; Huang, Zhuoyi; Fränken, Jan-Philipp; Liu, Weiyu; Cha, Hannah; Brockbank, Erik; Wu, Sarah; Zhang, Ruohan; Wu, Jiajun; Gerstenberg, Tobias
contents	Reconstructing past events requires reasoning across long time horizons. To figure out what happened, we need to use our prior knowledge about the world and human behavior and draw inferences from various sources of evidence including visual, language, and auditory cues. We introduce MARPLE, a benchmark for evaluating long-horizon inference capabilities using multi-modal evidence. Our benchmark features agents interacting with simulated households, supporting vision, language, and auditory stimuli, as well as procedurally generated environments and agent behaviors. Inspired by classic "whodunit" stories, we ask AI models and human participants to infer which agent caused a change in the environment based on a step-by-step replay of what actually happened. The goal is to correctly identify the culprit as early as possible. Our findings show that human participants outperform both traditional Monte Carlo simulation methods and an LLM baseline (GPT-4) on this task. Compared to humans, traditional inference models are less robust and performant, while GPT-4 has difficulty comprehending environmental changes. We analyze what factors influence inference performance and ablate different modes of evidence, finding that all modes are valuable for performance. Overall, our experiments demonstrate that the long-horizon, multimodal inference tasks in our benchmark present a challenge to current models.
format	Preprint
id	arxiv_https___arxiv_org_abs_2410_01926
institution	arXiv
publishDate	2024
record_format	arxiv
title	MARPLE: A Benchmark for Long-Horizon Inference
topic	Machine Learning
url	https://arxiv.org/abs/2410.01926
