Bibliographic Details
Main Authors: Wornow, Michael; Narayan, Avanika; Viggiano, Ben; Khare, Ishan S.; Verma, Tathagat; Thompson, Tibor; Hernandez, Miguel Angel Fuentes; Sundar, Sudharsan; Trujillo, Chloe; Chawla, Krrish; Lu, Rongfei; Shen, Justin; Nagaraj, Divya; Martinez, Joshua; Agrawal, Vardhan; Hudson, Althea; Shah, Nigam H.; Re, Christopher
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2406.13264
author Wornow, Michael
Narayan, Avanika
Viggiano, Ben
Khare, Ishan S.
Verma, Tathagat
Thompson, Tibor
Hernandez, Miguel Angel Fuentes
Sundar, Sudharsan
Trujillo, Chloe
Chawla, Krrish
Lu, Rongfei
Shen, Justin
Nagaraj, Divya
Martinez, Joshua
Agrawal, Vardhan
Hudson, Althea
Shah, Nigam H.
Re, Christopher
contents Existing ML benchmarks lack the depth and diversity of annotations needed for evaluating models on business process management (BPM) tasks. BPM is the practice of documenting, measuring, improving, and automating enterprise workflows. However, research has focused almost exclusively on one task - full end-to-end automation using agents based on multimodal foundation models (FMs) like GPT-4. This focus on automation ignores the reality of how most BPM tools are applied today - simply documenting the relevant workflow takes 60% of the time of the typical process optimization project. To address this gap we present WONDERBREAD, the first benchmark for evaluating multimodal FMs on BPM tasks beyond automation. Our contributions are: (1) a dataset containing 2928 documented workflow demonstrations; (2) 6 novel BPM tasks sourced from real-world applications ranging from workflow documentation to knowledge transfer to process improvement; and (3) an automated evaluation harness. Our benchmark shows that while state-of-the-art FMs can automatically generate documentation (e.g. recalling 88% of the steps taken in a video demonstration of a workflow), they struggle to re-apply that knowledge towards finer-grained validation of workflow completion (F1 < 0.3). We hope WONDERBREAD encourages the development of more "human-centered" AI tooling for enterprise applications and furthers the exploration of multimodal FMs for the broader universe of BPM tasks. We publish our dataset and experiments here: https://github.com/HazyResearch/wonderbread
format Preprint
id arxiv_https___arxiv_org_abs_2406_13264
institution arXiv
publishDate 2024
record_format arxiv
title WONDERBREAD: A Benchmark for Evaluating Multimodal Foundation Models on Business Process Management Tasks
topic Artificial Intelligence
Machine Learning
Software Engineering
url https://arxiv.org/abs/2406.13264