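| Main Authors: | Xu, Moucheng; Chatzaroulas, Evangelos; McCutcheon, Luc; Ahad, Abdul; Azeem, Hamzah; Marecki, Janusz; Anwar, Ammar |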
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2409.15867 |
| _version_ | 1866910657724547072 |
|---|---|
| author | Xu, Moucheng; Chatzaroulas, Evangelos; McCutcheon, Luc; Ahad, Abdul; Azeem, Hamzah; Marecki, Janusz; Anwar, Ammar |
| contents | A Standard Operating Procedure (SOP) defines a low-level, step-by-step written guide for a business software workflow. SOP generation is a crucial step towards automating end-to-end software workflows, but creating SOPs manually can be time-consuming. Recent advances in large video-language models offer the potential to automate SOP generation by analyzing recordings of human demonstrations. However, current large video-language models struggle with zero-shot SOP generation. In this work, we first explore in-context learning with video-language models for SOP generation. We then propose an exploration-focused strategy, In-Context Ensemble Learning, which aggregates pseudo labels of multiple possible paths of SOPs. The proposed in-context ensemble learning also enables the models to learn beyond their context window limit with an implicit consistency regularisation. We report that in-context learning helps video-language models generate more temporally accurate SOPs, and that the proposed in-context ensemble learning consistently enhances the capabilities of video-language models in SOP generation. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2409_15867 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | In-Context Ensemble Learning from Pseudo Labels Improves Video-Language Models for Low-Level Workflow Understanding |
| topic | Artificial Intelligence |
| url | https://arxiv.org/abs/2409.15867 |