MARC21: :: Library Catalog

Salvato in:

Dettagli Bibliografici
Autori principali:	Li, Wei, Hu, Bing, Shao, Rui, Shen, Leyang, Nie, Liqiang
Natura:	Preprint
Pubblicazione:	2025
Soggetti:	Computer Vision and Pattern Recognition
Accesso online:	https://arxiv.org/abs/2503.03663
Tags:	Aggiungi Tag Nessun Tag, puoi essere il primo ad aggiungerne!!

_version_	1866909526931800064
author	Li, Wei Hu, Bing Shao, Rui Shen, Leyang Nie, Liqiang
author_facet	Li, Wei Hu, Bing Shao, Rui Shen, Leyang Nie, Liqiang
contents	First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features.To overcome the trade-off between efficacy and efficiency, we propose "Fast & Slow Video-Language Thinker" as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1)Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features. 2)Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
format	Preprint
id	arxiv_https___arxiv_org_abs_2503_03663
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant Li, Wei Hu, Bing Shao, Rui Shen, Leyang Nie, Liqiang Computer Vision and Pattern Recognition First-person video assistants are highly anticipated to enhance our daily lives through online video dialogue. However, existing online video assistants often sacrifice assistant efficacy for real-time efficiency by processing low-frame-rate videos with coarse-grained visual features.To overcome the trade-off between efficacy and efficiency, we propose "Fast & Slow Video-Language Thinker" as an onLIne videO assistaNt, LION-FS, achieving real-time, proactive, temporally accurate, and contextually precise responses. LION-FS adopts a two-stage optimization strategy: 1)Fast Path: Routing-Based Response Determination evaluates frame-by-frame whether an immediate response is necessary. To enhance response determination accuracy and handle higher frame-rate inputs efficiently, we employ Token Aggregation Routing to dynamically fuse spatiotemporal features without increasing token numbers, while utilizing Token Dropping Routing to eliminate redundant features. 2)Slow Path: Multi-granularity Keyframe Augmentation optimizes keyframes during response generation. To provide comprehensive and detailed responses beyond atomic actions constrained by training data, fine-grained spatial features and human-environment interaction features are extracted through multi-granular pooling. These features are further integrated into a meticulously designed multimodal Thinking Template to guide more precise response generation. Comprehensive evaluations on online video tasks demonstrate that LION-FS achieves state-of-the-art efficacy and efficiency.
title	LION-FS: Fast & Slow Video-Language Thinker as Online Video Assistant
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2503.03663

Documenti analoghi