MARC View :: Library Catalog

Bibliographic Details
Main Authors:	Benavent-Lledo, Manuel; Bacharidis, Konstantinos; Papoutsakis, Konstantinos; Argyros, Antonis; Garcia-Rodriguez, Jose
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.22039

_version_	1866917232874881024
author	Benavent-Lledo, Manuel; Bacharidis, Konstantinos; Papoutsakis, Konstantinos; Argyros, Antonis; Garcia-Rodriguez, Jose
author_facet	Benavent-Lledo, Manuel; Bacharidis, Konstantinos; Papoutsakis, Konstantinos; Argyros, Antonis; Garcia-Rodriguez, Jose
contents	Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_22039
institution	arXiv
publishDate	2026
record_format	arxiv
title	Understanding Multimodal Complementarity for Single-Frame Action Anticipation
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2601.22039
