Bibliographic Details
Main Authors: Benavent-Lledo, Manuel, Bacharidis, Konstantinos, Papoutsakis, Konstantinos, Argyros, Antonis, Garcia-Rodriguez, Jose
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2601.22039
author Benavent-Lledo, Manuel
Bacharidis, Konstantinos
Papoutsakis, Konstantinos
Argyros, Antonis
Garcia-Rodriguez, Jose
author_facet Benavent-Lledo, Manuel
Bacharidis, Konstantinos
Papoutsakis, Konstantinos
Argyros, Antonis
Garcia-Rodriguez, Jose
contents Human action anticipation is commonly treated as a video understanding problem, implicitly assuming that dense temporal information is required to reason about future actions. In this work, we challenge this assumption by investigating what can be achieved when action anticipation is constrained to a single visual observation. We ask a fundamental question: how much information about the future is already encoded in a single frame, and how can it be effectively exploited? Building on our prior work on Action Anticipation at a Glimpse (AAG), we conduct a systematic investigation of single-frame action anticipation enriched with complementary sources of information. We analyze the contribution of RGB appearance, depth-based geometric cues, and semantic representations of past actions, and investigate how different multimodal fusion strategies, keyframe selection policies and past-action history sources influence anticipation performance. Guided by these findings, we consolidate the most effective design choices into AAG+, a refined single-frame anticipation framework. Despite operating on a single frame, AAG+ consistently improves upon the original AAG and achieves performance comparable to, or exceeding, that of state-of-the-art video-based methods on challenging anticipation benchmarks including IKEA-ASM, Meccano and Assembly101. Our results offer new insights into the limits and potential of single-frame action anticipation, and clarify when dense temporal modeling is necessary and when a carefully selected glimpse is sufficient.
format Preprint
id arxiv_https___arxiv_org_abs_2601_22039
institution arXiv
publishDate 2026
record_format arxiv
title Understanding Multimodal Complementarity for Single-Frame Action Anticipation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.22039