Bibliographic Details
Main Authors: De la Jara, Ignacio M., Rodriguez-Opazo, Cristian, Marrese-Taylor, Edison, Bravo-Marquez, Felipe
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2510.17007
author De la Jara, Ignacio M.
Rodriguez-Opazo, Cristian
Marrese-Taylor, Edison
Bravo-Marquez, Felipe
contents Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It plays a key role in the scientific community, in part due to the large amount of video generated every day. Although there is extensive work on this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model when simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.
format Preprint
id arxiv_https___arxiv_org_abs_2510_17007
institution arXiv
publishDate 2025
record_format arxiv
title An empirical study of the effect of video encoders on Temporal Video Grounding
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2510.17007