Staff View: An empirical study of the effect of video encoders on Temporal Video Grounding :: Library Catalog

Bibliographic Details
Main Authors:	De la Jara, Ignacio M.; Rodriguez-Opazo, Cristian; Marrese-Taylor, Edison; Bravo-Marquez, Felipe
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2510.17007

_version_	1866911220411400192
author	De la Jara, Ignacio M.; Rodriguez-Opazo, Cristian; Marrese-Taylor, Edison; Bravo-Marquez, Felipe
author_facet	De la Jara, Ignacio M.; Rodriguez-Opazo, Cristian; Marrese-Taylor, Edison; Bravo-Marquez, Felipe
contents	Temporal video grounding is a fundamental task in computer vision, aiming to localize a natural language query in a long, untrimmed video. It has a key role in the scientific community, in part due to the large amount of video generated every day. Although we find extensive work in this task, we note that research remains focused on a small selection of video representations, which may lead to architectural overfitting in the long run. To address this issue, we propose an empirical study to investigate the impact of different video features on a classical architecture. We extract features for three well-known benchmarks, Charades-STA, ActivityNet-Captions and YouCookII, using video encoders based on CNNs, temporal reasoning and transformers. Our results show significant differences in the performance of our model by simply changing the video encoder, while also revealing clear patterns and errors derived from the use of certain features, ultimately indicating potential feature complementarity.
format	Preprint
id	arxiv_https___arxiv_org_abs_2510_17007
institution	arXiv
publishDate	2025
record_format	arxiv
title	An empirical study of the effect of video encoders on Temporal Video Grounding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2510.17007