Bibliographic Details
Main Authors: Li, Jiaqi; Zheng, Shuntian; Shen, Yixian; Huang, Jia-Hong; Lu, Xiaoman; Ni, Minzhe; Guan, Yu
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2603.05663
author Li, Jiaqi
Zheng, Shuntian
Shen, Yixian
Huang, Jia-Hong
Lu, Xiaoman
Ni, Minzhe
Guan, Yu
contents Video Temporal Grounding (VTG) localizes the temporal boundaries of a query-relevant moment in long, untrimmed videos, whose large visual token counts make video-language-model (VLM) pipelines prohibitively expensive. While recent training-free visual token pruning has shown success in video question answering, naively applying these objectives to VTG often causes drastic degradation, as VTG crucially depends on boundary-sensitive evidence and cross-frame reasoning chains. We therefore identify two VTG-specific pruning principles: Evidence Retention (ER), which keeps query-critical patches, especially around event boundaries, and Connectivity Strength (CS), which preserves token-level cross-frame connectivity for long-range evidence aggregation. Building on these insights, we propose SemVID, a training-free pruning framework that constructs a compact yet coherent token subset with complementary semantic roles. SemVID first allocates per-frame token budgets by balancing query relevance and inter-frame variation to avoid over-pruned segments, and then selects three types of tokens: object tokens for diverse query-critical evidence, motion tokens to capture meaningful transitions and serve as cross-frame relays, and a small set of context tokens for scene continuity. Extensive experiments on VTG benchmarks show that SemVID achieves a strong accuracy-efficiency trade-off, retaining up to 95.4% mIoU with only 12.5% visual tokens and delivering up to a 5.8x prefill speedup, consistently outperforming prior methods under the same budgets.
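The two-stage procedure described in the abstract (per-frame budget allocation, then selection of object, motion, and context tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mixing weight `alpha`, the per-frame floor `min_per_frame`, the role ratios, and the scoring inputs are all hypothetical, and the record gives no formulas for any of them. The sketch also ranks object tokens purely by relevance and does not model the diversity the abstract mentions.

```python
import numpy as np


def allocate_frame_budgets(relevance, variation, total_budget,
                           alpha=0.5, min_per_frame=1):
    """Split total_budget tokens across frames (hypothetical scheme).

    relevance: per-frame query relevance scores (assumed non-negative).
    variation: per-frame inter-frame variation scores (assumed non-negative).
    alpha:     assumed mixing weight between the two signals.
    """
    relevance = np.asarray(relevance, dtype=float)
    variation = np.asarray(variation, dtype=float)
    n = len(relevance)

    def normalize(x):
        s = x.sum()
        return x / s if s > 0 else np.full(n, 1.0 / n)

    # Blend the two normalized signals; the blend sums to 1.
    score = alpha * normalize(relevance) + (1 - alpha) * normalize(variation)

    # Reserve a floor for every frame so no segment is pruned to zero.
    spare = total_budget - min_per_frame * n
    if spare < 0:
        raise ValueError("total_budget too small for the per-frame floor")

    raw = score * spare
    budgets = np.floor(raw).astype(int) + min_per_frame

    # Hand out tokens lost to rounding by largest fractional remainder.
    leftover = max(int(total_budget - budgets.sum()), 0)
    order = np.argsort(-(raw - np.floor(raw)))
    budgets[order[:leftover]] += 1
    return budgets


def select_frame_tokens(relevance, motion, budget, ratios=(0.5, 0.3, 0.2)):
    """Pick object, motion, and context token indices for one frame.

    relevance: per-token query relevance (hypothetical input).
    motion:    per-token change versus the previous frame (hypothetical input).
    ratios:    assumed shares of the budget for object/motion/context tokens.
    """
    relevance = np.asarray(relevance, dtype=float)
    motion = np.asarray(motion, dtype=float)
    n = len(relevance)
    budget = min(budget, n)

    n_obj = min(int(round(budget * ratios[0])), budget)
    n_mot = min(int(round(budget * ratios[1])), budget - n_obj)
    n_ctx = budget - n_obj - n_mot

    chosen = np.zeros(n, dtype=bool)

    # Object tokens: highest query relevance.
    obj = np.argsort(-relevance)[:n_obj]
    chosen[obj] = True

    # Motion tokens: highest motion among tokens not already chosen.
    mot = np.argsort(-np.where(chosen, -np.inf, motion))[:n_mot]
    chosen[mot] = True

    # Context tokens: evenly spaced picks from whatever remains.
    rest = np.flatnonzero(~chosen)
    if n_ctx > 0:
        ctx = rest[np.linspace(0, len(rest) - 1, n_ctx).astype(int)]
    else:
        ctx = np.array([], dtype=int)

    return {"object": np.sort(obj), "motion": np.sort(mot), "context": np.sort(ctx)}
```

In this sketch a frame that scores high on either signal receives more tokens, while the floor keeps low-scoring frames from being dropped entirely, which is one simple way to read "avoid over-pruned segments".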
format Preprint
id arxiv_https___arxiv_org_abs_2603_05663
institution arXiv
publishDate 2026
record_format arxiv
title Keeping the Evidence Chain: Semantic Evidence Allocation for Training-Free Token Pruning in Video Temporal Grounding
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2603.05663