Bibliographic Details
Main Author: Abdelwahab, Sherif
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2603.29631
author Abdelwahab, Sherif
contents Always-on edge cameras generate continuous video streams where redundant frames degrade cross-modal retrieval by crowding correct results out of top-k search. This paper presents a streaming retrieval architecture: an on-device epsilon-net filter retains only semantically novel frames, building a denoised embedding index; a cross-modal adapter and cloud re-ranker compensate for the compact encoder's weak alignment. A single-pass streaming filter outperforms offline alternatives (k-means, farthest-point, uniform, random) across eight vision-language models (8M-632M) on two egocentric datasets (AEA, EPIC-KITCHENS). Combined, the architecture reaches 45.6% Hit@5 on held-out data using an 8M on-device encoder at an estimated 2.7 mW.
format Preprint
id arxiv_https___arxiv_org_abs_2603_29631
institution arXiv
publishDate 2026
record_format arxiv
title Storing Less, Finding More: How Novelty Filtering Improves Cross-Modal Retrieval on Edge Cameras
topic Computer Vision and Pattern Recognition
Distributed, Parallel, and Cluster Computing
Information Retrieval
I.4.9; I.2.10
url https://arxiv.org/abs/2603.29631
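The abstract describes a single-pass, on-device epsilon-net filter that retains only semantically novel frames. The sketch below illustrates that general idea under stated assumptions and is not the paper's implementation: the record does not specify the distance metric, the threshold, or the index structure, so cosine distance, the `epsilon` value, and the names `epsilon_net_filter` and `_normalize` are all hypothetical choices made for illustration.

```python
import math


def _normalize(vec):
    """Return vec scaled to unit length, or None for an all-zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return None
    return [x / norm for x in vec]


def epsilon_net_filter(embeddings, epsilon):
    """Single-pass novelty filter over a stream of frame embeddings.

    A frame is kept only if its cosine distance to every previously kept
    frame exceeds epsilon (assumed metric; the source does not name one).
    Returns the indices of the kept frames in arrival order.
    """
    kept_indices = []
    kept_vectors = []
    for index, raw in enumerate(embeddings):
        unit = _normalize(raw)
        if unit is None:
            # Skip degenerate embeddings; cosine distance is undefined.
            continue
        # all() over an empty list is True, so the first valid frame is kept.
        is_novel = all(
            1.0 - sum(a * b for a, b in zip(unit, kept)) > epsilon
            for kept in kept_vectors
        )
        if is_novel:
            kept_indices.append(index)
            kept_vectors.append(unit)
    return kept_indices


# Toy stream: frames 1, 3, and 4 are near-duplicates of frames 0 and 2.
stream = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0], [0.01, 0.999], [1.0, 0.0]]
print(epsilon_net_filter(stream, epsilon=0.1))  # prints [0, 2]
```

Each incoming frame is compared against every kept frame, so the cost per frame grows with the size of the retained index. A real on-device filter would likely need a cheaper lookup, which this sketch does not attempt.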