Staff View: Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge :: Library Catalog

Bibliographic Details
Main Authors:	Lando, Giuseppe; Forte, Rosario; Furnari, Antonino
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.22455

_version_	1866916008293302272
author	Lando, Giuseppe; Forte, Rosario; Furnari, Antonino
author_facet	Lando, Giuseppe; Forte, Rosario; Furnari, Antonino
contents	We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, so we investigate deployment on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries and show promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
format	Preprint
id	arxiv_https___arxiv_org_abs_2602_22455
institution	arXiv
publishDate	2026
record_format	arxiv
title	Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2602.22455

Similar Items