Bibliographic Details
Main Authors: Lando, Giuseppe; Forte, Rosario; Furnari, Antonino
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2602.22455
_version_ 1866916008293302272
author Lando, Giuseppe
Forte, Rosario
Furnari, Antonino
author_facet Lando, Giuseppe
Forte, Rosario
Furnari, Antonino
contents We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, so we investigate implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
format Preprint
id arxiv_https___arxiv_org_abs_2602_22455
institution arXiv
publishDate 2026
record_format arxiv
spellingShingle Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
Lando, Giuseppe
Forte, Rosario
Furnari, Antonino
Computer Vision and Pattern Recognition
We investigate the feasibility of using Multimodal Large Language Models (MLLMs) for real-time online episodic memory question answering. While cloud offloading is common, it raises privacy and latency concerns for wearable assistants, so we investigate implementation on the edge. We integrate streaming constraints into our question answering pipeline, which is structured into two asynchronous threads: a Descriptor Thread that continuously converts video into a lightweight textual memory, and a Question Answering (QA) Thread that reasons over the textual memory to answer queries. Experiments on the QAEgo4D-Closed benchmark analyze the performance of MLLMs within strict resource boundaries, showing promising results even when compared to cloud-based solutions. Specifically, an end-to-end configuration running on a consumer-grade 8GB GPU achieves 51.76% accuracy with a Time-To-First-Token (TTFT) of 0.41s. Scaling to a local enterprise-grade server yields 54.40% accuracy with a TTFT of 0.88s. In comparison, a cloud-based solution obtains an accuracy of 56.00%. These competitive results highlight the potential of edge-based solutions for privacy-preserving episodic memory retrieval.
title Exploring Multimodal LMMs for Online Episodic Memory Question Answering on the Edge
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.22455
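
Illustrative note: the abstract describes a pipeline built from two asynchronous threads, a Descriptor Thread that turns video into a textual memory and a QA Thread that answers queries from that memory. The Python sketch below shows one way such a structure could be arranged. It is not the authors' implementation. The names TextMemory, describe_clip, answer_question, descriptor_worker, qa_worker and run_demo are all hypothetical, the two model calls are replaced by trivial string functions, and the measured latency is a plain wall-clock time rather than a true Time-To-First-Token, since no tokens are streamed here.

```python
import queue
import threading
import time

STOPWORDS = {"the", "wearer", "where", "did", "i", "on", "a"}


class TextMemory:
    """Thread-safe, append-only list of (timestamp, caption) entries."""

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def append(self, timestamp, caption):
        with self._lock:
            self._entries.append((timestamp, caption))

    def snapshot(self):
        # Return a copy so the QA thread never iterates a list being mutated.
        with self._lock:
            return list(self._entries)


def describe_clip(clip):
    # Hypothetical stand-in for an MLLM captioning call (assumption).
    return f"the wearer {clip}"


def answer_question(question, entries):
    # Hypothetical stand-in for an LLM (assumption): return the most
    # recent caption that shares a non-stopword with the question.
    words = set(question.lower().replace("?", "").split()) - STOPWORDS
    for _timestamp, caption in reversed(entries):
        if words & set(caption.split()):
            return caption
    return "unknown"


def descriptor_worker(clips, memory):
    # Descriptor thread: consume clips and append captions to the memory.
    while True:
        item = clips.get()
        if item is None:  # sentinel: stop the thread
            clips.task_done()
            break
        timestamp, clip = item
        memory.append(timestamp, describe_clip(clip))
        clips.task_done()


def qa_worker(questions, memory, answers):
    # QA thread: answer each question from a snapshot of the text memory.
    while True:
        question = questions.get()
        if question is None:  # sentinel: stop the thread
            break
        start = time.perf_counter()
        answer = answer_question(question, memory.snapshot())
        latency = time.perf_counter() - start
        answers.put((question, answer, latency))


def run_demo():
    memory = TextMemory()
    clips, questions, answers = queue.Queue(), queue.Queue(), queue.Queue()
    threads = [
        threading.Thread(target=descriptor_worker, args=(clips, memory)),
        threading.Thread(target=qa_worker, args=(questions, memory, answers)),
    ]
    for thread in threads:
        thread.start()
    clips.put((0.0, "opens the fridge"))
    clips.put((5.0, "puts keys on the table"))
    clips.join()  # the question arrives after both clips were described
    questions.put("Where did I put the keys?")
    result = answers.get(timeout=5)
    clips.put(None)
    questions.put(None)
    for thread in threads:
        thread.join()
    return result


if __name__ == "__main__":
    print(run_demo())
```

The two workers share only the text memory, so the QA side never touches raw video. That matches the abstract's description of a lightweight textual memory, but everything else in the sketch is an assumption made for illustration.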