Bibliographic Details
Main Authors: Ngong, Ivoline C., Reza, Zarreen, Near, Joseph P.
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2603.04894
author Ngong, Ivoline C.
Reza, Zarreen
Near, Joseph P.
contents Vision-language models are increasingly applied to sensitive domains such as medical imaging and personal photographs, yet existing differentially private methods for in-context learning are limited to few-shot, text-only settings because privacy cost scales with the number of tokens processed. We present Differentially Private Multimodal Task Vectors (DP-MTV), the first framework enabling many-shot multimodal in-context learning with formal $(\varepsilon, \delta)$-differential privacy by aggregating hundreds of demonstrations into compact task vectors in activation space. DP-MTV partitions private data into disjoint chunks, applies per-layer clipping to bound sensitivity, and adds calibrated noise to the aggregate, requiring only a single noise addition that enables unlimited inference queries. We evaluate on eight benchmarks across three VLM architectures, supporting deployment with or without auxiliary data. At $\varepsilon=1.0$, DP-MTV achieves 50% on VizWiz compared to 55% non-private and 35% zero-shot, preserving most of the gain from in-context learning under meaningful privacy constraints.
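The partition, clip, and noise steps named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no algorithmic detail, so the function names (`partition`, `clip_l2`, `gaussian_sigma`, `dp_task_vector`), the use of a mean over chunks, the replace-one-example sensitivity bound, and the classical Gaussian-mechanism calibration (usually stated for $0 < \varepsilon < 1$) are all assumptions made for illustration. Extracting activations from a VLM is out of scope here; the input is assumed to be one activation vector per chunk per layer.

```python
import math
import random

def partition(items, num_chunks):
    """Split items into disjoint chunks (round-robin; an illustrative choice)."""
    return [items[i::num_chunks] for i in range(num_chunks)]

def clip_l2(vec, clip_norm):
    """Scale vec so that its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]

def gaussian_sigma(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (assumed calibration)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon

def dp_task_vector(chunk_activations, clip_norm, epsilon, delta, rng):
    """Return (noisy per-layer mean, sigma).

    chunk_activations[k][l] is the layer-l activation vector from chunk k
    (hypothetical input layout).
    """
    num_chunks = len(chunk_activations)
    num_layers = len(chunk_activations[0])
    # Assumption: changing one example changes one chunk. After clipping, that
    # chunk's vector moves by at most 2 * clip_norm per layer, so the mean moves
    # by at most 2 * clip_norm / num_chunks per layer, and by sqrt(num_layers)
    # times that across all layers stacked together.
    sensitivity = 2.0 * clip_norm * math.sqrt(num_layers) / num_chunks
    sigma = gaussian_sigma(sensitivity, epsilon, delta)
    noisy = []
    for layer in range(num_layers):
        clipped = [clip_l2(chunk[layer], clip_norm) for chunk in chunk_activations]
        dim = len(clipped[0])
        mean = [sum(v[i] for v in clipped) / num_chunks for i in range(dim)]
        # Noise is added once, to the aggregate only.
        noisy.append([m + rng.gauss(0.0, sigma) for m in mean])
    return noisy, sigma

if __name__ == "__main__":
    rng = random.Random(0)
    # Toy stand-in for activations: 4 chunks, 2 layers, 3 dimensions each.
    toy = [[[rng.uniform(-2, 2) for _ in range(3)] for _ in range(2)] for _ in range(4)]
    vectors, sigma = dp_task_vector(toy, clip_norm=1.0, epsilon=0.5, delta=1e-5, rng=rng)
    print(len(vectors), len(vectors[0]), sigma > 0)
```

Because noise is applied once to the aggregated vector, any number of later inference queries that reuse it are post-processing and incur no additional privacy cost, which matches the abstract's "single noise addition" claim.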
format Preprint
id arxiv_https___arxiv_org_abs_2603_04894
institution arXiv
publishDate 2026
record_format arxiv
title Differentially Private Multimodal In-Context Learning
topic Artificial Intelligence
url https://arxiv.org/abs/2603.04894