Staff View: Differential Multimodal Transformers :: Library Catalog

Bibliographic Details
Main Authors:	Li, Jerry; Oh, Timothy; Hoang, Joseph; Veeramachaneni, Vardhit
Format:	Preprint
Published:	2025
Subjects:	Artificial Intelligence; Multimedia
Online Access:	https://arxiv.org/abs/2507.15875

_version_	1866909698359296000
author	Li, Jerry; Oh, Timothy; Hoang, Joseph; Veeramachaneni, Vardhit
author_facet	Li, Jerry; Oh, Timothy; Hoang, Joseph; Veeramachaneni, Vardhit
contents	Small language models have gained significant popularity due to their efficiency and growing capabilities. However, incorporating additional modalities, such as vision, can exacerbate the challenge of limited context windows by introducing noise. Recent studies have highlighted that Transformer attention mechanisms often disproportionately focus on irrelevant contexts. In this work, we extend the Differential Attention mechanism, originally designed for text-only models, to the text-vision model PaliGemma. Our aim is to evaluate its ability to mitigate noisy information retrieval and reduce hallucinations. To this end, we fine-tuned the PaliGemma 3B model using LoRA, incorporating Differential Attention, and experimented with various parameter settings and configurations. We demonstrate that Differential Attention can be adapted and integrated into the fine-tuning of existing models to enhance noisy information retrieval and question-answering capabilities.
format	Preprint
id	arxiv_https___arxiv_org_abs_2507_15875
institution	arXiv
publishDate	2025
record_format	arxiv
title	Differential Multimodal Transformers
topic	Artificial Intelligence; Multimedia
url	https://arxiv.org/abs/2507.15875

Similar Items