Staff View: :: Library Catalog

Saved in:

Bibliographic Details
Main Authors:	Huang, Ruoxiang, Ma, Xindian, Kong, Rundong, Yuan, Zhen, Zhang, Peng
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2511.00821
Tags:	Add Tag No Tags, Be the first to tag this record!

_version_	1866908624616423424
author	Huang, Ruoxiang Ma, Xindian Kong, Rundong Yuan, Zhen Zhang, Peng
author_facet	Huang, Ruoxiang Ma, Xindian Kong, Rundong Yuan, Zhen Zhang, Peng
contents	Vision-Language Models (VLMs) have demonstrated strong performance across various multimodal tasks, where position encoding plays a vital role in modeling both the sequential structure of textual information and the spatial structure of visual information. However, current VLMs commonly adopt modality-unified 1D or 2D positional indexing strategies, which treat textual and visual tokens uniformly without accounting for their distinct structural properties and sequential continuity for text and spatial coherence for vision. To address this limitation, we propose OMEGA, a novel position encoding framework that employs Modality-Specific Position Encoding (MSPE) to assign positional indices while preserving the inherent structures of each modality across separate coordinate dimensions. Additionally, to align the information density of multimodal data in the positional index space, OMEGA introduces Global Adaptive Encoding Step Scaling (GAESS), which adaptively adjusts the position encoding step size of visual tokens based on the embedding entropy of both modalities. Experimental results demonstrate that OMEGA consistently enhances VLM performance across diverse architectures and VQA benchmarks. On visual-intensive tasks, OMEGA achieves up to 3.43% improvement over baseline position encoding strategies on Qwen2.5-VL-3B, with consistent gains observed across larger models including Qwen2.5-VL-7B and LLaVA-v1.5-7B.
format	Preprint
id	arxiv_https___arxiv_org_abs_2511_00821
institution	arXiv
publishDate	2025
record_format	arxiv
spellingShingle	OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models Huang, Ruoxiang Ma, Xindian Kong, Rundong Yuan, Zhen Zhang, Peng Computer Vision and Pattern Recognition Vision-Language Models (VLMs) have demonstrated strong performance across various multimodal tasks, where position encoding plays a vital role in modeling both the sequential structure of textual information and the spatial structure of visual information. However, current VLMs commonly adopt modality-unified 1D or 2D positional indexing strategies, which treat textual and visual tokens uniformly without accounting for their distinct structural properties and sequential continuity for text and spatial coherence for vision. To address this limitation, we propose OMEGA, a novel position encoding framework that employs Modality-Specific Position Encoding (MSPE) to assign positional indices while preserving the inherent structures of each modality across separate coordinate dimensions. Additionally, to align the information density of multimodal data in the positional index space, OMEGA introduces Global Adaptive Encoding Step Scaling (GAESS), which adaptively adjusts the position encoding step size of visual tokens based on the embedding entropy of both modalities. Experimental results demonstrate that OMEGA consistently enhances VLM performance across diverse architectures and VQA benchmarks. On visual-intensive tasks, OMEGA achieves up to 3.43% improvement over baseline position encoding strategies on Qwen2.5-VL-3B, with consistent gains observed across larger models including Qwen2.5-VL-7B and LLaVA-v1.5-7B.
title	OMEGA: Optimized Multimodal Position Encoding Index Derivation with Global Adaptive Scaling for Vision-Language Models
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2511.00821

Similar Items