| Main Authors: | Huang, Ruoxiang; Ma, Xindian; Kong, Rundong; Yuan, Zhen; Zhang, Peng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2511.00821 |
Similar Items
MODIX: A Training-Free Multimodal Information-Driven Positional Index Scaling for Vision-Language Models
by: Huang, Ruoxiang, et al.
Published: (2026)
Revisiting Multimodal Positional Encoding in Vision-Language Models
by: Huang, Jie, et al.
Published: (2025)
ID-LoRA: Efficient Low-Rank Adaptation Inspired by Matrix Interpolative Decomposition
by: Ma, Xindian, et al.
Published: (2026)
Thinking in Blender: Staged Executable Inverse Graphics with Vision-Language Models
by: He, Guangzhao, et al.
Published: (2026)
V2PE: Improving Multimodal Long-Context Capability of Vision-Language Models with Variable Visual Position Encoding
by: Ge, Junqi, et al.
Published: (2024)
Accelerating Video Generation Inference with Sequential-Parallel 3D Positional Encoding Using a Global Time Index
by: Yuan, Chao, et al.
Published: (2026)
DAE-Fuse: An Adaptive Discriminative Autoencoder for Multi-Modality Image Fusion
by: Guo, Yuchen, et al.
Published: (2024)
Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
by: Zhang, Peng-Fei, et al.
Published: (2026)
Positional Encoding Field
by: Bai, Yunpeng, et al.
Published: (2025)
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
by: Zhang, Jieyu, et al.
Published: (2024)
Weierstrass Positional Encoding for Vision Transformers
by: Xin, Zhihang, et al.
Published: (2026)
An Investigation on The Position Encoding in Vision-Based Dynamics Prediction
by: Zhu, Jiageng, et al.
Published: (2024)
LLaVAShield: Safeguarding Multimodal Multi-Turn Dialogues in Vision-Language Models
by: Huang, Guolei, et al.
Published: (2025)
A 2D Semantic-Aware Position Encoding for Vision Transformers
by: Chen, Xi, et al.
Published: (2025)
CLoVe: Encoding Compositional Language in Contrastive Vision-Language Models
by: Castro, Santiago, et al.
Published: (2024)
Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding
by: Chen, Zhanpeng, et al.
Published: (2025)
MFC-Bench: Benchmarking Multimodal Fact-Checking with Large Vision-Language Models
by: Wang, Shengkang, et al.
Published: (2024)
Multi-View Large Reconstruction Model via Geometry-Aware Positional Encoding and Attention
by: Li, Mengfei, et al.
Published: (2024)
Do Pre-trained Vision-Language Models Encode Object States?
by: Newman, Kaleb, et al.
Published: (2024)
PD-APE: A Parallel Decoding Framework with Adaptive Position Encoding for 3D Visual Grounding
by: Hou, Chenshu, et al.
Published: (2024)
OMEGA-Avatar: One-shot Modeling of 360° Gaussian Avatars
by: Xia, Zehao, et al.
Published: (2026)
Human-inspired Global-to-Parallel Multi-scale Encoding for Lightweight Vision Models
by: Xu, Wei
Published: (2026)
AMMKD: Adaptive Multimodal Multi-teacher Distillation for Lightweight Vision-Language Models
by: Li, Yuqi, et al.
Published: (2025)
ChatEarthNet: A Global-Scale Image-Text Dataset Empowering Vision-Language Geo-Foundation Models
by: Yuan, Zhenghang, et al.
Published: (2024)
Cameras as Relative Positional Encoding
by: Li, Ruilong, et al.
Published: (2025)
Vision-Driven Prompt Optimization for Large Language Models in Multimodal Generative Tasks
by: Franklin, Leo, et al.
Published: (2025)
GPO-V: Jailbreak Diffusion Vision Language Model by Global Probability Optimization
by: Pan, Yu, et al.
Published: (2026)
Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation
by: Li, Jiaye, et al.
Published: (2025)
Data Adaptive Traceback for Vision-Language Foundation Models in Image Classification
by: Peng, Wenshuo, et al.
Published: (2024)
PEVLM: Parallel Encoding for Vision-Language Models
by: Kang, Letian, et al.
Published: (2025)
Parabolic Position Encoding: Vision-Centric, Principled, Extrapolatable, General
by: Øhrstrøm, Christoffer Koo, et al.
Published: (2026)
Investigating and Mitigating the Multimodal Hallucination Snowballing in Large Vision-Language Models
by: Zhong, Weihong, et al.
Published: (2024)
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
by: Chen, Pingyi, et al.
Published: (2025)
Vision-Centric Activation and Coordination for Multimodal Large Language Models
by: Wang, Yunnan, et al.
Published: (2025)
VEQ: Modality-Adaptive Quantization for MoE Vision-Language Models
by: Qin, Guangshuo, et al.
Published: (2026)
PISA-Bench: The PISA Index as a Multilingual and Multimodal Metric for the Evaluation of Vision-Language Models
by: Haller, Patrick, et al.
Published: (2025)
Automatic Robotic Development through Collaborative Framework by Large Language Models
by: Luan, Zhirong, et al.
Published: (2024)
PPE: Positional Preservation Embedding for Token Compression in Multimodal Large Language Models
by: Huang, Mouxiao, et al.
Published: (2025)
MLLM-CL: Continual Learning for Multimodal Large Language Models
by: Zhao, Hongbo, et al.
Published: (2025)
Skip-Vision: Efficient and Scalable Acceleration of Vision-Language Models via Adaptive Token Skipping
by: Zeng, Weili, et al.
Published: (2025)