| Main Authors: | Zhu, Xuanyu, Bai, Yan, Shi, Yang, Lou, Yihang, Zhang, Yuanxing, Jin, Jing, Zhou, Yuan |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.10780 |
Similar Items
MMVIAD: Multi-view Multi-task Video Understanding for Industrial Anomaly Detection
by: Zhao, Xiran, et al.
Published: (2026)
Unveiling Fine-Grained Visual Traces: Evaluating Multimodal Interleaved Reasoning Chains in Multimodal STEM Tasks
by: Jin, Jing, et al.
Published: (2026)
Famba-V: Fast Vision Mamba with Cross-Layer Token Fusion
by: Shen, Hui, et al.
Published: (2024)
Monet: Reasoning in Latent Visual Space Beyond Images and Language
by: Wang, Qixun, et al.
Published: (2025)
Tilt and Average: Geometric Adjustment of the Last Layer for Recalibration
by: Cho, Gyusang, et al.
Published: (2024)
Beyond Static Cropping: Layer-Adaptive Visual Localization and Decoding Enhancement
by: Zhu, Zipeng, et al.
Published: (2026)
Diagnosing and Repairing Unsafe Channels in Vision-Language Models via Causal Discovery and Dual-Modal Safety Subspace Projection
by: Fu, Jinhu, et al.
Published: (2026)
Text-Guided Layer Fusion Mitigates Hallucination in Multimodal LLMs
by: Lin, Chenchen, et al.
Published: (2026)
One Token Per Frame: Reconsidering Visual Bandwidth in World Models for VLA Policy
by: Tang, Zuojin, et al.
Published: (2026)
Prototype Fusion: A Training-Free Multi-Layer Approach to OOD Detection
by: Gul, Shreen, et al.
Published: (2026)
CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models
by: Dang, Yunkai, et al.
Published: (2026)
Evading Visual Aphasia: Contrastive Adaptive Semantic Token Pruning for Vision-Language Models
by: Ma, Jie, et al.
Published: (2026)
Layer- and Timestep-Adaptive Differentiable Token Compression Ratios for Efficient Diffusion Transformers
by: You, Haoran, et al.
Published: (2024)
Tracing Representation Progression: Analyzing and Enhancing Layer-Wise Similarity
by: Jiang, Jiachen, et al.
Published: (2024)
One Layer Is Enough: Adapting Pretrained Visual Encoders for Image Generation
by: Gao, Yuan, et al.
Published: (2025)
Multi-Granularity Vision Fastformer with Fusion Mechanism for Skin Lesion Segmentation
by: Liu, Xuanyu, et al.
Published: (2025)
SeqVLM: Proposal-Guided Multi-View Sequences Reasoning via VLM for Zero-Shot 3D Visual Grounding
by: Lin, Jiawen, et al.
Published: (2025)
Multi-Layer Visual Feature Fusion in Multimodal LLMs: Methods, Analysis, and Best Practices
by: Lin, Junyan, et al.
Published: (2025)
Integrative CAM: Adaptive Layer Fusion for Comprehensive Interpretation of CNNs
by: Singh, Aniket K., et al.
Published: (2024)
COXNet: Cross-Layer Fusion with Adaptive Alignment and Scale Integration for RGBT Tiny Object Detection
by: Peng, Peiran, et al.
Published: (2025)
An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models
by: Chen, Liang, et al.
Published: (2024)
SwiftVLM: Efficient Vision-Language Model Inference via Cross-Layer Token Bypass
by: Qian, Chen, et al.
Published: (2026)
Representations of Text and Images Align From Layer One
by: Wybitul, Evžen, et al.
Published: (2026)
Beyond Static Visual Tokens: Structured Sequential Visual Chain-of-Thought Reasoning
by: Guo, Guangfu, et al.
Published: (2026)
Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
by: Yu, Yinfeng, et al.
Published: (2025)
CoViPAL: Layer-wise Contextualized Visual Token Pruning for Large Vision-Language Models
by: Tang, Zicong, et al.
Published: (2025)
Uncertainty-Encoded Multi-Modal Fusion for Robust Object Detection in Autonomous Driving
by: Lou, Yang, et al.
Published: (2023)
MimicNorm: Weight Mean and Last BN Layer Mimic the Dynamic of Batch Normalization
by: Fei, Wen, et al.
Published: (2020)
AvatarShield: Visual Reinforcement Learning for Human-Centric Synthetic Video Detection
by: Xu, Zhipei, et al.
Published: (2025)
Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs
by: Zhang, Qizhe, et al.
Published: (2024)
Beyond BEV: Optimizing Point-Level Tokens for Collaborative Perception
by: Li, Yang, et al.
Published: (2025)
V2Flow: Unifying Visual Tokenization and Large Language Model Vocabularies for Autoregressive Image Generation
by: Zhang, Guiwei, et al.
Published: (2025)
Variational Bayesian Last Layers
by: Harrison, James, et al.
Published: (2024)
EvoCut: Multi-Layer Evolution-Aware Visual Token Compression for Efficient Large Vision-Language Models
by: Lu, Hongyu, et al.
Published: (2026)
FlattenGPT: Depth Compression for Transformer with Layer Flattening
by: Xu, Ruihan, et al.
Published: (2026)
Lightweight Multi-Scale Feature Extraction with Fully Connected LMF Layer for Salient Object Detection
by: Shi, Yunpeng, et al.
Published: (2025)
What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
by: Jiang, Yutao, et al.
Published: (2025)
LLaVA-SP: Enhancing Visual Representation with Visual Spatial Tokens for MLLMs
by: Lou, Haoran, et al.
Published: (2025)
UniFusion: A Unified Image Fusion Framework with Robust Representation and Source-Aware Preservation
by: Li, Xingyuan, et al.
Published: (2026)
LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors
by: Dalva, Yusuf, et al.
Published: (2024)