| Main Authors: | Gong, Chao, Wang, Depeng, Wei, Zhipeng, Guo, Ya, Zhu, Huijia, Chen, Jingjing |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.10324 |
Similar Items
PixelPrune: Pixel-Level Adaptive Visual Token Reduction via Predictive Coding
by: Wang, Nan, et al.
Published: (2026)
Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment
by: Xu, Rui, et al.
Published: (2025)
VCE: Safe Autoregressive Image Generation via Visual Contrast Exploitation
by: Han, Feng, et al.
Published: (2025)
Reliable and Efficient Concept Erasure of Text-to-Image Diffusion Models
by: Gong, Chao, et al.
Published: (2024)
LINK: Adaptive Modality Interaction for Audio-Visual Video Parsing
by: Wang, Langyu, et al.
Published: (2024)
ImageAttributionBench: How Far Are We from Generalizable Attribution?
by: Mou, Tingshu, et al.
Published: (2026)
Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration
by: Zhou, Ziheng, et al.
Published: (2024)
Mettle: Meta-Token Learning for Memory-Efficient Audio-Visual Adaptation
by: Zhou, Jinxing, et al.
Published: (2025)
From Pixels to Tokens: Byte-Pair Encoding on Quantized Visual Modalities
by: Zhang, Wanpeng, et al.
Published: (2024)
EchoingECG: An Electrocardiogram Cross-Modal Model for Echocardiogram Tasks
by: Gao, Yuan, et al.
Published: (2025)
Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
by: Jin, Dian, et al.
Published: (2025)
DuMo: Dual Encoder Modulation Network for Precise Concept Erasure
by: Han, Feng, et al.
Published: (2025)
AlignVSR: Audio-Visual Cross-Modal Alignment for Visual Speech Recognition
by: Liu, Zehua, et al.
Published: (2024)
Adaptive Identification of Blurred Regions for Accurate Image Deblurring
by: Gao, Hu, et al.
Published: (2025)
VisionSelector: End-to-End Learnable Visual Token Compression for Efficient Multimodal LLMs
by: Zhu, Jiaying, et al.
Published: (2025)
From Waveforms to Pixels: A Survey on Audio-Visual Segmentation
by: Li, Jia, et al.
Published: (2025)
Learning Visual Affordance from Audio
by: Lu, Lidong, et al.
Published: (2025)
Spatial and Frequency Domain Adaptive Fusion Network for Image Deblurring
by: Gao, Hu, et al.
Published: (2025)
Unsupervised Audio-Visual Segmentation with Modality Alignment
by: Bhosale, Swapnil, et al.
Published: (2024)
EchoInk-R1: Exploring Audio-Visual Reasoning in Multimodal LLMs via Reinforcement Learning
by: Xing, Zhenghao, et al.
Published: (2025)
AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees
by: Li, Yuankai, et al.
Published: (2026)
AVHBench: A Cross-Modal Hallucination Benchmark for Audio-Visual Large Language Models
by: Sung-Bin, Kim, et al.
Published: (2024)
Cross-Modal Binary Attention: An Energy-Efficient Fusion Framework for Audio-Visual Learning
by: Saleh, Mohamed, et al.
Published: (2026)
Residual Cross-Modal Fusion Networks for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2026)
PRIMED: Adaptive Modality Suppression for Referring Audio-Visual Segmentation via Biased Competition
by: He, Yuchen, et al.
Published: (2026)
A Systematic Study of Cross-Modal Typographic Attacks on Audio-Visual Reasoning
by: Chen, Tianle, et al.
Published: (2026)
WinTok: A Win-Win Hybrid Tokenizer via Decomposing Visual Understanding and Generation with Transferable Tokens
by: Guo, Yiwei, et al.
Published: (2026)
EEPNet-V2: Patch-to-Pixel Solution for Efficient Cross-Modal Registration between LiDAR Point Cloud and Camera Image
by: Yue, Yuanchao, et al.
Published: (2025)
Towards Flexible, Scalable, and Adaptive Multi-Modal Conditioned Face Synthesis
by: Ren, Jingjing, et al.
Published: (2023)
Towards Open-Vocabulary Audio-Visual Event Localization
by: Zhou, Jinxing, et al.
Published: (2024)
Copyright Infringement Detection in Text-to-Image Diffusion Models via Differential Privacy
by: Man, Xiafeng, et al.
Published: (2025)
UniFlow: A Unified Pixel Flow Tokenizer for Visual Understanding and Generation
by: Yue, Zhengrong, et al.
Published: (2025)
EVATok: Adaptive Length Video Tokenization for Efficient Visual Autoregressive Generation
by: Xiong, Tianwei, et al.
Published: (2026)
EchoPrune: Interpreting Redundancy as Temporal Echoes for Efficient VideoLLMs
by: Li, Jiameng, et al.
Published: (2026)
HAVT-IVD: Heterogeneity-Aware Cross-Modal Network for Audio-Visual Surveillance: Idling Vehicles Detection With Multichannel Audio and Multiscale Visual Cues
by: Li, Xiwen, et al.
Published: (2025)
Improving Visual Token Reduction via Rectifying Distortions for Efficient Multimodal LLM Inference
by: Cho, Hyeonwoo, et al.
Published: (2026)
MoLT: Mixture of Layer-Wise Tokens for Efficient Audio-Visual Learning
by: Rho, Kyeongha, et al.
Published: (2025)
Enhancing Audio-Visual Spiking Neural Networks through Semantic-Alignment and Cross-Modal Residual Learning
by: He, Xiang, et al.
Published: (2025)
Motion-Aware Adaptive Pixel Pruning for Efficient Local Motion Deblurring
by: Shang, Wei, et al.
Published: (2025)