| Main Authors: | Cong, Kaixuan; Wang, Yifan; Xue, Rongkun; Jiang, Yuyang; Feng, Yiming; Yang, Jing |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2507.09323 |
Similar Items
Cs2K: Class-specific and Class-shared Knowledge Guidance for Incremental Semantic Segmentation
by: Cong, Wei, et al.
Published: (2024)
Inconsistency-Aware Cross-Attention for Audio-Visual Fusion in Dimensional Emotion Recognition
by: Rajasekhar, G, et al.
Published: (2024)
RFPPO: Motion Dynamic RRT based Fluid Field - PPO for Dynamic TF/TA Routing Planning
by: Xue, Rongkun, et al.
Published: (2024)
Detail-Enhanced Intra- and Inter-modal Interaction for Audio-Visual Emotion Recognition
by: Shi, Tong, et al.
Published: (2024)
FusionBERT: Multi-View Image-3D Retrieval via Cross-Attention Visual Fusion and Normal-Aware 3D Encoder
by: Li, Wei, et al.
Published: (2026)
Hierarchical Augmentation and Distillation for Class Incremental Audio-Visual Video Recognition
by: Zuo, Yukun, et al.
Published: (2024)
When Eyes and Ears Disagree: Can MLLMs Discern Audio-Visual Confusion?
by: Ye, Qilang, et al.
Published: (2025)
Contrastive Learning for Multimodal Human Activity Recognition with Limited Labeled Data
by: Jing, Long, et al.
Published: (2026)
Pretrained Reversible Generation as Unsupervised Visual Representation Learning
by: Xue, Rongkun, et al.
Published: (2024)
InterActHuman: Multi-Concept Human Animation with Layout-Aligned Audio Conditions
by: Wang, Zhenzhi, et al.
Published: (2025)
ConfusionBench: An Expert-Validated Benchmark for Confusion Recognition and Localization in Educational Videos
by: Dong, Lu, et al.
Published: (2026)
Context-Aware Aerial Object Detection: Leveraging Inter-Object and Background Relationships
by: Ren, Botao, et al.
Published: (2024)
Vision Language Models for Dynamic Human Activity Recognition in Healthcare Settings
by: Abid, Abderrazek, et al.
Published: (2025)
eMotions: A Large-Scale Dataset and Audio-Visual Fusion Network for Emotion Analysis in Short-form Videos
by: Wu, Xuecheng, et al.
Published: (2025)
Dynamic Attention and Bi-directional Fusion for Safety Helmet Wearing Detection
by: Feng, Junwei, et al.
Published: (2024)
DTFSal: Audio-Visual Dynamic Token Fusion for Video Saliency Prediction
by: Hooshanfar, Kiana, et al.
Published: (2025)
Understanding Open-Set Recognition by Jacobian Norm and Inter-Class Separation
by: Park, Jaewoo, et al.
Published: (2022)
Noise-Tolerant Learning for Audio-Visual Action Recognition
by: Han, Haochen, et al.
Published: (2022)
Dynamic Multi-Target Fusion for Efficient Audio-Visual Navigation
by: Yu, Yinfeng, et al.
Published: (2025)
Evaluating Attribute Confusion in Fashion Text-to-Image Generation
by: Liu, Ziyue, et al.
Published: (2025)
Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
by: Praveen, R. Gnana, et al.
Published: (2021)
The Comparability of Model Fusion to Measured Data in Confuser Rejection
by: Flynn, Conor, et al.
Published: (2025)
DyCAF-Net: Dynamic Class-Aware Fusion Network
by: Jahin, Md Abrar, et al.
Published: (2025)
Fusion to Enhance: Fusion Visual Encoder to Enhance Multimodal Language Model
by: She, Yifei, et al.
Published: (2025)
SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing
by: Liang, Sen, et al.
Published: (2026)
Confidence-driven Gradient Modulation for Multimodal Human Activity Recognition: A Dynamic Contrastive Dual-Path Learning Approach
by: Ji, Panpan, et al.
Published: (2025)
Efficient Audio-Visual Fusion for Video Classification
by: Awan, Mahrukh, et al.
Published: (2024)
Universal Incremental Learning: Mitigating Confusion from Inter- and Intra-task Distribution Randomness
by: Luo, Sheng, et al.
Published: (2025)
ESG-Net: Event-Aware Semantic Guided Network for Dense Audio-Visual Event Localization
by: Li, Huilai, et al.
Published: (2025)
CLASP: Class-Adaptive Layer Fusion and Dual-Stage Pruning for Multimodal Large Language Models
by: Dang, Yunkai, et al.
Published: (2026)
WiFi based Human Fall and Activity Recognition using Transformer based Encoder Decoder and Graph Neural Networks
by: Cho, Younggeol, et al.
Published: (2025)
Triple Spectral Fusion for Sensor-based Human Activity Recognition
by: Zhang, Ye, et al.
Published: (2026)
AdaFedFR: Federated Face Recognition with Adaptive Inter-Class Representation Learning
by: Qiu, Di, et al.
Published: (2024)
Class-Aware Mask-Guided Feature Refinement for Scene Text Recognition
by: Yang, Mingkun, et al.
Published: (2024)
IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition
by: Ji, Yuyang, et al.
Published: (2026)
InterMamba: Efficient Human-Human Interaction Generation with Adaptive Spatio-Temporal Mamba
by: Wu, Zizhao, et al.
Published: (2025)
Attend-Fusion: Efficient Audio-Visual Fusion for Video Classification
by: Awan, Mahrukh, et al.
Published: (2024)
Visual Confused Deputy: Exploiting and Defending Perception Failures in Computer-Using Agents
by: Liu, Xunzhuo, et al.
Published: (2026)
Uni-Encoder Meets Multi-Encoders: Representation Before Fusion for Brain Tumor Segmentation with Missing Modalities
by: Song, Peibo, et al.
Published: (2026)
DisCo: Towards Distinct and Coherent Visual Encapsulation in Video MLLMs
by: Zhao, Jiahe, et al.
Published: (2025)