| Main Authors: | Zhou, Jinxing, Li, Zhihui, Yu, Yongqiang, Zhou, Yanghao, Guo, Ruohao, Li, Guangyao, Mao, Yuxin, Han, Mingfei, Chang, Xiaojun, Wang, Meng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2506.23271 |
Similar Items
SimToken: A Simple Baseline for Referring Audio-Visual Segmentation
by: Jin, Dian, et al.
Published: (2025)
Towards Open-Vocabulary Audio-Visual Event Localization
by: Zhou, Jinxing, et al.
Published: (2024)
Think Before You Segment: An Object-aware Reasoning Agent for Referring Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2025)
Label-anticipated Event Disentanglement for Audio-Visual Video Parsing
by: Zhou, Jinxing, et al.
Published: (2024)
CLASP: Cross-modal Salient Anchor-based Semantic Propagation for Weakly-supervised Dense Audio-Visual Event Localization
by: Zhou, Jinxing, et al.
Published: (2025)
Self-Consistency as a Free Lunch: Reducing Hallucinations in Vision-Language Models via Self-Reflection
by: Han, Mingfei, et al.
Published: (2025)
Token Painter: Training-Free Text-Guided Image Inpainting via Mask Autoregressive Models
by: Jiang, Longtao, et al.
Published: (2025)
Audio-Visual Instance Segmentation
by: Guo, Ruohao, et al.
Published: (2023)
Dense Audio-Visual Event Localization under Cross-Modal Consistency and Multi-Temporal Granularity Collaboration
by: Zhou, Ziheng, et al.
Published: (2024)
CoNav: Collaborative Cross-Modal Reasoning for Embodied Navigation
by: Hao, Haihong, et al.
Published: (2025)
Audit After Segmentation: Reference-Free Mask Quality Assessment for Language-Referred Audio-Visual Segmentation
by: Zhou, Jinxing, et al.
Published: (2026)
Beyond Dense Futures: World Models as Structured Planners for Robotic Manipulation
by: Jin, Minghao, et al.
Published: (2026)
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
by: Zhou, Jinxing, et al.
Published: (2024)
User-Feedback-Driven Adaptation for Vision-and-Language Navigation
by: Yu, Yongqiang, et al.
Published: (2025)
LatentPilot: Scene-Aware Vision-and-Language Navigation by Dreaming Ahead with Latent Visual Reasoning
by: Hao, Haihong, et al.
Published: (2026)
Patch-level Sounding Object Tracking for Audio-Visual Question Answering
by: Li, Zhangbin, et al.
Published: (2024)
Mettl14 and Mettl3 Work Cooperatively to Regulate Retinal Development
by: Dan Chen, et al.
Published: (2024)
Face-Guided Sentiment Boundary Enhancement for Weakly-Supervised Temporal Sentiment Localization
by: Han, Cailing, et al.
Published: (2026)
Crab: A Unified Audio-Visual Scene Understanding Model with Explicit Cooperation
by: Du, Henghui, et al.
Published: (2025)
Multimodal Class-aware Semantic Enhancement Network for Audio-Visual Video Parsing
by: Zhao, Pengcheng, et al.
Published: (2024)
Look, Listen and Segment: Towards Weakly Supervised Audio-visual Semantic Segmentation
by: Li, Chengzhi, et al.
Published: (2026)
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
by: Wang, Yaoting, et al.
Published: (2024)
KTV: Keyframes and Key Tokens Selection for Efficient Training-Free Video LLMs
by: Song, Baiyang, et al.
Published: (2026)
TeMTG: Text-Enhanced Multi-Hop Temporal Graph Modeling for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
LongVLM: Efficient Long Video Understanding via Large Language Models
by: Weng, Yuetian, et al.
Published: (2024)
Mettl3‐Mediated m6A Methylation of Pdgfrb Regulates the Angiogenesis‐Dependent Bone Formation
by: Shijie Zhou, et al.
Published: (2026)
Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
Boosting Audio Visual Question Answering via Key Semantic-Aware Cues
by: Li, Guangyao, et al.
Published: (2024)
Progressive Online Video Understanding with Evidence-Aligned Timing and Transparent Decisions
by: Zhang, Kecheng, et al.
Published: (2026)
Meta-Tuning LLMs to Leverage Lexical Knowledge for Generalizable Language Style Understanding
by: Guo, Ruohao, et al.
Published: (2023)
Open-Vocabulary Audio-Visual Semantic Segmentation
by: Guo, Ruohao, et al.
Published: (2024)
TEn-CATG: Text-Enriched Audio-Visual Video Parsing with Multi-Scale Category-Aware Temporal Graph
by: Chen, Yaru, et al.
Published: (2025)
Target Speaker Lipreading by Audio-Visual Self-Distillation Pretraining and Speaker Adaptation
by: Zhang, Jing-Xuan, et al.
Published: (2025)
See, Plan, Rewind: Progress-Aware Vision-Language-Action Models for Robust Robotic Manipulation
by: Dai, Tingjun, et al.
Published: (2026)
Learning Spatial Decay for Vision Transformers
by: Mao, Yuxin, et al.
Published: (2025)
Meta-PINNs: Meta-Learning Enhanced Physics-Informed Machine Learning Framework for Turbomachinery Flow Predictions under Varying Operation Conditions
by: Han, Yuling, et al.
Published: (2026)
Not All Tokens Matter: Towards Efficient LLM Reasoning via Token Significance in Reinforcement Learning
by: Liu, Hanbing, et al.
Published: (2025)
A Minibatch-SGD-Based Learning Meta-Policy for Inventory Systems with Myopic Optimal Policy
by: Lyu, Jiameng, et al.
Published: (2024)
Efficient Training for Human Video Generation with Entropy-Guided Prioritized Progressive Learning
by: Li, Changlin, et al.
Published: (2025)
Prompting Segmentation with Sound Is Generalizable Audio-Visual Source Localizer
by: Wang, Yaoting, et al.
Published: (2023)