| Main authors: | Chen, Tuochao; Veluri, Bandhav; Gong, Hongyu; Gollakota, Shyamnath |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online access: | https://arxiv.org/abs/2511.11124 |
Similar items
Look Once to Hear: Target Speech Hearing with Noisy Examples
by: Veluri, Bandhav, et al.
Published: (2024)
AV-Odyssey Bench: Can Your Multimodal LLMs Really Understand Audio-Visual Information?
by: Gong, Kaixiong, et al.
Published: (2024)
Beyond Turn-Based Interfaces: Synchronous LLMs as Full-Duplex Dialogue Agents
by: Veluri, Bandhav, et al.
Published: (2024)
SeamlessExpressiveLM: Speech Language Model for Expressive Speech-to-Speech Translation with Chain-of-Thought
by: Gong, Hongyu, et al.
Published: (2024)
Audio-Guided Visual Perception for Audio-Visual Navigation
by: Wang, Yi, et al.
Published: (2025)
OLKAVS: An Open Large-Scale Korean Audio-Visual Speech Dataset
by: Park, Jeongkyun, et al.
Published: (2023)
LLMs Meet Multimodal Generation and Editing: A Survey
by: He, Yingqing, et al.
Published: (2024)
LongAV-Compass: Towards Unified Evaluation of Minute-Scale Audio-Visual Generation Across T2AV, I2AV, and V2AV
by: Liu, Tengfei, et al.
Published: (2026)
SeeingSounds: Learning Audio-to-Visual Alignment via Text
by: Carnemolla, Simone, et al.
Published: (2025)
Audio-Omni: Extending Multi-modal Understanding to Versatile Audio Generation and Editing
by: Tian, Zeyue, et al.
Published: (2026)
Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
by: Zeng, Donghuo, et al.
Published: (2026)
Diffusion Models for Joint Audio-Video Generation
by: La Torre, Alejandro Paredes
Published: (2026)
Do Joint Audio-Video Generation Models Understand Physics?
by: Cui, Zijun, et al.
Published: (2026)
AVBench: Human-Aligned and Automated Evaluation Benchmark for Audio-Video Generative Models
by: Yang, Jialiang, et al.
Published: (2026)
Multimodal Sentiment Analysis based on Video and Audio Inputs
by: Fernandez, Antonio, et al.
Published: (2024)
MGM-Omni: Scaling Omni LLMs to Personalized Long-Horizon Speech
by: Wang, Chengyao, et al.
Published: (2025)
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling
by: Ye, Zhen, et al.
Published: (2026)
Apollo: Unified Multi-Task Audio-Video Joint Generation
by: Wang, Jun, et al.
Published: (2026)
VGGSounder: Audio-Visual Evaluations for Foundation Models
by: Zverev, Daniil, et al.
Published: (2025)
AV-Edit: Multimodal Generative Sound Effect Editing via Audio-Visual Semantic Joint Control
by: Guo, Xinyue, et al.
Published: (2025)
TalkVerse: Democratizing Minute-Long Audio-Driven Video Generation
by: Wang, Zhenzhi, et al.
Published: (2025)
FlowPortrait: Reinforcement Learning for Audio-Driven Portrait Video Generation
by: Tan, Weiting, et al.
Published: (2026)
SAVe: Self-Supervised Audio-visual Deepfake Detection Exploiting Visual Artifacts and Audio-visual Misalignment
by: Shahzad, Sahibzada Adil, et al.
Published: (2026)
DASH: Dynamic Audio-Driven Semantic Chunking for Efficient Omnimodal Token Compression
by: Li, Bingzhou, et al.
Published: (2026)
Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator
by: Kang, Minjae, et al.
Published: (2025)
Object-AVEdit: An Object-level Audio-Visual Editing Model
by: Fu, Youquan, et al.
Published: (2025)
Audio-Visual Segmentation via Unlabeled Frame Exploitation
by: Liu, Jinxiang, et al.
Published: (2024)
SoundVista: Novel-View Ambient Sound Synthesis via Visual-Acoustic Binding
by: Chen, Mingfei, et al.
Published: (2025)
Sounding Highlights: Dual-Pathway Audio Encoders for Audio-Visual Video Highlight Detection
by: Joo, Seohyun, et al.
Published: (2026)
Audio-Enhanced Vision-Language Modeling with Latent Space Broadening for High Quality Data Expansion
by: Sun, Yu, et al.
Published: (2025)
SAV-SE: Scene-aware Audio-Visual Speech Enhancement with Selective State Space Model
by: Qian, Xinyuan, et al.
Published: (2024)
MAVERIX: Multimodal Audio-Visual Evaluation and Recognition IndeX
by: Xie, Liuyue, et al.
Published: (2025)
AV-SUPERB: A Multi-Task Evaluation Benchmark for Audio-Visual Representation Models
by: Tseng, Yuan, et al.
Published: (2023)
Contextual Cross-Modal Attention for Audio-Visual Deepfake Detection and Localization
by: Katamneni, Vinaya Sree, et al.
Published: (2024)
What's Making That Sound Right Now? Video-centric Audio-Visual Localization
by: Choi, Hahyeon, et al.
Published: (2025)
Multilingual Audio-Visual Speech Recognition with Hybrid CTC/RNN-T Fast Conformer
by: Burchi, Maxime, et al.
Published: (2024)
A Survey of Recent Advances and Challenges in Deep Audio-Visual Correlation Learning
by: Vilaca, Luis, et al.
Published: (2024)
Enhancing Lie Detection Accuracy: A Comparative Study of Classic ML, CNN, and GCN Models using Audio-Visual Features
by: Abdelwahab, Abdelrahman, et al.
Published: (2024)
SyncVoice: Towards Video Dubbing with Vision-Augmented Pretrained TTS Model
by: Wang, Kaidi, et al.
Published: (2025)
VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
by: Ai, Zhiqi, et al.
Published: (2025)