| Main Authors: | Eliav, Amit; Gannot, Sharon |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.01774 |
Similar Items
Concurrent Speaker Detection: A multi-microphone Transformer-Based Approach
by: Eliav, Amit, et al.
Published: (2024)
Audio-Visual Speaker Diarization: Current Databases, Approaches and Challenges
by: Mingote, Victoria, et al.
Published: (2024)
Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training
by: Tao, Ruijie, et al.
Published: (2024)
Listening for "You": Enhancing Speech Image Retrieval via Target Speaker Extraction
by: Yang, Wenhao, et al.
Published: (2025)
Attentive AV-FusionNet: Audio-Visual Quality Prediction with Hybrid Attention
by: Salaj, Ina, et al.
Published: (2025)
CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
by: Yue, Xianghu, et al.
Published: (2024)
UNQA: Unified No-Reference Quality Assessment for Audio, Image, Video, and Audio-Visual Content
by: Cao, Yuqin, et al.
Published: (2024)
Efficient Face Detection with Audio-Based Region Proposals for Human-Robot Interactions
by: Aris, William, et al.
Published: (2023)
KunquDB: An Attempt for Speaker Verification in the Chinese Opera Scenario
by: Zhou, Huali, et al.
Published: (2024)
SingIt! Singer Voice Transformation
by: Eliav, Amit, et al.
Published: (2024)
Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos
by: Ishikawa, Yuchi, et al.
Published: (2025)
Towards Language-Independent Face-Voice Association with Multimodal Foundation Models
by: Farhadipour, Aref, et al.
Published: (2025)
Event2Audio: Event-Based Optical Vibration Sensing
by: Cai, Mingxuan, et al.
Published: (2025)
Multimodal sensor fusion for real-time location-dependent defect detection in laser-directed energy deposition
by: Chen, Lequn, et al.
Published: (2023)
Listening without Looking: Modality Bias in Audio-Visual Captioning
by: Ishikawa, Yuchi, et al.
Published: (2025)
Speakers Localization Using Batch EM In Unfolding Neural Network
by: Veler, Rina, et al.
Published: (2026)
TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization
by: Malard, Hugo, et al.
Published: (2024)
DLIOS: An LLM-Augmented Real-Time Multi-Modal Interactive Enhancement Overlay System for Douyin Live Streaming
by: Wen, Shuide, et al.
Published: (2026)
BUT System Description for CHiME-9 MCoRec Challenge
by: Klement, Dominik, et al.
Published: (2026)
Bounds on Agreement between Subjective and Objective Measurements
by: Pieper, Jaden, et al.
Published: (2026)
Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
by: Kim, Sungnyun, et al.
Published: (2024)
Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation
by: Berghi, Davide, et al.
Published: (2024)
Multimodal Marvels of Deep Learning in Medical Diagnosis: A Comprehensive Review of COVID-19 Detection
by: Islam, Md Shofiqul, et al.
Published: (2025)
A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning
by: Jin, Liuyi, et al.
Published: (2025)
AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
by: Yeo, Jeong Hun, et al.
Published: (2023)
Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
by: Chen, Xuanjun, et al.
Published: (2025)
Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
by: Praveen, R. Gnana, et al.
Published: (2021)
Binaural Target Speaker Extraction using Individualized HRTF
by: Ellinson, Yoav, et al.
Published: (2025)
Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data
by: Buitrago, Pol, et al.
Published: (2026)
HRTF-guided Binaural Target Speaker Extraction with Real-World Validation
by: Ellinson, Yoav, et al.
Published: (2026)
Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments
by: Opochinsky, Renana, et al.
Published: (2024)
Improvement Of Audiovisual Quality Estimation Using A Nonlinear Autoregressive Exogenous Neural Network And Bitstream Parameters
by: Kossi, Koffi, et al.
Published: (2024)
Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition
by: Park, Jay, et al.
Published: (2025)
The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models
by: Wang, Yi, et al.
Published: (2025)
Multimodal Biomarkers for Schizophrenia: Towards Individual Symptom Severity Estimation
by: Premananth, Gowtham, et al.
Published: (2025)
TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation
by: Kim, Ji-Hoon, et al.
Published: (2025)
PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
by: Liu, Huadai, et al.
Published: (2025)
Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos
by: Berghi, Davide, et al.
Published: (2025)
Linearly Constrained Deep Beamformer for Multi-Speaker Scenarios
by: Zaidel, Ilai, et al.
Published: (2026)
peerRTF: Robust MVDR Beamforming Using Graph Convolutional Network
by: Levi, Daniel, et al.
Published: (2024)