Library Catalog

Bibliographic Details
Main Authors:	Eliav, Amit; Gannot, Sharon
Format:	Preprint
Published:	2024
Subjects:	Audio and Speech Processing; Image and Video Processing
Online Access:	https://arxiv.org/abs/2407.01774

Similar Items

Concurrent Speaker Detection: A multi-microphone Transformer-Based Approach
by: Eliav, Amit, et al.
Published: (2024)

Audio-Visual Speaker Diarization: Current Databases, Approaches and Challenges
by: Mingote, Victoria, et al.
Published: (2024)

Enhancing Real-World Active Speaker Detection with Multi-Modal Extraction Pre-Training
by: Tao, Ruijie, et al.
Published: (2024)

Listening for "You": Enhancing Speech Image Retrieval via Target Speaker Extraction
by: Yang, Wenhao, et al.
Published: (2025)

Attentive AV-FusionNet: Audio-Visual Quality Prediction with Hybrid Attention
by: Salaj, Ina, et al.
Published: (2025)

CoAVT: A Cognition-Inspired Unified Audio-Visual-Text Pre-Training Model for Multimodal Processing
by: Yue, Xianghu, et al.
Published: (2024)

UNQA: Unified No-Reference Quality Assessment for Audio, Image, Video, and Audio-Visual Content
by: Cao, Yuqin, et al.
Published: (2024)

Efficient Face Detection with Audio-Based Region Proposals for Human-Robot Interactions
by: Aris, William, et al.
Published: (2023)

KunquDB: An Attempt for Speaker Verification in the Chinese Opera Scenario
by: Zhou, Huali, et al.
Published: (2024)

SingIt! Singer Voice Transformation
by: Eliav, Amit, et al.
Published: (2024)

Language-Guided Contrastive Audio-Visual Masked Autoencoder with Automatically Generated Audio-Visual-Text Triplets from Videos
by: Ishikawa, Yuchi, et al.
Published: (2025)

Towards Language-Independent Face-Voice Association with Multimodal Foundation Models
by: Farhadipour, Aref, et al.
Published: (2025)

Event2Audio: Event-Based Optical Vibration Sensing
by: Cai, Mingxuan, et al.
Published: (2025)

Multimodal sensor fusion for real-time location-dependent defect detection in laser-directed energy deposition
by: Chen, Lequn, et al.
Published: (2023)

Listening without Looking: Modality Bias in Audio-Visual Captioning
by: Ishikawa, Yuchi, et al.
Published: (2025)

Speakers Localization Using Batch EM In Unfolding Neural Network
by: Veler, Rina, et al.
Published: (2026)

TACO: Training-free Sound Prompted Segmentation via Semantically Constrained Audio-visual CO-factorization
by: Malard, Hugo, et al.
Published: (2024)

DLIOS: An LLM-Augmented Real-Time Multi-Modal Interactive Enhancement Overlay System for Douyin Live Streaming
by: Wen, Shuide, et al.
Published: (2026)

BUT System Description for CHiME-9 MCoRec Challenge
by: Klement, Dominik, et al.
Published: (2026)

Bounds on Agreement between Subjective and Objective Measurements
by: Pieper, Jaden, et al.
Published: (2026)

Learning Video Temporal Dynamics with Cross-Modal Attention for Robust Audio-Visual Speech Recognition
by: Kim, Sungnyun, et al.
Published: (2024)

Leveraging Reverberation and Visual Depth Cues for Sound Event Localization and Detection with Distance Estimation
by: Berghi, Davide, et al.
Published: (2024)

Multimodal Marvels of Deep Learning in Medical Diagnosis: A Comprehensive Review of COVID-19 Detection
by: Islam, Md Shofiqul, et al.
Published: (2025)

A Smart-Glasses for Emergency Medical Services via Multimodal Multitask Learning
by: Jin, Liuyi, et al.
Published: (2025)

AKVSR: Audio Knowledge Empowered Visual Speech Recognition by Compressing Audio Knowledge of a Pretrained Model
by: Yeo, Jeong Hun, et al.
Published: (2023)

Localizing Audio-Visual Deepfakes via Hierarchical Boundary Modeling
by: Chen, Xuanjun, et al.
Published: (2025)

Cross Attentional Audio-Visual Fusion for Dimensional Emotion Recognition
by: Praveen, R. Gnana, et al.
Published: (2021)

Binaural Target Speaker Extraction using Individualized HRTF
by: Ellinson, Yoav, et al.
Published: (2025)

Bootstrapping Audiovisual Speech Recognition in Zero-AV-Resource Scenarios with Synthetic Visual Data
by: Buitrago, Pol, et al.
Published: (2026)

HRTF-guided Binaural Target Speaker Extraction with Real-World Validation
by: Ellinson, Yoav, et al.
Published: (2026)

Single-Microphone Speaker Separation and Voice Activity Detection in Noisy and Reverberant Environments
by: Opochinsky, Renana, et al.
Published: (2024)

Improvement Of Audiovisual Quality Estimation Using A Nonlinear Autoregressive Exogenous Neural Network And Bitstream Parameters
by: Kossi, Koffi, et al.
Published: (2024)

Interpretable Modeling of Articulatory Temporal Dynamics from real-time MRI for Phoneme Recognition
by: Park, Jay, et al.
Published: (2025)

The role of audio-visual integration in the time course of phonetic encoding in self-supervised speech models
by: Wang, Yi, et al.
Published: (2025)

Multimodal Biomarkers for Schizophrenia: Towards Individual Symptom Severity Estimation
by: Premananth, Gowtham, et al.
Published: (2025)

TAVID: Text-Driven Audio-Visual Interactive Dialogue Generation
by: Kim, Ji-Hoon, et al.
Published: (2025)

PrismAudio: Decomposed Chain-of-Thoughts and Multi-dimensional Rewards for Video-to-Audio Generation
by: Liu, Huadai, et al.
Published: (2025)

Spatial and Semantic Embedding Integration for Stereo Sound Event Localization and Detection in Regular Videos
by: Berghi, Davide, et al.
Published: (2025)

Linearly Constrained Deep Beamformer for Multi-Speaker Scenarios
by: Zaidel, Ilai, et al.
Published: (2026)

peerRTF: Robust MVDR Beamforming Using Graph Convolutional Network
by: Levi, Daniel, et al.
Published: (2024)