| Main Authors: | Clarke, Jason, Gotoh, Yoshihiko, Goetze, Stefan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2502.06012 |
Similar Items
Face-Voice Association for Audiovisual Active Speaker Detection in Egocentric Recordings
by: Clarke, Jason, et al.
Published: (2025)
Ensembling Synchronisation-based and Face-Voice Association Paradigms for Robust Active Speaker Detection in Egocentric Recordings
by: Clarke, Jason, et al.
Published: (2025)
EgoAdapt: Enhancing Robustness in Egocentric Interactive Speaker Detection Under Missing Modalities
by: Qian, Xinyuan, et al.
Published: (2026)
Robust Active Speaker Detection in Noisy Environments
by: Vasireddy, Siva Sai Nagender, et al.
Published: (2024)
A Toolkit for Joint Speaker Diarization and Identification with Application to Speaker-Attributed ASR
by: Morrone, Giovanni, et al.
Published: (2024)
SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning
by: Nam, KiHyun, et al.
Published: (2026)
Dual-Stream Decoupled Learning for Temporal Consistency and Speaker Interaction in AVSD
by: Xiao, Junhao, et al.
Published: (2025)
Who is Authentic Speaker
by: Huang, Qiang
Published: (2024)
CueNet: Robust Audio-Visual Speaker Extraction through Cross-Modal Cue Mining and Interaction
by: Wang, Jiadong, et al.
Published: (2026)
StyleSpeaker: Audio-Enhanced Fine-Grained Style Modeling for Speech-Driven 3D Facial Animation
by: Yang, An, et al.
Published: (2025)
Noise-Conditioned Mixture-of-Experts Framework for Robust Speaker Verification
by: Gu, Bin, et al.
Published: (2025)
Audio-Visual Speaker Tracking: Progress, Challenges, and Future Directions
by: Zhao, Jinzheng, et al.
Published: (2023)
AFL-Net: Integrating Audio, Facial, and Lip Modalities with a Two-step Cross-attention for Robust Speaker Diarization in the Wild
by: Yin, Yongkang, et al.
Published: (2023)
Quality-Aware End-to-End Audio-Visual Neural Speaker Diarization
by: He, Mao-Kui, et al.
Published: (2024)
MeMo: Attentional Momentum for Real-time Audio-visual Speaker Extraction under Impaired Visual Conditions
by: Li, Junjie, et al.
Published: (2025)
Plug-and-Steer: Decoupling Separation and Selection in Audio-Visual Target Speaker Extraction
by: Kwak, Doyeop, et al.
Published: (2026)
pTSE-T: Presentation Target Speaker Extraction using Unaligned Text Cues
by: Jiang, Ziyang, et al.
Published: (2024)
Referee: Reference-aware Audiovisual Deepfake Detection
by: Boo, Hyemin, et al.
Published: (2025)
LASPA: Language Agnostic Speaker Disentanglement with Prefix-Tuned Cross-Attention
by: Menon, Aditya Srinivas, et al.
Published: (2025)
M3SD: Multi-modal, Multi-scenario and Multi-language Speaker Diarization Dataset
by: Wu, Shilong
Published: (2025)
REWIND: Speech Time Reversal for Enhancing Speaker Representations in Diffusion-based Voice Conversion
by: Biyani, Ishan D., et al.
Published: (2025)
Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
by: Li, Yingxuan, et al.
Published: (2024)
Preserving Speaker Information in Direct Speech-to-Speech Translation with Non-Autoregressive Generation and Pretraining
by: Zhou, Rui, et al.
Published: (2024)
Spectron: Target Speaker Extraction using Conditional Transformer with Adversarial Refinement
by: Bandyopadhyay, Tathagata
Published: (2024)
Landmark-Guided Cross-Speaker Lip Reading with Mutual Information Regularization
by: Wu, Linzhi, et al.
Published: (2024)
Audio-Visual Speaker Diarization: Current Databases, Approaches and Challenges
by: Mingote, Victoria, et al.
Published: (2024)
RAVSS: Robust Audio-Visual Speech Separation in Multi-Speaker Scenarios with Missing Visual Cues
by: Pan, Tianrui, et al.
Published: (2024)
GaussianTalker: Speaker-specific Talking Head Synthesis via 3D Gaussian Splatting
by: Yu, Hongyun, et al.
Published: (2024)
"The Intangible Victory", Interactive Audiovisual Installation
by: Tsioutas, Konstantinos, et al.
Published: (2026)
InaGVAD : a Challenging French TV and Radio Corpus Annotated for Speech Activity Detection and Speaker Gender Segmentation
by: Doukhan, David, et al.
Published: (2024)
FLARE: Full-Modality Long-Video Audiovisual Retrieval Benchmark with User-Simulated Queries
by: You, Qijie, et al.
Published: (2026)
$C^2$AV-TSE: Context and Confidence-aware Audio Visual Target Speaker Extraction
by: Wu, Wenxuan, et al.
Published: (2025)
G-STAR: End-to-End Global Speaker-Tracking Attributed Recognition
by: Peng, Jing, et al.
Published: (2026)
CLAIP-Emo: Parameter-Efficient Adaptation of Language-supervised models for In-the-Wild Audiovisual Emotion Recognition
by: Chen, Yin, et al.
Published: (2025)
ReactMotion: Generating Reactive Listener Motions from Speaker Utterance
by: Luo, Cheng, et al.
Published: (2026)
Information Need in Metaverse Recordings -- A Field Study
by: Steinert, Patrick, et al.
Published: (2024)
VoxAging: Continuously Tracking Speaker Aging with a Large-Scale Longitudinal Dataset in English and Mandarin
by: Ai, Zhiqi, et al.
Published: (2025)
AMuSE: Adaptive Multimodal Analysis for Speaker Emotion Recognition in Group Conversations
by: Devulapally, Naresh Kumar, et al.
Published: (2024)
TopoCode: Topologically Informed Error Detection and Correction in Communication Systems
by: Guo, Hongzhi
Published: (2024)
MOMENTA: Mixture-of-Experts Over Multimodal Embeddings with Neural Temporal Aggregation for Misinformation Detection
by: Abdollahinejad, Yeganeh, et al.
Published: (2026)