| Main Authors: | Rashid, Maisha Binte, Rivas, Pablo |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2407.21174 |
Similar Items
Leveraging OpenFlamingo for Multimodal Embedding Analysis of C2C Car Parts Data
by: Rashid, Maisha Binte, et al.
Published: (2025)
Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints
by: Salvi, Giampiero
Published: (2024)
MGSC: A Multi-granularity Consistency Framework for Robust End-to-end ASR
by: Yang, Xuwen
Published: (2025)
A Baseline Multimodal Approach to Emotion Recognition in Conversations
by: Yeste, Víctor, et al.
Published: (2026)
Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?
by: Fang, Qingkai, et al.
Published: (2024)
CTC-based Non-autoregressive Textless Speech-to-Speech Translation
by: Fang, Qingkai, et al.
Published: (2024)
MIRFLEX: Music Information Retrieval Feature Library for Extraction
by: Chopra, Anuradha, et al.
Published: (2024)
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
by: Fang, Qingkai, et al.
Published: (2024)
Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks
by: Villatoro-Tello, Esaú, et al.
Published: (2022)
Community-Informed AI Models for Police Accountability
by: Graham, Benjamin A. T., et al.
Published: (2024)
FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations
by: Lee, Yoonhyung, et al.
Published: (2026)
Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS
by: Dai, Ziqi, et al.
Published: (2025)
An open-source voice type classifier for child-centered daylong recordings
by: Lavechin, Marvin, et al.
Published: (2020)
A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer's Disease
by: Li, Yangyang
Published: (2025)
Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation
by: Khorrami, Khazar, et al.
Published: (2021)
SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution
by: Donepudi, Dharma Teja
Published: (2025)
An End-to-End Approach for Korean Wakeword Systems with Speaker Authentication
by: Seo, Geonwoo
Published: (2025)
An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW
by: Mehta, Prateek, et al.
Published: (2025)
SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech
by: Cheng, Zhuangfei, et al.
Published: (2025)
Quantifying the effect of speech pathology on automatic and human speaker verification
by: Halpern, Bence Mark, et al.
Published: (2024)
Exploring rhythm formant analysis for Indic language classification
by: Gogoi, Parismita, et al.
Published: (2024)
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation
by: Yi, Yungang, et al.
Published: (2024)
Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English
by: Zhang, Haoyang, et al.
Published: (2025)
Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish
by: Junczyk, Michał
Published: (2024)
Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
by: R, Joe Dhanith P, et al.
Published: (2024)
Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors
by: Räsänen, Okko
Published: (2026)
SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition
by: Sharma, Manali, et al.
Published: (2026)
Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications
by: Kuhn, Korbinian, et al.
Published: (2024)
Impact of Phonetics on Speaker Identity in Adversarial Voice Attack
by: Dar, Daniyal Kabir, et al.
Published: (2025)
SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding
by: Bai, Bingsong, et al.
Published: (2025)
Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation
by: Freisinger, Steffen, et al.
Published: (2026)
U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF
by: Song, Xingchen, et al.
Published: (2024)
Emotional Voice Messages (EMOVOME) database: emotion recognition in spontaneous voice messages
by: Zaragozá, Lucía Gómez, et al.
Published: (2024)
Developing Acoustic Models for Automatic Speech Recognition in Swedish
by: Salvi, Giampiero
Published: (2024)
SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models
by: Dua, Karan, et al.
Published: (2025)
Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language
by: Karki, Manjil, et al.
Published: (2024)
AQUALLM: Audio Question Answering Data Generation Using Large Language Models
by: Behera, Swarup Ranjan, et al.
Published: (2023)
TAIL: Text-Audio Incremental Learning
by: Sun, Yingfei, et al.
Published: (2025)
Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation
by: Ghaleb, Esam, et al.
Published: (2024)
Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
by: Mehta, Shivam, et al.
Published: (2025)