:: Library Catalog

Bibliographic Details
Main Authors:	Rashid, Maisha Binte, Rivas, Pablo
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Audio and Speech Processing; I.2.7
Online Access:	https://arxiv.org/abs/2407.21174

Similar Items

Leveraging OpenFlamingo for Multimodal Embedding Analysis of C2C Car Parts Data
by: Rashid, Maisha Binte, et al.
Published: (2025)

Dynamic Behaviour of Connectionist Speech Recognition with Strong Latency Constraints
by: Salvi, Giampiero
Published: (2024)

MGSC: A Multi-granularity Consistency Framework for Robust End-to-end ASR
by: Yang, Xuwen
Published: (2025)

A Baseline Multimodal Approach to Emotion Recognition in Conversations
by: Yeste, Víctor, et al.
Published: (2026)

Can We Achieve High-quality Direct Speech-to-Speech Translation without Parallel Speech Data?
by: Fang, Qingkai, et al.
Published: (2024)

CTC-based Non-autoregressive Textless Speech-to-Speech Translation
by: Fang, Qingkai, et al.
Published: (2024)

MIRFLEX: Music Information Retrieval Feature Library for Extraction
by: Chopra, Anuradha, et al.
Published: (2024)

LLaMA-Omni: Seamless Speech Interaction with Large Language Models
by: Fang, Qingkai, et al.
Published: (2024)

Effectiveness of Text, Acoustic, and Lattice-based representations in Spoken Language Understanding tasks
by: Villatoro-Tello, Esaú, et al.
Published: (2022)

Community-Informed AI Models for Police Accountability
by: Graham, Benjamin A. T., et al.
Published: (2024)

FC-TTS: Style and Timbre Control in Zero-Shot Text-to-Speech with Disentangled Speech Representations
by: Lee, Yoonhyung, et al.
Published: (2026)

Deep Dubbing: End-to-End Auto-Audiobook System with Text-to-Timbre and Context-Aware Instruct-TTS
by: Dai, Ziqi, et al.
Published: (2025)

An open-source voice type classifier for child-centered daylong recordings
by: Lavechin, Marvin, et al.
Published: (2020)

A Robust Classification Method using Hybrid Word Embedding for Early Diagnosis of Alzheimer's Disease
by: Li, Yangyang
Published: (2025)

Can phones, syllables, and words emerge as side-products of cross-situational audiovisual learning? -- A computational investigation
by: Khorrami, Khazar, et al.
Published: (2021)

SFMS-ALR: Script-First Multilingual Speech Synthesis with Adaptive Locale Resolution
by: Donepudi, Dharma Teja
Published: (2025)

An End-to-End Approach for Korean Wakeword Systems with Speaker Authentication
by: Seo, Geonwoo
Published: (2025)

An accurate and revised version of optical character recognition-based speech synthesis using LabVIEW
by: Mehta, Prateek, et al.
Published: (2025)

SpeechAccentLLM: A Unified Framework for Foreign Accent Conversion and Text to Speech
by: Cheng, Zhuangfei, et al.
Published: (2025)

Quantifying the effect of speech pathology on automatic and human speaker verification
by: Halpern, Bence Mark, et al.
Published: (2024)

Exploring rhythm formant analysis for Indic language classification
by: Gogoi, Parismita, et al.
Published: (2024)

PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for Long-Term Expressive Symbolic Music Generation
by: Yi, Yungang, et al.
Published: (2024)

Impact of Frame Rates on Speech Tokenizer: A Case Study on Mandarin and English
by: Zhang, Haoyang, et al.
Published: (2025)

Framework for Curating Speech Datasets and Evaluating ASR Systems: A Case Study for Polish
by: Junczyk, Michał
Published: (2024)

Multimodal Emotion Recognition using Audio-Video Transformer Fusion with Cross Attention
by: R, Joe Dhanith P, et al.
Published: (2024)

Computational modeling of early language learning from acoustic speech and audiovisual input without linguistic priors
by: Räsänen, Okko
Published: (2026)

SW-ASR: A Context-Aware Hybrid ASR Pipeline for Robust Single Word Speech Recognition
by: Sharma, Manali, et al.
Published: (2026)

Beyond Levenshtein: Leveraging Multiple Algorithms for Robust Word Error Rate Computations And Granular Error Classifications
by: Kuhn, Korbinian, et al.
Published: (2024)

Impact of Phonetics on Speaker Identity in Adversarial Voice Attack
by: Dar, Daniyal Kabir, et al.
Published: (2025)

SynParaSpeech: Automated Synthesis of Paralinguistic Datasets for Speech Generation and Understanding
by: Bai, Bingsong, et al.
Published: (2025)

Towards Multi-Level Transcript Segmentation: LoRA Fine-Tuning for Table-of-Contents Generation
by: Freisinger, Steffen, et al.
Published: (2026)

U2++ MoE: Scaling 4.7x parameters with minimal impact on RTF
by: Song, Xingchen, et al.
Published: (2024)

Emotional Voice Messages (EMOVOME) database: emotion recognition in spontaneous voice messages
by: Zaragozá, Lucía Gómez, et al.
Published: (2024)

Developing Acoustic Models for Automatic Speech Recognition in Swedish
by: Salvi, Giampiero
Published: (2024)

SpeechWeave: Diverse Multilingual Synthetic Text & Audio Data Generation Pipeline for Training Text to Speech Models
by: Dua, Karan, et al.
Published: (2025)

Advancing Voice Cloning for Nepali: Leveraging Transfer Learning in a Low-Resource Language
by: Karki, Manjil, et al.
Published: (2024)

AQUALLM: Audio Question Answering Data Generation Using Large Language Models
by: Behera, Swarup Ranjan, et al.
Published: (2023)

TAIL: Text-Audio Incremental Learning
by: Sun, Yingfei, et al.
Published: (2025)

Learning Co-Speech Gesture Representations in Dialogue through Contrastive Learning: An Intrinsic Evaluation
by: Ghaleb, Esam, et al.
Published: (2024)

Make Some Noise: Towards LLM audio reasoning and generation using sound tokens
by: Mehta, Shivam, et al.
Published: (2025)