| Main Authors: | Kumar, Anurag; Paturi, Rohit; Afshan, Amber; Srinivasan, Sundararajan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
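| Subjects: | Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning; Sound |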
| Online Access: | https://arxiv.org/abs/2501.08421 |
| _version_ | 1866916566763831296 |
|---|---|
| author | Kumar, Anurag; Paturi, Rohit; Afshan, Amber; Srinivasan, Sundararajan |
| author_facet | Kumar, Anurag; Paturi, Rohit; Afshan, Amber; Srinivasan, Sundararajan |
| contents | Speaker Diarization (SD) is a crucial component of modern end-to-end ASR pipelines. Traditional SD systems, which are typically audio-based and operate independently of ASR, often introduce speaker errors, particularly during speaker transitions and overlapping speech. Recently, language models, including fine-tuned large language models (LLMs), have been shown to be effective as second-pass speaker error correctors by leveraging lexical context in the transcribed output. In this work, we introduce a novel acoustic conditioning approach to provide more fine-grained information from the acoustic diarizer to the LLM. We also show that a simpler constrained decoding strategy reduces LLM hallucinations while avoiding complicated post-processing. Our approach significantly reduces speaker error rates by 24-43% across the Fisher, Callhome, and RT03-CTS datasets, compared to the first-pass acoustic SD. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2501_08421 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | SEAL: Speaker Error Correction using Acoustic-conditioned Large Language Models |
| topic | Audio and Speech Processing; Artificial Intelligence; Computation and Language; Machine Learning; Sound |
| url | https://arxiv.org/abs/2501.08421 |