Staff View: Towards Continuous Sign Language Conversation from Isolated Signs :: Library Catalog

Bibliographic Details
Main Authors:	Kim, Youngmin; Choo, Kyobin; Park, Jiwoo; Kim, Minseo; Kim, Chanyoung; Kim, Junhyeok; Hwang, Seong Jae
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.14705

_version_	1866916012618678272
author	Kim, Youngmin; Choo, Kyobin; Park, Jiwoo; Kim, Minseo; Kim, Chanyoung; Kim, Junhyeok; Hwang, Seong Jae
author_facet	Kim, Youngmin; Choo, Kyobin; Park, Jiwoo; Kim, Minseo; Kim, Chanyoung; Kim, Junhyeok; Hwang, Seong Jae
contents	Sign language is the primary language for many Deaf and Hard-of-Hearing (DHH) signers, yet most conversational AI systems still mediate interaction through spoken or written language. This spoken-language-centered interface can limit access for signers for whom spoken or written language is not the most accessible medium, motivating direct sign-to-sign conversational modeling. However, sentence-level sign video data are expensive to collect and annotate, leaving existing sign translation and production models with limited vocabulary coverage and weak open-domain generalization. We address this bottleneck by constructing continuous sign conversations from isolated signs: large-scale labeled isolated clips are collected as lexically grounded motion primitives and recomposed into sign-language-ordered utterances derived from existing dialogue corpora. We introduce SignaVox-W, which provides, to our knowledge, the largest labeled isolated-sign vocabulary to date, and SignaVox-U, a continuous 3D sign conversation dataset built from SignaVox-W. To bridge structural mismatch between spoken and signed languages, we use a retrieval-guided spoken-to-gloss translator; to bridge independently collected isolated clips, we propose BRAID, a diffusion Transformer that performs duration alignment and co-articulatory boundary inpainting. With the resulting data, we train SignaVox, a direct sign-to-sign conversational model that generates 3D body, hand, and facial motion responses from prior signing context without spoken-language text or externally provided glosses at inference time. Quantitative and qualitative evaluations show improved isolated-to-continuous motion quality, stronger response-level semantic alignment, and scalable signer-centered interaction that better supports visual-spatial articulation.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_14705
institution	arXiv
publishDate	2026
record_format	arxiv
title	Towards Continuous Sign Language Conversation from Isolated Signs
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.14705

Similar Items