Bibliographic Details
Main Authors: Li, Yulong, Zhang, Yuxuan, Tang, Feilong, Hu, Ming, Lu, Zhixiang, Xue, Haochen, Wu, Jianghao, Zhou, Mian, Dang, Kang, Li, Chong, Wang, Yifang, Razzak, Imran, Su, Jionglong
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2501.00765
contents Sign language is the primary communication mode for 72 million hearing-impaired individuals worldwide, necessitating effective bidirectional Sign Language Production (SLP) and Sign Language Translation (SLT) systems. However, functional bidirectional systems require a unified linguistic environment, which is hindered by the lack of suitable unified datasets, particularly those providing the pose information needed for accurate SLP evaluation. In addition, current SLP evaluation methods such as back-translation ignore pose accuracy, and high-quality coordinated generation remains challenging. To create this environment and overcome these challenges, we introduce CNText2Sign and CNSign, which together constitute the first unified dataset aimed at supporting bidirectional accessibility systems for Chinese sign language; CNText2Sign provides 15,000 natural language-to-sign mappings and standardized skeletal keypoints for 8,643 vocabulary items to support pose assessment. Building on this foundation, we propose the AuraLLM model, which uses a decoupled architecture together with CNText2Sign's pose data to enable a novel direct assessment of gesture accuracy. The model employs retrieval augmentation and Cascading Vocabulary Resolution to handle semantic mapping and out-of-vocabulary words, and achieves all-scenario production with controllable coordination of gestures and facial expressions via pose-conditioned video synthesis. AuraLLM establishes a strong performance baseline on CNText2Sign with a BLEU-4 score of 50.41 under direct evaluation. Separately, our SLT model SignMST-C employs targeted self-supervised pretraining for dynamic feature capture, achieving new state-of-the-art results on PHOENIX2014-T with BLEU-4 scores up to 32.08.
format Preprint
id arxiv_https___arxiv_org_abs_2501_00765
institution arXiv
publishDate 2025
record_format arxiv
title Beyond Words: AuralLLM and SignMST-C for Sign Language Production and Bidirectional Accessibility
topic Computer Vision and Pattern Recognition
Machine Learning
url https://arxiv.org/abs/2501.00765