Bibliographic Details
Main Authors: Lu, Yichen, Dai, Wei, Liu, Jiaen, Kwok, Ching Wing, Wu, Zongheng, Xiao, Xudong, Sun, Ao, Fu, Sheng, Zhan, Jianyuan, Wang, Yian, Saito, Takatomo, Lai, Sicheng
Format: Preprint
Published: 2025
Subjects: Artificial Intelligence; Computation and Language; Audio and Speech Processing
Online Access: https://arxiv.org/abs/2507.07306
author Lu, Yichen
Dai, Wei
Liu, Jiaen
Kwok, Ching Wing
Wu, Zongheng
Xiao, Xudong
Sun, Ao
Fu, Sheng
Zhan, Jianyuan
Wang, Yian
Saito, Takatomo
Lai, Sicheng
contents LLM-based translation agents have achieved highly human-like translation results and can handle longer and more complex contexts with greater efficiency. However, they are typically limited to text-only inputs. In this paper, we introduce ViDove, a translation agent system designed for multimodal input. Inspired by the workflow of human translators, ViDove leverages visual and contextual background information to enhance the translation process. Additionally, we integrate a multimodal memory system and long- and short-term memory modules enriched with domain-specific knowledge, enabling the agent to perform more accurately and adaptively in real-world scenarios. As a result, ViDove achieves significantly higher translation quality in both subtitle generation and general translation tasks, with a 28% improvement in BLEU scores and a 15% improvement in SubER compared to previous state-of-the-art baselines. Moreover, we introduce DoveBench, a new benchmark for long-form automatic video subtitling and translation, featuring 17 hours of high-quality, human-annotated data. Our code is available at https://github.com/pigeonai-org/ViDove
format Preprint
id arxiv_https___arxiv_org_abs_2507_07306
institution arXiv
publishDate 2025
record_format arxiv
title ViDove: A Translation Agent System with Multimodal Context and Memory-Augmented Reasoning
topic Artificial Intelligence
Computation and Language
Audio and Speech Processing
url https://arxiv.org/abs/2507.07306