Saved in:
| Main Authors: | , |
|---|---|
| Format: | Preprint |
| Published: |
2024
|
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2409.15052 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| _version_ | 1866915607694278656 |
|---|---|
| author | Betala, Siddharth Chokshi, Ishan |
| author_facet | Betala, Siddharth Chokshi, Ishan |
| contents | In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method achieved competitive results, scoring 37.90 BLEU on the English-Hindi Challenge Set and ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively. We conduct additional experiments on a subset of 250 images, exploring the trade-offs between BLEU scores and semantic similarity across various weighting schemes. |
| format | Preprint |
| id |
arxiv_https___arxiv_org_abs_2409_15052 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| spellingShingle | Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning Betala, Siddharth Chokshi, Ishan Computation and Language Artificial Intelligence In this paper, we describe our system under the team name Brotherhood for the English-to-Lowres Multi-Modal Translation Task. We participate in the multi-modal translation tasks for English-Hindi, English-Hausa, English-Bengali, and English-Malayalam language pairs. We present a method leveraging multi-modal Large Language Models (LLMs), specifically GPT-4o and Claude 3.5 Sonnet, to enhance cross-lingual image captioning without traditional training or fine-tuning. Our approach utilizes instruction-tuned prompting to generate rich, contextual conversations about cropped images, using their English captions as additional context. These synthetic conversations are then translated into the target languages. Finally, we employ a weighted prompting strategy, balancing the original English caption with the translated conversation to generate captions in the target language. This method achieved competitive results, scoring 37.90 BLEU on the English-Hindi Challenge Set and ranking first and second for English-Hausa on the Challenge and Evaluation Leaderboards, respectively. We conduct additional experiments on a subset of 250 images, exploring the trade-offs between BLEU scores and semantic similarity across various weighting schemes. |
| title | Brotherhood at WMT 2024: Leveraging LLM-Generated Contextual Conversations for Cross-Lingual Image Captioning |
| topic | Computation and Language Artificial Intelligence |
| url | https://arxiv.org/abs/2409.15052 |