| Main Authors: | Li, Jiajie; Xu, Chenhui; Liu, Meihuan; Xiong, Jinjun |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2603.20116 |
| _version_ | 1866914411671715840 |
|---|---|
| author | Li, Jiajie; Xu, Chenhui; Liu, Meihuan; Xiong, Jinjun |
| author_facet | Li, Jiajie; Xu, Chenhui; Liu, Meihuan; Xiong, Jinjun |
| contents | Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model's inherent reasoning and perceptual capabilities. CoA uses reinforcement learning to introduce a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model's core visual-language abilities, providing a reliable pathway for domain specialization in VLMs. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2603_20116 |
| institution | arXiv |
| publishDate | 2026 |
| record_format | arxiv |
| title | Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence |
| url | https://arxiv.org/abs/2603.20116 |