Bibliographic Details
Main Authors: Li, Jiajie; Xu, Chenhui; Liu, Meihuan; Xiong, Jinjun
Format: Preprint
Published: 2026
Subjects:
Online Access: https://arxiv.org/abs/2603.20116
author Li, Jiajie
Xu, Chenhui
Liu, Meihuan
Xiong, Jinjun
contents Conventional fine-tuning on domain-specific datasets can inadvertently alter a model's pretrained multimodal priors, leading to reduced generalization. To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model's inherent reasoning and perceptual capabilities. CoA uses reinforcement learning to introduce a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence. Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning. Furthermore, ablation studies confirm that CoA effectively preserves the model's core vision-language abilities, providing a reliable pathway for domain specialization in vision-language models (VLMs).
format Preprint
id arxiv_https___arxiv_org_abs_2603_20116
institution arXiv
publishDate 2026
record_format arxiv
title Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning
topic Computer Vision and Pattern Recognition
Artificial Intelligence
url https://arxiv.org/abs/2603.20116