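Staff View: On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications :: Library Catalog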

Bibliographic Details
Main Authors:	Baur, Simon; Benova, Alexandra; Cantú, Emilio Dolgener; Ma, Jackie
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Machine Learning
Online Access:	https://arxiv.org/abs/2508.06558

_version_	1866908483419373568
author	Baur, Simon; Benova, Alexandra; Cantú, Emilio Dolgener; Ma, Jackie
author_facet	Baur, Simon; Benova, Alexandra; Cantú, Emilio Dolgener; Ma, Jackie
contents	Deploying deep learning models in clinical practice often requires leveraging multiple data modalities, such as images, text, and structured data, to achieve robust and trustworthy decisions. However, not all modalities are always available at inference time. In this work, we propose multimodal privileged knowledge distillation (MMPKD), a training strategy that utilizes additional modalities available solely during training to guide a unimodal vision model. Specifically, we used a text-based teacher model for chest radiographs (MIMIC-CXR) and a tabular metadata-based teacher model for mammography (CBIS-DDSM) to distill knowledge into a vision transformer student model. We show that MMPKD can improve the zero-shot ability of the resulting attention maps to localize regions of interest (ROI) in input images. However, this effect does not generalize across domains, contrary to what prior research suggests.
format	Preprint
id	arxiv_https___arxiv_org_abs_2508_06558
institution	arXiv
publishDate	2025
record_format	arxiv
title	On the effectiveness of multimodal privileged knowledge distillation in two vision transformer based diagnostic applications
topic	Computer Vision and Pattern Recognition; Machine Learning
url	https://arxiv.org/abs/2508.06558

Similar Items