Bibliographic Details
Main Authors: She, Chengying, Chen, Chengwei, Zhang, Xinran, Wang, Ben, Liu, Lizhuang, Shao, Chengwei, Bian, Yun
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2601.20347
_version_ 1866912874397433856
author She, Chengying
Chen, Chengwei
Zhang, Xinran
Wang, Ben
Liu, Lizhuang
Shao, Chengwei
Bian, Yun
contents Multimodal evidence is critical in computational pathology: gigapixel whole slide images capture tumor morphology, while patient-level clinical descriptors preserve complementary context for prognosis. Integrating such heterogeneous signals remains challenging because feature spaces exhibit distinct statistics and scales. We introduce MMSF, a multitask and multimodal supervised framework built on a linear-complexity MIL backbone that explicitly decomposes and fuses cross-modal information. MMSF comprises a graph feature extraction module embedding tissue topology at the patch level, a clinical data embedding module standardizing patient attributes, a feature fusion module aligning modality-shared and modality-specific representations, and a Mamba-based MIL encoder with multitask prediction heads. Experiments on CAMELYON16 and TCGA-NSCLC demonstrate 2.1–6.6% accuracy and 2.2–6.9% AUC improvements over competitive baselines, while evaluations on five TCGA survival cohorts yield 7.1–9.8% C-index improvements compared with unimodal methods and 5.6–7.1% over multimodal alternatives.
format Preprint
id arxiv_https___arxiv_org_abs_2601_20347
institution arXiv
publishDate 2026
record_format arxiv
title MMSF: Multitask and Multimodal Supervised Framework for WSI Classification and Survival Analysis
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2601.20347