Bibliographic Details
Main Authors: Lin, Chia-Fu, Tseng, Yi-Ju
Format: Preprint
Published: 2026
Subjects: Machine Learning
Online Access: https://arxiv.org/abs/2603.07249
_version_ 1866911495816740864
author Lin, Chia-Fu
Tseng, Yi-Ju
author_facet Lin, Chia-Fu
Tseng, Yi-Ju
contents Second primary cancer (SPC), a new cancer that differs from a patient's previously diagnosed cancer, is a growing concern as cancer survival rates improve. Early prediction of SPC is essential to enable timely clinical intervention. This study focuses on lung cancer survivors treated in Taiwanese hospitals, where the limited size and geographic scope of local datasets restrict the effectiveness and generalizability of traditional machine learning approaches. To address this, we incorporate external data from the publicly available US-based Surveillance, Epidemiology, and End Results (SEER) program, significantly increasing data diversity and scale. However, integrating multi-source datasets presents challenges such as feature inconsistency and privacy constraints. Rather than naively merging the data, we propose a loss fusion horizontal federated learning (LF2L) framework that enables effective cross-institutional collaboration while preserving institutional privacy by avoiding data sharing. By using both common and unique features and balancing their contributions through a shared loss mechanism, our method substantially improves SPC prediction performance. Experimental results show statistically significant improvements in AUROC and AUPRC over localized, horizontal federated, and centralized learning baselines. These results highlight the importance of not only acquiring external data but also leveraging it effectively to enhance model performance in real-world clinical model development.
format Preprint
id arxiv_https___arxiv_org_abs_2603_07249
institution arXiv
publishDate 2026
record_format arxiv
title LF2L: Loss Fusion Horizontal Federated Learning Across Heterogeneous Feature Spaces Using External Datasets Effectively: A Case Study in Second Primary Cancer Prediction
topic Machine Learning
url https://arxiv.org/abs/2603.07249