Bibliographic Details
Main authors: Chen, Qi, Ding, Shuhan, Gu, Yu, Liu, Nan, Bian, Jiang, Yuille, Alan, Zhou, Zongwei, Fu, Jingjing
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online access: https://arxiv.org/abs/2605.30893
_version_ 1866913172240203776
author Chen, Qi
Ding, Shuhan
Gu, Yu
Liu, Nan
Bian, Jiang
Yuille, Alan
Zhou, Zongwei
Fu, Jingjing
contents Variational autoencoders (VAEs) compress high-resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and the resulting models often degrade under heterogeneous scanners, protocols, and diseases. This paper takes a step toward training-free medical VAEs, building on a key observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic and lung tumors. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with 36.2% higher CT CLIP score, and improves multi-disease generation faithfulness across 18 types by 2.76% AUC. These results demonstrate Foundation VAEs as a practical interface for scalable CT representation reuse and faithful CT generation. Our code and demo are available at https://github.com/qic999/Foundation-VAE.
format Preprint
id arxiv_https___arxiv_org_abs_2605_30893
institution arXiv
publishDate 2026
record_format arxiv
title Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2605.30893