Bibliographic Details
Main Authors: Nützel, Felix; Dombrowski, Mischa; Kainz, Bernhard
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2512.01675
_version_ 1866918495159058432
author Nützel, Felix
Dombrowski, Mischa
Kainz, Bernhard
contents Text-to-image flow matching transformers degrade sharply in long-tail settings: tail-class outputs collapse in fidelity and diversity, limiting their value as synthetic augmentation for rare conditions. We trace this to low head-versus-tail gradient alignment during fine-tuning, an optimization-level pathology that conditioning-side and sampling-side interventions do not address. We propose GRASP (Guided Residual Adapters with Sample-wise Partitioning): a deterministic partition of the conditioning space, paired with group-specific residual adapters in the transformer feedforward layers, that leaves the flow-matching objective and the sampler untouched. In conditional flow matching, condition values index distinct sets of probability paths, so partitioning along the conditioning is the structurally correct factorization and a suitable proxy for gradient alignment. Because the partition is static, every tail sample is guaranteed to update its assigned expert, which bypasses extreme long-tail failure modes. Crucially, GRASP is non-invasive and composable: on MIMIC-CXR-LT, combining GRASP with self-guided minority sampling at inference time yields the best all-labels IRS we observe, beyond either intervention alone. GRASP itself reduces overall FID by up to 80% and lifts tail-class coverage by up to 44% over full fine-tuning, learned-routing MoE, and minority guidance. Used as training data for a downstream DenseNet classifier on NIH-CXR-LT, GRASP synthetics significantly outperform every non-GRASP alternative on macro F1, match the macro F1 obtained from real training data, and yield nonzero F1 on 9 of 13 classes versus 3 of 13 from full fine-tuning. Results on ImageNet-LT confirm the mechanism is not tied to medical inductive bias.
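The abstract's core idea (a static partition of the conditioning space that selects a group-specific residual adapter inside a feedforward layer) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation. The hash-based `assign_group` rule, the low-rank adapter shape, the zero initialization, the class name `GroupedResidualFFN`, and the label string used below are all hypothetical assumptions for illustration; the record does not specify how GRASP builds its partition or parameterizes its adapters.

```python
import hashlib
import random

NUM_GROUPS = 4  # hypothetical number of partitions


def assign_group(condition: str, num_groups: int = NUM_GROUPS) -> int:
    """Map a conditioning string to a fixed group index.

    Assumption: a hash stands in for whatever deterministic partition
    rule the paper uses. The property that matters is that the same
    condition always lands in the same group.
    """
    digest = hashlib.sha256(condition.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_groups


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def make_matrix(rows, cols, rng, scale):
    """Create a rows x cols matrix with Gaussian entries."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class GroupedResidualFFN:
    """Shared feedforward block plus one low-rank residual adapter per group."""

    def __init__(self, dim, hidden, rank, num_groups, seed=0):
        rng = random.Random(seed)
        # Shared weights, used by every sample.
        self.w_in = make_matrix(hidden, dim, rng, 0.1)
        self.w_out = make_matrix(dim, hidden, rng, 0.1)
        # Group-specific adapters: down-projection is random, up-projection
        # starts at zero so each adapter is initially a no-op (assumption).
        self.down = [make_matrix(rank, dim, rng, 0.1) for _ in range(num_groups)]
        self.up = [[[0.0] * rank for _ in range(dim)] for _ in range(num_groups)]

    def forward(self, x, group):
        # Shared path: linear, ReLU, linear.
        hidden = [max(0.0, v) for v in matvec(self.w_in, x)]
        base = matvec(self.w_out, hidden)
        # Residual path: only the adapter of the assigned group is used.
        delta = matvec(self.up[group], matvec(self.down[group], x))
        return [b + d for b, d in zip(base, delta)]


layer = GroupedResidualFFN(dim=4, hidden=8, rank=2, num_groups=NUM_GROUPS)
group = assign_group("pneumothorax")  # hypothetical condition label
output = layer.forward([0.5, -1.0, 0.25, 2.0], group)
```

Because routing is a fixed function of the condition rather than a learned gate, each sample's adapter path touches only the adapter parameters of its own group, which is the property the abstract relies on when it says every tail sample is guaranteed to update its assigned expert.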
format Preprint
id arxiv_https___arxiv_org_abs_2512_01675
institution arXiv
publishDate 2025
record_format arxiv
title GRASP: Guided Residual Adapters with Sample-wise Partitioning
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2512.01675