Bibliographic Details
Main Authors: Lucassen, Ruben T., van de Luijtgaarden, Tijn, Moonemans, Sander P. J., Breimer, Gerben E., Blokx, Willeke A. M., Veta, Mitko
Format: Preprint
Published: 2025
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2502.19285
author Lucassen, Ruben T.
van de Luijtgaarden, Tijn
Moonemans, Sander P. J.
Breimer, Gerben E.
Blokx, Willeke A. M.
Veta, Mitko
contents Vision-language models in pathology enable multimodal case retrieval and automated report generation. Many of the models developed so far, however, have been trained on pathology reports that include information which cannot be inferred from paired whole slide images (e.g., patient history), potentially leading to hallucinated sentences in generated reports. To this end, we investigate how the selection of information from pathology reports for vision-language modeling affects the quality of the multimodal representations and generated reports. More concretely, we compare a model trained on full reports against a model trained on preprocessed reports that only include sentences describing the cell and tissue appearances based on the H&E-stained slides. For the experiments, we built upon the BLIP-2 framework and used a cutaneous melanocytic lesion dataset of 42,433 H&E-stained whole slide images and 19,636 corresponding pathology reports. Model performance was assessed using image-to-text and text-to-image retrieval, as well as qualitative evaluation of the generated reports by an expert pathologist. Our results demonstrate that text preprocessing prevents hallucination in report generation. Despite the improvement in the quality of the generated reports, training the vision-language model on full reports showed better cross-modal retrieval performance.
format Preprint
id arxiv_https___arxiv_org_abs_2502_19285
institution arXiv
publishDate 2025
record_format arxiv
title On the Importance of Text Preprocessing for Multimodal Representation Learning and Pathology Report Generation
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2502.19285