Bibliographic Details
Main Authors: Jiang, Yuncheng; Zhang, Zixun; Wei, Jun; Feng, Chun-Mei; Li, Guanbin; Wan, Xiang; Cui, Shuguang; Li, Zhen
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2408.14051
Abstract:
  • AI-assisted lesion detection models play a crucial role in the early screening of cancer. However, previous image-based models ignore the inter-frame contextual information present in videos, while video-based models capture this context but are computationally expensive. To resolve this trade-off, we explore Video-to-Image knowledge distillation leveraging DEtection TRansformer (V2I-DETR) for medical video lesion detection. V2I-DETR adopts a teacher-student paradigm: the teacher network extracts temporal context from multiple frames and transfers it to the student network, an image-based model dedicated to fast prediction at inference time. By distilling multi-frame context into a single frame, V2I-DETR combines the temporal context of video-based models with the inference speed of image-based models. In extensive experiments, V2I-DETR outperforms previous state-of-the-art methods by a large margin while running at the same real-time inference speed (30 FPS) as the image-based model.
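
The teacher-student idea described in the abstract can be illustrated with a toy sketch. This is not the V2I-DETR implementation: the abstract does not specify how the teacher aggregates frames or which distillation loss is used, so the mean-pooling teacher, the mean-squared-error loss, and the names `teacher_context`, `distillation_loss`, and `student_step` are all assumptions chosen for illustration.

```python
from typing import List

Vector = List[float]


def teacher_context(frame_features: List[Vector]) -> Vector:
    """Hypothetical teacher: average per-frame features into one context vector.

    Assumption: mean pooling stands in for whatever temporal aggregation
    the real teacher network performs.
    """
    n = len(frame_features)
    dim = len(frame_features[0])
    return [sum(f[i] for f in frame_features) / n for i in range(dim)]


def distillation_loss(student: Vector, teacher: Vector) -> float:
    """Mean squared error between student and teacher features (assumed loss)."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(student)


def student_step(student: Vector, teacher: Vector, lr: float) -> Vector:
    """One gradient-descent step on the MSE loss with respect to the student features."""
    dim = len(student)
    return [s - lr * 2.0 * (s - t) / dim for s, t in zip(student, teacher)]


if __name__ == "__main__":
    # Three frames of a toy "video", each with a 2-dimensional feature.
    frames = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0]]
    target = teacher_context(frames)      # multi-frame context from the teacher
    student = [0.0, 0.0]                  # single-frame student feature
    before = distillation_loss(student, target)
    student = student_step(student, target, lr=0.5)
    after = distillation_loss(student, target)
    print(before, after)                  # the loss shrinks after one step
```

At inference time only the student would run, on one frame at a time, which is why the approach can keep the speed of an image-based model while having been trained against multi-frame context.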