Bibliographic Details
Main Authors: Byun, Ye Won; Jiao, Cathy; Noroozizadeh, Shahriar; Sun, Jimin; Vitiello, Rosa
Format: Preprint
Published: 2024
Online Access: https://arxiv.org/abs/2406.17876
Table of Contents:
  • We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization on the ALFRED task. In contrast to previous literature where CLIP replaces the visual encoder, we suggest using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis suggests that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.