Bibliographic Details
Main Authors:	Byun, Ye Won; Jiao, Cathy; Noroozizadeh, Shahriar; Sun, Jimin; Vitiello, Rosa
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning; Robotics
Online Access:	https://arxiv.org/abs/2406.17876

Table of Contents:

We introduce a simple method that employs pre-trained CLIP encoders to enhance model generalization in the ALFRED task. In contrast to prior work, in which CLIP replaces the visual encoder, we propose using CLIP as an additional module through an auxiliary object detection objective. We validate our method on the recently proposed Episodic Transformer architecture and demonstrate that incorporating CLIP improves task performance on the unseen validation set. Additionally, our analysis indicates that CLIP especially helps with leveraging object descriptions, detecting small objects, and interpreting rare words.
