Bibliographic Details
Main Authors:	Pate, Seth; Wong, Lawson L. S.
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2410.03900

Abstract:

We study the task of locating a user in a mapped indoor environment using natural language queries and images from the environment. Building on recent pretrained vision-language models, we learn a similarity score between text descriptions and images of locations in the environment. This score allows us to identify the locations that best match the language query and thereby estimate the user's location. Our approach can localize in environments, and with text and images, that were not seen during training. One model, a finetuned CLIP, outperformed humans in our evaluation.
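
As a rough illustration of the retrieval step described above, the following minimal sketch ranks candidate locations by the similarity between a query embedding and image embeddings. It is not the authors' implementation: the toy three-dimensional vectors stand in for embeddings that a vision-language model would produce, the use of cosine similarity as the score is an assumption, the names `cosine_similarity` and `localize` are hypothetical, and taking the maximum score over a location's images is an assumed aggregation rule.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def localize(query_embedding, location_embeddings):
    """Score each location against the query and return the best match.

    location_embeddings maps a location name to a list of image
    embeddings taken at that location. A location's score is the
    highest similarity among its images (an assumed aggregation rule).
    """
    scores = {}
    for name, image_embeddings in location_embeddings.items():
        scores[name] = max(
            cosine_similarity(query_embedding, emb) for emb in image_embeddings
        )
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings (hypothetical values, not model outputs).
query = [0.9, 0.1, 0.0]
locations = {
    "kitchen": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.1]],
    "hallway": [[0.0, 1.0, 0.0]],
    "office": [[0.1, 0.1, 1.0]],
}

best, scores = localize(query, locations)
print(best)
```

In practice the vectors would come from the text and image encoders of a model such as CLIP, and the estimated user location is whichever mapped location receives the highest score.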

Similar Items