Staff View: Semantically Grounded QFormer for Efficient Vision Language Understanding :: Library Catalog

Bibliographic Details
Main Authors:	Choraria, Moulik; Wu, Xinbo; Basu, Sourya; Sekhar, Nitesh; Wu, Yue; Zhang, Xu; Singhal, Prateek; Varshney, Lav R.
Format:	Preprint
Published:	2023
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2311.07449

_version_	1866909430737534976
author	Choraria, Moulik; Wu, Xinbo; Basu, Sourya; Sekhar, Nitesh; Wu, Yue; Zhang, Xu; Singhal, Prateek; Varshney, Lav R.
author_facet	Choraria, Moulik; Wu, Xinbo; Basu, Sourya; Sekhar, Nitesh; Wu, Yue; Zhang, Xu; Singhal, Prateek; Varshney, Lav R.
contents	General purpose Vision Language Models (VLMs) have received tremendous interest in recent years, owing to their ability to learn rich vision-language correlations as well as their broad zero-shot competencies. One immensely popular line of work utilizes frozen unimodal models, by bridging vision representations to language using a trainable module called the QFormer. However, this method relies heavily on large-scale multimodal pretraining with huge computational overheads. To that end, we propose a more efficient framework for QFormer-based vision-language alignment. Our key idea relies on the observation that QFormer latents correspond more strongly to the frozen LLM's intermediate latent space. Consequently, instead of using QFormer latents as inputs to the LLM, we alter the framework by using the latents to directly condition the LLM latent space for image-to-text generation. We demonstrate the effectiveness of our approach against existing baselines in improving the efficiency of vision-language pretraining.
format	Preprint
id	arxiv_https___arxiv_org_abs_2311_07449
institution	arXiv
publishDate	2023
record_format	arxiv
title	Semantically Grounded QFormer for Efficient Vision Language Understanding
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2311.07449

Similar Items