Bibliographic Details
Main Authors: Choraria, Moulik, Wu, Xinbo, Basu, Sourya, Sekhar, Nitesh, Wu, Yue, Zhang, Xu, Singhal, Prateek, Varshney, Lav R.
Format: Preprint
Published: 2023
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2311.07449
author Choraria, Moulik
Wu, Xinbo
Basu, Sourya
Sekhar, Nitesh
Wu, Yue
Zhang, Xu
Singhal, Prateek
Varshney, Lav R.
contents General-purpose Vision Language Models (VLMs) have received tremendous interest in recent years, owing to their ability to learn rich vision-language correlations and their broad zero-shot competencies. One immensely popular line of work uses frozen unimodal models, bridging vision representations to language through a trainable module called the QFormer. However, this method relies heavily on large-scale multimodal pretraining, which incurs substantial computational overhead. To that end, we propose a more efficient framework for QFormer-based vision-language alignment. Our key idea relies on the observation that QFormer latents correspond more strongly to the frozen LLM's intermediate latent space. Consequently, instead of using QFormer latents as inputs to the LLM, we alter the framework so that the latents directly condition the LLM latent space for image-to-text generation. We demonstrate that our approach improves the efficiency of vision-language pretraining relative to existing baselines.
format Preprint
id arxiv_https___arxiv_org_abs_2311_07449
institution arXiv
publishDate 2023
record_format arxiv
title Semantically Grounded QFormer for Efficient Vision Language Understanding
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2311.07449