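Staff View: UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web :: Library Catalog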

Bibliographic Details
Main Authors:	Yan, Yibo; Wen, Haomin; Zhong, Siru; Chen, Wei; Chen, Haodong; Wen, Qingsong; Zimmermann, Roger; Liang, Yuxuan
Format:	Preprint
Published:	2023
Subjects:	Computation and Language; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2310.18340

_version_	1866910378908188672
author	Yan, Yibo; Wen, Haomin; Zhong, Siru; Chen, Wei; Chen, Haodong; Wen, Qingsong; Zimmermann, Roger; Liang, Yuxuan
author_facet	Yan, Yibo; Wen, Haomin; Zhong, Siru; Chen, Wei; Chen, Haodong; Wen, Qingsong; Zimmermann, Roger; Liang, Yuxuan
contents	Urban region profiling from web-sourced data is of utmost importance for urban planning and sustainable development. We are witnessing a rising trend of LLMs in various fields, especially in multi-modal research such as vision-language learning, where the text modality serves as supplementary information for the image. Since the textual modality has never been introduced into modality combinations in urban region profiling, we aim to answer two fundamental questions in this paper: i) Can the textual modality enhance urban region profiling? ii) If so, in what ways and with regard to which aspects? To answer these questions, we leverage the power of Large Language Models (LLMs) and introduce the first LLM-enhanced framework that integrates the knowledge of the textual modality into urban imagery profiling, named LLM-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining (UrbanCLIP). Specifically, it first generates a detailed textual description for each satellite image using an open-source Image-to-Text LLM. Then, the model is trained on the image-text pairs, seamlessly unifying natural language supervision for urban visual representation learning, jointly with a contrastive loss and a language modeling loss. Results on predicting three urban indicators in four major Chinese metropolises demonstrate its superior performance, with an average improvement of 6.1% on R^2 compared to the state-of-the-art methods. Our code and the image-language dataset will be released upon paper notification.
format	Preprint
id	arxiv_https___arxiv_org_abs_2310_18340
institution	arXiv
publishDate	2023
record_format	arxiv
title	UrbanCLIP: Learning Text-enhanced Urban Region Profiling with Contrastive Language-Image Pretraining from the Web
topic	Computation and Language; Artificial Intelligence
url	https://arxiv.org/abs/2310.18340
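
Note on the contents field: the abstract states that the model is trained on image-text pairs "jointly with contrastive loss and language modeling loss". The record gives no further detail, so the following is only a minimal sketch of what such a joint objective commonly looks like, not the authors' implementation. The function names, the temperature of 0.07, the symmetric (image-to-text plus text-to-image) form of the contrastive term, and the equal loss weights are all assumptions made for illustration.

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # Symmetric InfoNCE-style loss: row i of img_emb is assumed to be
    # paired with row i of txt_emb; all other rows act as negatives.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (batch, batch) similarities
    idx = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits, axis=1)[idx, idx].mean()
    text_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)

def language_modeling_loss(token_logits, target_ids):
    # Mean token-level cross-entropy.
    # token_logits: (batch, seq_len, vocab); target_ids: (batch, seq_len) ints.
    log_probs = log_softmax(token_logits, axis=-1)
    picked = np.take_along_axis(log_probs, target_ids[..., None], axis=-1)
    return -picked.mean()

def joint_loss(img_emb, txt_emb, token_logits, target_ids,
               w_contrastive=1.0, w_lm=1.0):
    # Weighted sum of the two terms; the weights are hypothetical.
    return (w_contrastive * contrastive_loss(img_emb, txt_emb)
            + w_lm * language_modeling_loss(token_logits, target_ids))
```

In this sketch, correctly paired embeddings drive the contrastive term toward zero, while the language modeling term penalizes a caption decoder for assigning low probability to the generated description's tokens. How UrbanCLIP actually weights or schedules the two terms is not stated in this record.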

Similar Items