:: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Jie; Yu, Xingtong; Fang, Yuan; Stouffs, Rudi; Trivic, Zdravko
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2602.08342

Similar Items

From Pixels to Predicates: Structuring urban perception with scene graphs
by: Liu, Yunlong, et al.
Published: (2025)

SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence
by: Chen, Jiabin, et al.
Published: (2025)

UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
by: Feng, Jie, et al.
Published: (2025)

UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
by: Zhou, Baichuan, et al.
Published: (2024)

StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
by: Wen, Zhihao, et al.
Published: (2025)

Fine-Grained Urban Flow Inference with Multi-scale Representation Learning
by: Yuan, Shilu, et al.
Published: (2024)

WalkCLIP: Multimodal Learning for Urban Walkability Prediction
by: Xiang, Shilong, et al.
Published: (2025)

UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
by: Zhao, Baining, et al.
Published: (2025)

SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
by: Ye, Guanting, et al.
Published: (2026)

UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
by: Gu, Tiancheng, et al.
Published: (2025)

FLAME: Learning to Navigate with Multimodal LLM in Urban Environments
by: Xu, Yunzhe, et al.
Published: (2024)

UrbanVLA: A Vision-Language-Action Model for Urban Micromobility
by: Li, Anqi, et al.
Published: (2025)

An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
by: Shiri, Fatemeh, et al.
Published: (2024)

PixelBytes: Catching Unified Embedding for Multimodal Generation
by: Furfaro, Fabien
Published: (2024)

AUG: A New Dataset and An Efficient Model for Aerial Image Urban Scene Graph Generation
by: Li, Yansheng, et al.
Published: (2024)

VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings
by: Giahi, Ramin, et al.
Published: (2025)

UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models
by: Yin, Jun, et al.
Published: (2025)

Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description
by: Cheng, Xiaohan, et al.
Published: (2024)

Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models
by: Kim, Donggeun, et al.
Published: (2024)

MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility
by: Wu, Wayne, et al.
Published: (2024)

UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction
by: Hao, Xixuan, et al.
Published: (2024)

A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition
by: Shen, Yaomin, et al.
Published: (2025)

Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
by: Zhou, Yue, et al.
Published: (2025)

UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
by: Liu, Mingxuan, et al.
Published: (2025)

CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments
by: Xu, Haotian, et al.
Published: (2026)

SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments
by: Jiang, Wen, et al.
Published: (2026)

Spatial-Temporal Deep Embedding for Vehicle Trajectory Reconstruction from High-Angle Video
by: Zhang, Tianya T., et al.
Published: (2022)

Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
by: He, Lixuan, et al.
Published: (2025)

SE-VGAE: Unsupervised Disentangled Representation Learning for Interpretable Architectural Layout Design Graph Generation
by: Chen, Jielin, et al.
Published: (2024)

Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
by: Mushkani, Rashid
Published: (2025)

Modular Embedding Recomposition for Incremental Learning
by: Panariello, Aniello, et al.
Published: (2025)

REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing
by: Xu, Weihan, et al.
Published: (2025)

AdaEmbed: Semi-supervised Domain Adaptation in the Embedding Space
by: Mottaghi, Ali, et al.
Published: (2024)

SENSE: Self-Supervised Neural Embeddings for Spatial Ensembles
by: Gadirov, Hamid, et al.
Published: (2025)

Learning Robust Intervention Representations with Delta Embeddings
by: Alimisis, Panagiotis, et al.
Published: (2025)

AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
by: Zeng, Kang, et al.
Published: (2025)

GERA: Geometric Embedding for Efficient Point Registration Analysis
by: Li, Geng, et al.
Published: (2024)

MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising
by: Fu, Chenghan, et al.
Published: (2025)

Unsupervised Document and Template Clustering using Multimodal Embeddings
by: Sampaio, Phillipe R., et al.
Published: (2025)

From Geometric Mimicry to Comprehensive Generation: A Context-Informed Multimodal Diffusion Model for Urban Morphology Synthesis
by: Zhou, Fangshuo, et al.
Published: (2024)