| Main Authors: | Zhang, Jie, Yu, Xingtong, Fang, Yuan, Stouffs, Rudi, Trivic, Zdravko |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2602.08342 |
Similar Items
From Pixels to Predicates: Structuring urban perception with scene graphs
by: Liu, Yunlong, et al.
Published: (2025)
SpatialLLM: From Multi-modality Data to Urban Spatial Intelligence
by: Chen, Jiabin, et al.
Published: (2025)
UrbanLLaVA: A Multi-modal Large Language Model for Urban Intelligence with Spatial Reasoning and Understanding
by: Feng, Jie, et al.
Published: (2025)
UrBench: A Comprehensive Benchmark for Evaluating Large Multimodal Models in Multi-View Urban Scenarios
by: Zhou, Baichuan, et al.
Published: (2024)
StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering
by: Wen, Zhihao, et al.
Published: (2025)
Fine-Grained Urban Flow Inference with Multi-scale Representation Learning
by: Yuan, Shilu, et al.
Published: (2024)
WalkCLIP: Multimodal Learning for Urban Walkability Prediction
by: Xiang, Shilong, et al.
Published: (2025)
UrbanVideo-Bench: Benchmarking Vision-Language Models on Embodied Intelligence with Video Data in Urban Spaces
by: Zhao, Baining, et al.
Published: (2025)
SoPE: Spherical Coordinate-Based Positional Embedding for Enhancing Spatial Perception of 3D LVLMs
by: Ye, Guanting, et al.
Published: (2026)
UniME-V2: MLLM-as-a-Judge for Universal Multimodal Embedding Learning
by: Gu, Tiancheng, et al.
Published: (2025)
FLAME: Learning to Navigate with Multimodal LLM in Urban Environments
by: Xu, Yunzhe, et al.
Published: (2024)
UrbanVLA: A Vision-Language-Action Model for Urban Micromobility
by: Li, Anqi, et al.
Published: (2025)
An Empirical Analysis on Spatial Reasoning Capabilities of Large Multimodal Models
by: Shiri, Fatemeh, et al.
Published: (2024)
PixelBytes: Catching Unified Embedding for Multimodal Generation
by: Furfaro, Fabien
Published: (2024)
AUG: A New Dataset and An Efficient Model for Aerial Image Urban Scene Graph Generation
by: Li, Yansheng, et al.
Published: (2024)
VL-CLIP: Enhancing Multimodal Recommendations via Visual Grounding and LLM-Augmented CLIP Embeddings
by: Giahi, Ramin, et al.
Published: (2025)
UrbanSense: A Framework for Quantitative Analysis of Urban Streetscapes leveraging Vision Large Language Models
by: Yin, Jun, et al.
Published: (2025)
Algorithm Research of ELMo Word Embedding and Deep Learning Multimodal Transformer in Image Description
by: Cheng, Xiaohan, et al.
Published: (2024)
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models
by: Kim, Donggeun, et al.
Published: (2024)
MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility
by: Wu, Wayne, et al.
Published: (2024)
UrbanVLP: Multi-Granularity Vision-Language Pretraining for Urban Socioeconomic Indicator Prediction
by: Hao, Xixuan, et al.
Published: (2024)
A-MESS: Anchor based Multimodal Embedding with Semantic Synchronization for Multimodal Intent Recognition
by: Shen, Yaomin, et al.
Published: (2025)
Multimodal Mathematical Reasoning Embedded in Aerial Vehicle Imagery: Benchmarking, Analysis, and Exploration
by: Zhou, Yue, et al.
Published: (2025)
UrbanVerse: Scaling Urban Simulation by Watching City-Tour Videos
by: Liu, Mingxuan, et al.
Published: (2025)
CityCube: Benchmarking Cross-view Spatial Reasoning on Vision-Language Models in Urban Environments
by: Xu, Haotian, et al.
Published: (2026)
SpatialFly: Geometry-Guided Representation Alignment for UAV Vision-and-Language Navigation in Urban Environments
by: Jiang, Wen, et al.
Published: (2026)
Spatial-Temporal Deep Embedding for Vehicle Trajectory Reconstruction from High-Angle Video
by: Zhang, Tianya T., et al.
Published: (2022)
Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
by: He, Lixuan, et al.
Published: (2025)
SE-VGAE: Unsupervised Disentangled Representation Learning for Interpretable Architectural Layout Design Graph Generation
by: Chen, Jielin, et al.
Published: (2024)
Do Vision-Language Models See Urban Scenes as People Do? An Urban Perception Benchmark
by: Mushkani, Rashid
Published: (2025)
Modular Embedding Recomposition for Incremental Learning
by: Panariello, Aniello, et al.
Published: (2025)
REGen: Multimodal Retrieval-Embedded Generation for Long-to-Short Video Editing
by: Xu, Weihan, et al.
Published: (2025)
AdaEmbed: Semi-supervised Domain Adaptation in the Embedding Space
by: Mottaghi, Ali, et al.
Published: (2024)
SENSE: Self-Supervised Neural Embeddings for Spatial Ensembles
by: Gadirov, Hamid, et al.
Published: (2025)
Learning Robust Intervention Representations with Delta Embeddings
by: Alimisis, Panagiotis, et al.
Published: (2025)
AVAM: Universal Training-free Adaptive Visual Anchoring Embedded into Multimodal Large Language Model for Multi-image Question Answering
by: Zeng, Kang, et al.
Published: (2025)
GERA: Geometric Embedding for Efficient Point Registration Analysis
by: Li, Geng, et al.
Published: (2024)
MOON Embedding: Multimodal Representation Learning for E-commerce Search Advertising
by: Fu, Chenghan, et al.
Published: (2025)
Unsupervised Document and Template Clustering using Multimodal Embeddings
by: Sampaio, Phillipe R., et al.
Published: (2025)
From Geometric Mimicry to Comprehensive Generation: A Context-Informed Multimodal Diffusion Model for Urban Morphology Synthesis
by: Zhou, Fangshuo, et al.
Published: (2024)