:: Library Catalog

Bibliographic Details
Main Authors:	Su, Yiyang, Liu, Xiaoming
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2603.05708

Similar Items

Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
by: Zhu, Jie, et al.
Published: (2026)

HAMoBE: Hierarchical and Adaptive Mixture of Biometric Experts for Video-based Person ReID
by: Su, Yiyang, et al.
Published: (2025)

KeyPoint Relative Position Encoding for Face Recognition
by: Kim, Minchul, et al.
Published: (2024)

SapiensID: Foundation for Human Recognition
by: Kim, Minchul, et al.
Published: (2025)

Open-Set Biometrics: Beyond Good Closed-Set Models
by: Su, Yiyang, et al.
Published: (2024)

GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
by: Wang, Yikun, et al.
Published: (2025)

U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation
by: Deng, Xiang, et al.
Published: (2026)

FusionAgent: A Multimodal Agent with Dynamic Model Selection for Human Recognition
by: Zhu, Jie, et al.
Published: (2026)

A Quality-Guided Mixture of Score-Fusion Experts Framework for Human Recognition
by: Zhu, Jie, et al.
Published: (2025)

Learning to Wander: Improving the Global Image Geolocation Ability of LMMs via Actionable Reasoning
by: Zheng, Yushuo, et al.
Published: (2026)

LocalScore: Local Density-Aware Similarity Scoring for Biometrics
by: Su, Yiyang, et al.
Published: (2026)

Robust Disentangled Counterfactual Learning for Physical Audiovisual Commonsense Reasoning
by: Qi, Mengshi, et al.
Published: (2025)

Statewide Visual Geolocalization in the Wild
by: Fervers, Florian, et al.
Published: (2024)

Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework
by: Song, Zirui, et al.
Published: (2025)

Pushing the Frontier of Audiovisual Perception with Large-Scale Multimodal Correspondence Learning
by: Vyas, Apoorv, et al.
Published: (2025)

Audiovisual Masked Autoencoders
by: Georgescu, Mariana-Iuliana, et al.
Published: (2022)

GeoRC: A Benchmark for Geolocation Reasoning Chains
by: Talreja, Mohit, et al.
Published: (2026)

AVERE: Improving Audiovisual Emotion Reasoning with Preference Optimization
by: Chaubey, Ashutosh, et al.
Published: (2026)

Unmasking Illusions: Understanding Human Perception of Audiovisual Deepfakes
by: Hashmi, Ammarah, et al.
Published: (2024)

Zwitscherkasten -- DIY Audiovisual bird monitoring
by: Blum, Dominik, et al.
Published: (2026)

Rethinking Vision-Language Model in Face Forensics: Multi-Modal Interpretable Forged Face Detector
by: Guo, Xiao, et al.
Published: (2025)

PIGEON: Predicting Image Geolocations
by: Haas, Lukas, et al.
Published: (2023)

GaGA: Towards Interactive Global Geolocation Assistant
by: Dou, Zhiyang, et al.
Published: (2024)

Reasoning-Aligned Perception Decoupling for Scalable Multi-modal Reasoning
by: Gou, Yunhao, et al.
Published: (2025)

VisionReasoner: Unified Reasoning-Integrated Visual Perception via Reinforcement Learning
by: Liu, Yuqi, et al.
Published: (2025)

GeoRouter: Dynamic Paradigm Routing for Worldwide Image Geolocalization
by: Jia, Pengyue, et al.
Published: (2026)

LLMGeo: Benchmarking Large Language Models on Image Geolocation In-the-wild
by: Wang, Zhiqiang, et al.
Published: (2024)

GeoRanker: Distance-Aware Ranking for Worldwide Image Geolocalization
by: Jia, Pengyue, et al.
Published: (2025)

Granular Privacy Control for Geolocation with Vision Language Models
by: Mendes, Ethan, et al.
Published: (2024)

AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration
by: Chen, Xinlong, et al.
Published: (2025)

GAReT: Cross-view Video Geolocalization with Adapters and Auto-Regressive Transformers
by: Pillai, Manu S, et al.
Published: (2024)

Frequency-guided Multi-level Reasoning for Scene Graph Generation in Video
by: Li, Chenxing, et al.
Published: (2026)

Referee: Reference-aware Audiovisual Deepfake Detection
by: Boo, Hyemin, et al.
Published: (2025)

HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation
by: Gadi, Hari Krishna, et al.
Published: (2026)

Self-supervised Audiovisual Representation Learning for Remote Sensing Data
by: Heidler, Konrad, et al.
Published: (2021)

X-Streamer: Unified Human World Modeling with Audiovisual Interaction
by: Xie, You, et al.
Published: (2025)

Image-Based Geolocation Using Large Vision-Language Models
by: Liu, Yi, et al.
Published: (2024)

Where on Earth? A Vision-Language Benchmark for Probing Model Geolocation Skills Across Scales
by: Qian, Zhaofang, et al.
Published: (2025)

Combi-CAM: A Novel Multi-Layer Approach for Explainable Image Geolocalization
by: Faget, David, et al.
Published: (2026)

Assessing the Geolocation Capabilities, Limitations and Societal Risks of Generative Vision-Language Models
by: Grainge, Oliver, et al.
Published: (2025)