Bibliographic Details
Main Authors: Saxena, Pranav; Bhattacharya, Avigyan; Zhang, Ji; Wang, Wenshan
Format: Preprint
Published: 2025
Subjects:
Online Access: https://arxiv.org/abs/2509.25528
author Saxena, Pranav
Bhattacharya, Avigyan
Zhang, Ji
Wang, Wenshan
contents Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning. LLM-RG processes an image and a free-form referring expression by using an LLM to extract relevant object types and attributes, detecting candidate regions, generating rich visual descriptors with a VLM, and then combining these descriptors with spatial metadata into natural-language prompts that are input to an LLM for chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
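The prompt-assembly step described in the abstract (combining per-candidate descriptors with spatial metadata into a natural-language prompt) could look roughly like the following. This is a minimal sketch and not the authors' implementation: the `Candidate` fields, the left/center/right bucketing, the prompt wording, and the example boxes are all assumptions made for illustration, and the detector, VLM, and LLM calls are left out entirely.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One detected region; every field here is a hypothetical illustration."""
    box: tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in pixels
    label: str                      # object type, e.g. from a detector
    descriptor: str                 # free-text attributes, e.g. from a VLM


def horizontal_position(box, image_width):
    """Bucket a box as left/center/right by its horizontal midpoint (assumed scheme)."""
    center_x = (box[0] + box[2]) / 2
    if center_x < image_width / 3:
        return "left"
    if center_x < 2 * image_width / 3:
        return "center"
    return "right"


def build_prompt(expression, candidates, image_width):
    """Combine descriptors and spatial metadata into one natural-language prompt."""
    lines = [f'Referring expression: "{expression}"', "Candidates:"]
    for index, cand in enumerate(candidates):
        position = horizontal_position(cand.box, image_width)
        lines.append(
            f"[{index}] {cand.label}: {cand.descriptor}; "
            f"box={cand.box}; horizontal position={position}"
        )
    lines.append("Reason step by step, then answer with the index of the referent.")
    return "\n".join(lines)


# Made-up candidates for the abstract's example expression.
candidates = [
    Candidate((50, 300, 250, 450), "car", "white sedan, parked"),
    Candidate((1200, 320, 1500, 480), "car", "black hatchback, moving"),
]
prompt = build_prompt("the black car on the right", candidates, image_width=1600)
print(prompt)
```

The resulting string would then be passed to an LLM for chain-of-thought reasoning, and the returned index mapped back to the matching bounding box.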
format Preprint
id arxiv_https___arxiv_org_abs_2509_25528
institution arXiv
publishDate 2025
record_format arxiv
title LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
topic Computer Vision and Pattern Recognition
Artificial Intelligence
Robotics
url https://arxiv.org/abs/2509.25528