Staff View: LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models :: Library Catalog

Bibliographic Details
Main Authors:	Saxena, Pranav; Bhattacharya, Avigyan; Zhang, Ji; Wang, Wenshan
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Robotics
Online Access:	https://arxiv.org/abs/2509.25528

_version_	1866914104803852288
author	Saxena, Pranav; Bhattacharya, Avigyan; Zhang, Ji; Wang, Wenshan
author_facet	Saxena, Pranav; Bhattacharya, Avigyan; Zhang, Ji; Wang, Wenshan
contents	Referential grounding in outdoor driving scenes is challenging due to large scene variability, many visually similar objects, and dynamic elements that complicate resolving natural-language references (e.g., "the black car on the right"). We propose LLM-RG, a hybrid pipeline that combines off-the-shelf vision-language models for fine-grained attribute extraction with large language models for symbolic reasoning. LLM-RG processes an image and a free-form referring expression by using an LLM to extract relevant object types and attributes, detecting candidate regions, generating rich visual descriptors with a VLM, and then combining these descriptors with spatial metadata into natural-language prompts that are input to an LLM for chain-of-thought reasoning to identify the referent's bounding box. Evaluated on the Talk2Car benchmark, LLM-RG yields substantial gains over both LLM and VLM-based baselines. Additionally, our ablations show that adding 3D spatial cues further improves grounding. Our results demonstrate the complementary strengths of VLMs and LLMs, applied in a zero-shot manner, for robust outdoor referential grounding.
format	Preprint
id	arxiv_https___arxiv_org_abs_2509_25528
institution	arXiv
publishDate	2025
record_format	arxiv
title	LLM-RG: Referential Grounding in Outdoor Scenarios using Large Language Models
topic	Computer Vision and Pattern Recognition; Artificial Intelligence; Robotics
url	https://arxiv.org/abs/2509.25528
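
The contents field describes combining per-candidate visual descriptors with spatial metadata into a natural-language prompt for an LLM. The following is a minimal sketch of that one step only, not the authors' implementation. The `Candidate` class, the `horizontal_zone` heuristic, the prompt wording, and the sample boxes and image width are all hypothetical, and the detector, VLM, and LLM calls are replaced by hard-coded stand-in values.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One detected region; every field name here is a hypothetical choice."""
    label: str        # object type, as a detector might report it
    box: tuple        # (x_min, y_min, x_max, y_max) in pixels
    descriptor: str   # free-text description, as a VLM might produce it


def horizontal_zone(box, image_width):
    """Coarse left/center/right label from the box centre (assumed heuristic)."""
    center_x = (box[0] + box[2]) / 2
    if center_x < image_width / 3:
        return "left"
    if center_x < 2 * image_width / 3:
        return "center"
    return "right"


def build_prompt(expression, candidates, image_width):
    """Assemble a text prompt listing each candidate with its spatial metadata."""
    lines = [f'Referring expression: "{expression}"', "Candidates:"]
    for index, cand in enumerate(candidates):
        zone = horizontal_zone(cand.box, image_width)
        lines.append(
            f"[{index}] {cand.label}; box={cand.box}; "
            f"position={zone}; description: {cand.descriptor}"
        )
    lines.append("Think step by step, then answer with the index of the referent.")
    return "\n".join(lines)


# Stand-in values in place of real detector and VLM output.
candidates = [
    Candidate("car", (100, 400, 300, 520), "a white sedan, parked"),
    Candidate("car", (1200, 380, 1500, 540), "a black hatchback, moving"),
]
prompt = build_prompt("the black car on the right", candidates, image_width=1600)
print(prompt)
```

The resulting string would then be passed to whichever LLM performs the reasoning step; the record does not specify the prompt format the authors use, so the layout above is only illustrative.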
