Staff View: Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection :: Library Catalog

Bibliographic Details
Main Authors:	Salzmann, Tim; Ryll, Markus; Bewley, Alex; Minderer, Matthias
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning; Robotics
Online Access:	https://arxiv.org/abs/2403.14270

_version_	1866913436259057664
author	Salzmann, Tim; Ryll, Markus; Bewley, Alex; Minderer, Matthias
author_facet	Salzmann, Tim; Ryll, Markus; Bewley, Alex; Minderer, Matthias
contents	Visual relationship detection aims to identify objects and their relationships in images. Prior methods approach this task by adding separate relationship modules or decoders to existing object detection architectures. This separation increases complexity and hinders end-to-end training, which limits performance. We propose a simple and highly efficient decoder-free architecture for open-vocabulary visual relationship detection. Our model consists of a Transformer-based image encoder that represents objects as tokens and models their relationships implicitly. To extract relationship information, we introduce an attention mechanism that selects object pairs likely to form a relationship. We provide a single-stage recipe to train this model on a mixture of object and relationship detection data. Our approach achieves state-of-the-art relationship detection performance on Visual Genome and on the large-vocabulary GQA benchmark at real-time inference speeds. We provide ablations, real-world qualitative examples, and analyses of zero-shot performance.
format	Preprint
id	arxiv_https___arxiv_org_abs_2403_14270
institution	arXiv
publishDate	2024
record_format	arxiv
title	Scene-Graph ViT: End-to-End Open-Vocabulary Visual Relationship Detection
topic	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning; Robotics
url	https://arxiv.org/abs/2403.14270

Similar Items