Bibliographic Details
Main Authors: Ren, Tianhe; Liu, Shilong; Zeng, Ailing; Lin, Jing; Li, Kunchang; Cao, He; Chen, Jiayu; Huang, Xinyu; Chen, Yukang; Yan, Feng; Zeng, Zhaoyang; Zhang, Hao; Li, Feng; Yang, Jie; Li, Hongyang; Jiang, Qing; Zhang, Lei
Format: Preprint
Published: 2024
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2401.14159
author Ren, Tianhe
Liu, Shilong
Zeng, Ailing
Lin, Jing
Li, Kunchang
Cao, He
Chen, Jiayu
Huang, Xinyu
Chen, Yukang
Yan, Feng
Zeng, Zhaoyang
Zhang, Hao
Li, Feng
Yang, Jie
Li, Hongyang
Jiang, Qing
Zhang, Lei
contents We introduce Grounded SAM, which combines Grounding DINO, used as an open-set object detector, with the Segment Anything Model (SAM). This integration enables the detection and segmentation of arbitrary regions from free-form text inputs and opens the door to connecting various vision models. As shown in Fig. 1, the versatile Grounded SAM pipeline supports a wide range of vision tasks. For example, incorporating models such as BLIP and Recognize Anything yields an automatic annotation pipeline that requires only input images. Additionally, incorporating Stable Diffusion allows for controllable image editing, while integrating OSX enables promptable 3D human motion analysis. Grounded SAM also performs strongly on open-vocabulary benchmarks, achieving 48.7 mean AP on the SegInW (Segmentation in the Wild) zero-shot benchmark with the combination of the Grounding DINO-Base and SAM-Huge models.
format Preprint
id arxiv_https___arxiv_org_abs_2401_14159
institution arXiv
publishDate 2024
record_format arxiv
title Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2401.14159