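Staff View: An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set :: Library Catalog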

Bibliographic Details
Main Author:	Ai, Chaoyi
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2408.05772

_version_	1866917746236719104
author	Ai, Chaoyi
author_facet	Ai, Chaoyi
contents	Human-Object Interaction (HOI) detection aims to identify the pairs of humans and objects in images and to recognize their relationships, ultimately forming $\langle human, object, verb \rangle$ triplets. Under default settings, HOI performance is nearly saturated, and many studies now focus on long-tail distributions and zero-shot/few-shot scenarios. Let us consider an intriguing problem: "What if only a test dataset is available, with no training dataset, and a multimodal visual foundation model is used in a training-free manner?" This study uses two experimental settings: ground truth and random arbitrary combinations. We reach some interesting conclusions and find that the open-vocabulary capabilities of the multimodal visual foundation model are not yet fully realized. Additionally, replacing the feature extraction with Grounding DINO further confirms these findings.
format	Preprint
id	arxiv_https___arxiv_org_abs_2408_05772
institution	arXiv
publishDate	2024
record_format	arxiv
spellingShingle	An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set Ai, Chaoyi Computer Vision and Pattern Recognition Artificial Intelligence Human-Object Interaction (HOI) detection aims to identify the pairs of humans and objects in images and to recognize their relationships, ultimately forming $\langle human, object, verb \rangle$ triplets. Under default settings, HOI performance is nearly saturated, and many studies now focus on long-tail distributions and zero-shot/few-shot scenarios. Let us consider an intriguing problem: "What if only a test dataset is available, with no training dataset, and a multimodal visual foundation model is used in a training-free manner?" This study uses two experimental settings: ground truth and random arbitrary combinations. We reach some interesting conclusions and find that the open-vocabulary capabilities of the multimodal visual foundation model are not yet fully realized. Additionally, replacing the feature extraction with Grounding DINO further confirms these findings.
title	An analysis of HOI: using a training-free method with multimodal visual foundation models when only the test set is available, without the training set
topic	Computer Vision and Pattern Recognition; Artificial Intelligence
url	https://arxiv.org/abs/2408.05772

Similar Items