| Main Authors: | Xiao, Linhui, Yang, Xiaoshan, Peng, Fang, Wang, Yaowei, Xu, Changsheng |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2410.08021 |
Similar Items
HiVG: Hierarchical Multimodal Fine-grained Modulation for Visual Grounding
by: Xiao, Linhui, et al.
Published: (2024)
CLIP-VG: Self-paced Curriculum Adapting of CLIP for Visual Grounding
by: Xiao, Linhui, et al.
Published: (2023)
Towards Visual Grounding: A Survey
by: Xiao, Linhui, et al.
Published: (2024)
Pilot: Building the Federated Multimodal Instruction Tuning Framework
by: Xiong, Baochen, et al.
Published: (2025)
RefMask3D: Language-Guided Transformer for 3D Referring Segmentation
by: He, Shuting, et al.
Published: (2024)
BARE: Towards Bias-Aware and Reasoning-Enhanced One-Tower Visual Grounding
by: Li, Hongbing, et al.
Published: (2026)
FineCops-Ref: A new Dataset and Task for Fine-Grained Compositional Referring Expression Comprehension
by: Liu, Junzhuo, et al.
Published: (2024)
Mask Grounding for Referring Image Segmentation
by: Chng, Yong Xien, et al.
Published: (2023)
Libra: Building Decoupled Vision System on Large Language Models
by: Xu, Yifan, et al.
Published: (2024)
RefCut: Interactive Segmentation with Reference Guidance
by: Lin, Zheng, et al.
Published: (2025)
ViRefSAM: Visual Reference-Guided Segment Anything Model for Remote Sensing Segmentation
by: Bi, Hanbo, et al.
Published: (2025)
RefSAM: Efficiently Adapting Segmenting Anything Model for Referring Video Object Segmentation
by: Li, Yonglin, et al.
Published: (2023)
RefBench-PRO: Perceptual and Reasoning Oriented Benchmark for Referring Expression Comprehension
by: Gao, Tianyi, et al.
Published: (2025)
RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes
by: Sun, Zhichao, et al.
Published: (2025)
StoryImager: A Unified and Efficient Framework for Coherent Story Visualization and Completion
by: Tao, Ming, et al.
Published: (2024)
RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
by: Pathiraja, Bimsara, et al.
Published: (2025)
Ref-Adv: Exploring MLLM Visual Reasoning in Referring Expression Tasks
by: Dong, Qihua, et al.
Published: (2026)
RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios
by: Huang, Jie, et al.
Published: (2024)
RefComp: A Reference-guided Unified Framework for Unpaired Point Cloud Completion
by: Yang, Yixuan, et al.
Published: (2025)
A Comprehensive Review of Few-shot Action Recognition
by: Wanyan, Yuyang, et al.
Published: (2024)
Do We Need to Design Specific Diffusion Models for Different Tasks? Try ONE-PIC
by: Tao, Ming, et al.
Published: (2024)
Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes
by: Wang, Yaoting, et al.
Published: (2024)
RefAlign: Representation Alignment for Reference-to-Video Generation
by: Wang, Lei, et al.
Published: (2026)
TransRef: Multi-Scale Reference Embedding Transformer for Reference-Guided Image Inpainting
by: Liu, Taorong, et al.
Published: (2023)
OneVOS: Unifying Video Object Segmentation with All-in-One Transformer Framework
by: Li, Wanyun, et al.
Published: (2024)
Latent Expression Generation for Referring Image Segmentation and Grounding
by: Yu, Seonghoon, et al.
Published: (2025)
Modality-Collaborative Low-Rank Decomposers for Few-Shot Video Domain Adaptation
by: Wanyan, Yuyang, et al.
Published: (2025)
Towards Domain-Generalized Open-Vocabulary Object Detection: A Progressive Domain-invariant Cross-modal Alignment Method
by: Xu, Xiaoran, et al.
Published: (2026)
One for All: Toward Unified Foundation Models for Earth Vision
by: Xiong, Zhitong, et al.
Published: (2024)
Beyond Referring Expressions: Scenario Comprehension Visual Grounding
by: He, Ruozhen, et al.
Published: (2026)
Towards Unified Referring Expression Segmentation Across Omni-Level Visual Target Granularities
by: Liu, Jing, et al.
Published: (2025)
MotionCrafter: One-Shot Motion Customization of Diffusion Models
by: Zhang, Yuxin, et al.
Published: (2023)
MVGGT: Multimodal Visual Geometry Grounded Transformer for Multiview 3D Referring Expression Segmentation
by: Wu, Changli, et al.
Published: (2026)
Harnessing Vision-Language Pretrained Models with Temporal-Aware Adaptation for Referring Video Object Segmentation
by: Zhou, Zikun, et al.
Published: (2024)
Unlocking the Potential of MLLMs in Referring Expression Segmentation via a Light-weight Mask Decoder
by: Wang, Jingchao, et al.
Published: (2025)
Deforming Videos to Masks: Flow Matching for Referring Video Segmentation
by: Wang, Zanyi, et al.
Published: (2025)
OneReward: Unified Mask-Guided Image Generation via Multi-Task Human Preference Learning
by: Gong, Yuan, et al.
Published: (2025)
ReferDINO: Referring Video Object Segmentation with Visual Grounding Foundations
by: Liang, Tianming, et al.
Published: (2025)
FOCUS: Unified Vision-Language Modeling for Interactive Editing Driven by Referential Segmentation
by: Yang, Fan, et al.
Published: (2025)
RefTok: Reference-Based Tokenization for Video Generation
by: Fan, Xiang, et al.
Published: (2025)