Library Catalog



Bibliographic Details
Main Authors:	Du, Keqing; Yang, Xinyu; Chen, Hang
Format:	Preprint
Published:	2023
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2311.12401

Similar Items

SAM as the Guide: Mastering Pseudo-Label Refinement in Semi-Supervised Referring Expression Segmentation
by: Yang, Danni, et al.
Published: (2024)

IVAC-P2L: Leveraging Irregular Repetition Priors for Improving Video Action Counting
by: Wang, Hang, et al.
Published: (2024)

G-Refine: A General Quality Refiner for Text-to-Image Generation
by: Li, Chunyi, et al.
Published: (2024)

Generative Frame Sampler for Long Video Understanding
by: Yao, Linli, et al.
Published: (2025)

Prompt-aware of Frame Sampling for Efficient Text-Video Retrieval
by: Zhang, Deyu, et al.
Published: (2025)

Boosting Temporal Sentence Grounding via Causal Inference
by: Tang, Kefan, et al.
Published: (2025)

TextRefiner: Internal Visual Feature as Efficient Refiner for Vision-Language Models Prompt Tuning
by: Xie, Jingjing, et al.
Published: (2024)

Where Does Vision Meet Language? Understanding and Refining Visual Fusion in MLLMs via Contrastive Attention
by: Song, Shezheng, et al.
Published: (2026)

Audio-Visual Segmentation via Unlabeled Frame Exploitation
by: Liu, Jinxiang, et al.
Published: (2024)

HFS: Holistic Query-Aware Frame Selection for Efficient Video Reasoning
by: Yang, Yiqing, et al.
Published: (2025)

Noise-Tolerant Learning for Audio-Visual Action Recognition
by: Han, Haochen, et al.
Published: (2022)

Probabilistic Temporal Masked Attention for Cross-view Online Action Detection
by: Xie, Liping, et al.
Published: (2025)

Wavelet-Decoupling Contrastive Enhancement Network for Fine-Grained Skeleton-Based Action Recognition
by: Chang, Haochen, et al.
Published: (2024)

An Empirical Comparison of Video Frame Sampling Methods for Multi-Modal RAG Retrieval
by: Kandhare, Mahesh, et al.
Published: (2024)

SkeFi: Cross-Modal Knowledge Transfer for Wireless Skeleton-Based Action Recognition
by: Huang, Shunyu, et al.
Published: (2026)

Memory-Guided View Refinement for Dynamic Human-in-the-loop EQA
by: Lu, Xin, et al.
Published: (2026)

Joint Flow And Feature Refinement Using Attention For Video Restoration
by: Merugu, Ranjith, et al.
Published: (2025)

Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
by: Zhang, Peng-Fei, et al.
Published: (2026)

Interpretable Concept-based Deep Learning Framework for Multimodal Human Behavior Modeling
by: Li, Xinyu, et al.
Published: (2025)

READ-Net: Clarifying Emotional Ambiguity via Adaptive Feature Recalibration for Audio-Visual Depression Detection
by: Chen, Chenglizhao, et al.
Published: (2026)

Causal Debiasing for Visual Commonsense Reasoning
by: Zou, Jiayi, et al.
Published: (2025)

Can Video Diffusion Models Predict Past Frames? Bidirectional Cycle Consistency for Reversible Interpolation
by: Liu, Lingyu, et al.
Published: (2026)

Human Action Recognition without Human
by: Kataoka, Hirokatsu, et al.
Published: (2016)

Zero-shot Video Moment Retrieval via Off-the-shelf Multimodal Large Language Models
by: Xu, Yifang, et al.
Published: (2025)

Calibration & Reconstruction: Deep Integrated Language for Referring Image Segmentation
by: Yan, Yichen, et al.
Published: (2024)

Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
by: Zhou, Jinxing, et al.
Published: (2024)

CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
by: Lu, Zhenyu, et al.
Published: (2025)

Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
by: Zhu, Sa, et al.
Published: (2026)

Semi-supervised Semantic Segmentation with Multi-Constraint Consistency Learning
by: Yin, Jianjian, et al.
Published: (2025)

Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM
by: Zhang, Pingping, et al.
Published: (2024)

Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion
by: Chen, Sen, et al.
Published: (2022)

CUS3D: CLIP-based Unsupervised 3D Segmentation via Object-level Denoise
by: Yu, Fuyang, et al.
Published: (2024)

CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval
by: Yu, Hang, et al.
Published: (2025)

Multi-scale Activation, Refinement, and Aggregation: Exploring Diverse Cues for Fine-Grained Bird Recognition
by: Zhang, Zhicheng, et al.
Published: (2025)

Unraveling Instance Associations: A Closer Look for Audio-Visual Segmentation
by: Chen, Yuanhong, et al.
Published: (2023)

Rethinking Radiology Report Generation via Causal Inspired Counterfactual Augmentation
by: Song, Xiao, et al.
Published: (2023)

Hypergraph Tversky-Aware Domain Incremental Learning for Brain Tumor Segmentation with Missing Modalities
by: Wang, Junze, et al.
Published: (2025)

Self-similarity Prior Distillation for Unsupervised Remote Physiological Measurement
by: Zhang, Xinyu, et al.
Published: (2023)

POINTS1.5: Building a Vision-Language Model towards Real World Applications
by: Liu, Yuan, et al.
Published: (2024)

EntroAD: Structural Entropy-Guided Prompt Adaptation for Zero-Shot Anomaly Detection
by: Zhao, Xinyu, et al.
Published: (2026)