:: Library Catalog

Bibliographic Details
Main Authors:	Su, Yukun; Cao, Yiwen; Deng, Jingliang; Rao, Fengyun; Wu, Qingyao
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2401.08086

Similar Items

ObjEmbed: Towards Universal Multimodal Object Embeddings
by: Fu, Shenghao, et al.
Published: (2026)

WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
by: Fu, Shenghao, et al.
Published: (2025)

HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
by: Zhou, Zitang, et al.
Published: (2025)

Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark
by: Wu, Yongliang, et al.
Published: (2024)

Unleashing Network Potentials for Semantic Scene Completion
by: Wang, Fengyun, et al.
Published: (2024)

SARA: Controllable Makeup Transfer with Spatial Alignment and Region-Adaptive Normalization
by: Zhong, Xiaojing, et al.
Published: (2023)

GPHM: Gaussian Parametric Head Model for Monocular Head Avatar Reconstruction
by: Xu, Yuelang, et al.
Published: (2024)

D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
by: Tang, Changli, et al.
Published: (2026)

WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
by: Yang, Jie, et al.
Published: (2025)

Semantic-Enriched Latent Visual Reasoning
by: Xu, Tianrun, et al.
Published: (2026)

MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
by: Yang, Jian, et al.
Published: (2024)

Revisiting Video Quality Assessment from the Perspective of Generalization
by: Yue, Xinli, et al.
Published: (2024)

Content and Salient Semantics Collaboration for Cloth-Changing Person Re-Identification
by: Wang, Qizao, et al.
Published: (2024)

Multi-Modal Generative Embedding Model
by: Ma, Feipeng, et al.
Published: (2024)

FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
by: Zhang, Yunzhu, et al.
Published: (2025)

Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
by: Wang, Zitian, et al.
Published: (2025)

Video Anomaly Detection with Semantics-Aware Information Bottleneck
by: Li, Juntong, et al.
Published: (2025)

Number it: Temporal Grounding Videos like Flipping Manga
by: Wu, Yongliang, et al.
Published: (2024)

R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
by: Yang, Yi, et al.
Published: (2025)

GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
by: Huang, Kaiyi, et al.
Published: (2024)

WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
by: Tang, Changli, et al.
Published: (2025)

Instruction-augmented Multimodal Alignment for Image-Text and Element Matching
by: Yue, Xinli, et al.
Published: (2025)

OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
by: Zhao, Ruixiang, et al.
Published: (2026)

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization
by: Li, Yong, et al.
Published: (2026)

HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
by: Wei, Zhixiang, et al.
Published: (2025)

DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data
by: Zhang, Jiafei, et al.
Published: (2026)

WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
by: Yang, Jian, et al.
Published: (2025)

An Automated Deep Segmentation and Spatial-Statistics Approach for Post-Blast Rock Fragmentation Assessment
by: Yang, Yukun
Published: (2025)

MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
by: Li, Mingcheng, et al.
Published: (2025)

DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation
by: Yang, Yunhan, et al.
Published: (2025)

From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
by: Suo, Yucheng, et al.
Published: (2025)

GazeGen: Gaze-Driven User Interaction for Visual Content Generation
by: Hsieh, He-Yen, et al.
Published: (2024)

SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
by: Dang, Lingwei, et al.
Published: (2025)

SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
by: He, Xiaoxuan, et al.
Published: (2026)

TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
by: He, Xiaoxuan, et al.
Published: (2025)

Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation
by: Hossain, Nazia, et al.
Published: (2026)

Hybrid Classification-Regression Adaptive Loss for Dense Object Detection
by: Huang, Yanquan, et al.
Published: (2024)

Pseudo-Labeling by Multi-Policy Viewfinder Network for Image Cropping
by: Pan, Zhiyu, et al.
Published: (2024)

OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
by: Yang, Yunhan, et al.
Published: (2025)

SITSMamba for Crop Classification based on Satellite Image Time Series
by: Qin, Xiaolei, et al.
Published: (2024)