| Main Authors: | Su, Yukun, Cao, Yiwen, Deng, Jingliang, Rao, Fengyun, Wu, Qingyao |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2401.08086 |
Similar Items
ObjEmbed: Towards Universal Multimodal Object Embeddings
by: Fu, Shenghao, et al.
Published: (2026)
WeDetect: Fast Open-Vocabulary Object Detection as Retrieval
by: Fu, Shenghao, et al.
Published: (2025)
HarmonySet: A Comprehensive Dataset for Understanding Video-Music Semantic Alignment and Temporal Synchronization
by: Zhou, Zitang, et al.
Published: (2025)
Video Repurposing from User Generated Content: A Large-scale Dataset and Benchmark
by: Wu, Yongliang, et al.
Published: (2024)
Unleashing Network Potentials for Semantic Scene Completion
by: Wang, Fengyun, et al.
Published: (2024)
SARA: Controllable Makeup Transfer with Spatial Alignment and Region-Adaptive Normalization
by: Zhong, Xiaojing, et al.
Published: (2023)
GPHM: Gaussian Parametric Head Model for Monocular Head Avatar Reconstruction
by: Xu, Yuelang, et al.
Published: (2024)
D-ORCA: Dialogue-Centric Optimization for Robust Audio-Visual Captioning
by: Tang, Changli, et al.
Published: (2026)
WeThink: Toward General-purpose Vision-Language Reasoning via Reinforcement Learning
by: Yang, Jie, et al.
Published: (2025)
Semantic-Enriched Latent Visual Reasoning
by: Xu, Tianrun, et al.
Published: (2026)
MMAR: Towards Lossless Multi-Modal Auto-Regressive Probabilistic Modeling
by: Yang, Jian, et al.
Published: (2024)
Revisiting Video Quality Assessment from the Perspective of Generalization
by: Yue, Xinli, et al.
Published: (2024)
Content and Salient Semantics Collaboration for Cloth-Changing Person Re-Identification
by: Wang, Qizao, et al.
Published: (2024)
Multi-Modal Generative Embedding Model
by: Ma, Feipeng, et al.
Published: (2024)
FlexSelect: Flexible Token Selection for Efficient Long Video Understanding
by: Zhang, Yunzhu, et al.
Published: (2025)
Instruction-Oriented Preference Alignment for Enhancing Multi-Modal Comprehension Capability of MLLMs
by: Wang, Zitian, et al.
Published: (2025)
Video Anomaly Detection with Semantics-Aware Information Bottleneck
by: Li, Juntong, et al.
Published: (2025)
Number it: Temporal Grounding Videos like Flipping Manga
by: Wu, Yongliang, et al.
Published: (2024)
R1-Onevision: Advancing Generalized Multimodal Reasoning through Cross-Modal Formalization
by: Yang, Yi, et al.
Published: (2025)
GenMAC: Compositional Text-to-Video Generation with Multi-Agent Collaboration
by: Huang, Kaiyi, et al.
Published: (2024)
WAVE: Learning Unified & Versatile Audio-Visual Embeddings with Multimodal LLM
by: Tang, Changli, et al.
Published: (2025)
Instruction-augmented Multimodal Alignment for Image-Text and Element Matching
by: Yue, Xinli, et al.
Published: (2025)
OmniPro: A Comprehensive Benchmark for Omni-Proactive Streaming Video Understanding
by: Zhao, Ruixiang, et al.
Published: (2026)
REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization
by: Li, Yong, et al.
Published: (2026)
HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets and CLIP Models
by: Wei, Zhixiang, et al.
Published: (2025)
DepthCropSeg++: Scaling a Crop Segmentation Foundation Model With Depth-Labeled Data
by: Zhang, Jiafei, et al.
Published: (2026)
WeMMU: Enhanced Bridging of Vision-Language Models and Diffusion Models via Noisy Query Tokens
by: Yang, Jian, et al.
Published: (2025)
An Automated Deep Segmentation and Spatial-Statistics Approach for Post-Blast Rock Fragmentation Assessment
by: Yang, Yukun
Published: (2025)
MCCD: Multi-Agent Collaboration-based Compositional Diffusion for Complex Text-to-Image Generation
by: Li, Mingcheng, et al.
Published: (2025)
DreamComposer++: Empowering Diffusion Models with Multi-View Conditions for 3D Content Generation
by: Yang, Yunhan, et al.
Published: (2025)
From Trial to Triumph: Advancing Long Video Understanding via Visual Context Sample Scaling and Self-reward Alignment
by: Suo, Yucheng, et al.
Published: (2025)
GazeGen: Gaze-Driven User Interaction for Visual Content Generation
by: Hsieh, He-Yen, et al.
Published: (2024)
SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
by: Dang, Lingwei, et al.
Published: (2025)
SAIL: Self-Amplified Iterative Learning for Diffusion Model Alignment with Minimal Human Feedback
by: He, Xiaoxuan, et al.
Published: (2026)
TempFlow-GRPO: When Timing Matters for GRPO in Flow Models
by: He, Xiaoxuan, et al.
Published: (2025)
Vision-Language Semantic Grounding for Multi-Domain Crop-Weed Segmentation
by: Hossain, Nazia, et al.
Published: (2026)
Hybrid Classification-Regression Adaptive Loss for Dense Object Detection
by: Huang, Yanquan, et al.
Published: (2024)
Pseudo-Labeling by Multi-Policy Viewfinder Network for Image Cropping
by: Pan, Zhiyu, et al.
Published: (2024)
OmniPart: Part-Aware 3D Generation with Semantic Decoupling and Structural Cohesion
by: Yang, Yunhan, et al.
Published: (2025)
SITSMamba for Crop Classification based on Satellite Image Time Series
by: Qin, Xiaolei, et al.
Published: (2024)