| Main Authors: | Zhang, Suoxiang; Li, Xiaxi; Chang, Hongrui; Hou, Zhuoyan; Wu, Guoxin; Ji, Ronghua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2507.05621 |
Similar Items
HiGS: Hierarchical Generative Scene Framework for Multi-Step Associative Semantic Spatial Composition
by: Hong, Jiacheng, et al.
Published: (2025)
Semantically Consistent Person Image Generation
by: Roy, Prasun, et al.
Published: (2023)
PanoGen++: Domain-Adapted Text-Guided Panoramic Environment Generation for Vision-and-Language Navigation
by: Wang, Sen, et al.
Published: (2025)
Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization
by: Liu, Ke, et al.
Published: (2025)
Visual Semantic Description Generation with MLLMs for Image-Text Matching
by: Chen, Junyu, et al.
Published: (2025)
LipGen: Viseme-Guided Lip Video Generation for Enhancing Visual Speech Recognition
by: Hao, Bowen, et al.
Published: (2025)
CIV-DG: Conditional Instrumental Variables for Domain Generalization in Medical Imaging
by: Bai, Shaojin, et al.
Published: (2026)
SafePaint: Anti-forensic Image Inpainting with Domain Adaptation
by: Chen, Dunyun, et al.
Published: (2024)
PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
by: Zhao, Junchuan, et al.
Published: (2026)
Self-distilled Dynamic Fusion Network for Language-based Fashion Retrieval
by: Wu, Yiming, et al.
Published: (2024)
MaVEn: An Effective Multi-granularity Hybrid Visual Encoding Framework for Multimodal Large Language Model
by: Jiang, Chaoya, et al.
Published: (2024)
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
by: Zhang, Zhenxing, et al.
Published: (2024)
Scene Aware Person Image Generation through Global Contextual Conditioning
by: Roy, Prasun, et al.
Published: (2022)
Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation
by: Zhang, Xuesong, et al.
Published: (2024)
D2SL: Decouple Defogging and Semantic Learning for Foggy Domain-Adaptive Segmentation
by: Sun, Xuan, et al.
Published: (2024)
MAGE: Multimodal Alignment and Generation Enhancement via Bridging Visual and Semantic Spaces
by: E, Shaojun, et al.
Published: (2025)
Towards Real-World Adverse Weather Image Restoration: Enhancing Clearness and Semantics with Vision-Language Models
by: Xu, Jiaqi, et al.
Published: (2024)
Face Consistency Benchmark for GenAI Video
by: Podstawski, Michal, et al.
Published: (2025)
MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals
by: Yu, Lei, et al.
Published: (2024)
Towards Open-Vocabulary Remote Sensing Image Semantic Segmentation
by: Ye, Chengyang, et al.
Published: (2024)
MultiSoundGen: Video-to-Audio Generation for Multi-Event Scenarios via SlowFast Contrastive Audio-Visual Pretraining and Direct Preference Optimization
by: Yang, Jianxuan, et al.
Published: (2025)
Beyond Semantic Search: Towards Referential Anchoring in Composed Image Retrieval
by: Yang, Yuxin, et al.
Published: (2026)
G-Refine: A General Quality Refiner for Text-to-Image Generation
by: Li, Chunyi, et al.
Published: (2024)
Multimodal Large Language Model is a Human-Aligned Annotator for Text-to-Image Generation
by: Wu, Xun, et al.
Published: (2024)
Exploring Phrase-Level Grounding with Text-to-Image Diffusion Model
by: Yang, Danni, et al.
Published: (2024)
Leveraging multimodal explanatory annotations for video interpretation with Modality Specific Dataset
by: Ancarani, Elisa, et al.
Published: (2025)
A Unit Enhancement and Guidance Framework for Audio-Driven Avatar Video Generation
by: Zhou, S. Z., et al.
Published: (2025)
Towards Robust Multimodal Emotion Recognition under Missing Modalities and Distribution Shifts
by: Zhong, Guowei, et al.
Published: (2025)
Enhancing Environmental Monitoring through Multispectral Imaging: The WasteMS Dataset for Semantic Segmentation of Lakeside Waste
by: Zhu, Qinfeng, et al.
Published: (2024)
MotionPro: A Precise Motion Controller for Image-to-Video Generation
by: Zhang, Zhongwei, et al.
Published: (2025)
DIVE: Inverting Conditional Diffusion Models for Discriminative Tasks
by: Li, Yinqi, et al.
Published: (2025)
Bridging the Pose-Semantic Gap: A Cascade Framework for Text-Based Person Anomaly Search
by: Xie, Zequn, et al.
Published: (2026)
ELIQ: A Label-Free Framework for Quality Assessment of Evolving AI-Generated Images
by: Li, Xinyue, et al.
Published: (2026)
G4G: A Generic Framework for High Fidelity Talking Face Generation with Fine-grained Intra-modal Alignment
by: Zhang, Juan, et al.
Published: (2024)
Hierarchical Textual Knowledge for Enhanced Image Clustering
by: Zhong, Yijie, et al.
Published: (2026)
Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation
by: Cao, Pu, et al.
Published: (2023)
MagicAnime: A Hierarchically Annotated, Multimodal and Multitasking Dataset with Benchmarks for Cartoon Animation Generation
by: Xu, Shuolin, et al.
Published: (2025)
Optimized Learned Image Compression for Facial Expression Recognition
by: Li, Xiumei, et al.
Published: (2025)
Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions
by: Zhang, Rui, et al.
Published: (2024)
KAN-Based Fusion of Dual-Domain for Audio-Driven Facial Landmarks Generation
by: Vo-Thanh, Hoang-Son, et al.
Published: (2024)