| Main Authors: | Qiu, Longtian; Ning, Shan; He, Xuming |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2401.02347 |
Similar Items
Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
by: Ning, Shan, et al.
Published: (2026)
NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
by: Qiu, Longtian, et al.
Published: (2025)
WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
by: Ning, Shan, et al.
Published: (2026)
COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
by: Wang, Bingli, et al.
Published: (2026)
Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting
by: Chen, Wenting, et al.
Published: (2024)
FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs
by: Asokan, Mothilal, et al.
Published: (2025)
Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation
by: Chen, Wenting, et al.
Published: (2023)
ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
by: Liu, Hanpeng, et al.
Published: (2026)
Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
by: Mohammad, Mohammed Rahman Sherif Khan, et al.
Published: (2026)
Text-only Synthesis for Image Captioning
by: Zhou, Qing, et al.
Published: (2024)
CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
by: Shivika, et al.
Published: (2026)
LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment
by: Hu, Junyi, et al.
Published: (2026)
Is Your Text-to-Image Model Robust to Caption Noise?
by: Yu, Weichen, et al.
Published: (2024)
Parrot Captions Teach CLIP to Spot Text
by: Lin, Yiqi, et al.
Published: (2023)
Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model
by: Chen, Hongxu, et al.
Published: (2025)
MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment
by: Banerjee, Sagarika, et al.
Published: (2026)
How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
by: Brack, Manuel, et al.
Published: (2025)
Image Captions are Natural Prompts for Text-to-Image Models
by: Lei, Shiye, et al.
Published: (2023)
Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion
by: Allgeuer, Philipp, et al.
Published: (2024)
Text-Only Data Synthesis for Vision Language Model Training
by: Yu, Xiaomin, et al.
Published: (2025)
TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment
by: Li, Wei, et al.
Published: (2024)
RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
by: Jiang, Liyao, et al.
Published: (2026)
TIMA: Text-Image Mutual Awareness for Balancing Zero-Shot Adversarial Robustness and Generalization Ability
by: Ma, Fengji, et al.
Published: (2024)
Towards Fine-Grained Human Motion Video Captioning
by: Song, Guorui, et al.
Published: (2025)
LACON: Training Text-to-Image Model from Uncurated Data
by: Liang, Zhiyang, et al.
Published: (2026)
Pseudo-label Based Domain Adaptation for Zero-Shot Text Steganalysis
by: Luo, Yufei, et al.
Published: (2024)
VersusDebias: Universal Zero-Shot Debiasing for Text-to-Image Models via SLM-Based Prompt Engineering and Generative Adversary
by: Luo, Hanjun, et al.
Published: (2024)
Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval
by: Wang, Siyuan, et al.
Published: (2026)
FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
by: Hu, Ming, et al.
Published: (2026)
Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models
by: Zhang, Huixuan, et al.
Published: (2025)
EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
by: Jagpal, Diljeet, et al.
Published: (2025)
Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning
by: Zhang, Zhenyu, et al.
Published: (2023)
Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt
by: Lin, Xingtao, et al.
Published: (2024)
DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax
by: Yuan, Hang, et al.
Published: (2026)
SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
by: Zhou, Sashuai, et al.
Published: (2026)
VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
by: Kim, Dahun, et al.
Published: (2025)
Personalized Safety Alignment for Text-to-Image Diffusion Models
by: Lei, Yu, et al.
Published: (2025)
Instant Preference Alignment for Text-to-Image Diffusion Models
by: Li, Yang, et al.
Published: (2025)
Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models
by: Kim, Sanghyun, et al.
Published: (2024)
Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval
by: Lu, Xin, et al.
Published: (2023)