:: Library Catalog

Bibliographic Details
Main Authors:	Qiu, Longtian, Ning, Shan, He, Xuming
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2401.02347

Similar Items

Wiki-R1: Incentivizing Multimodal Reasoning for Knowledge-based VQA via Data and Sampling Curriculum
by: Ning, Shan, et al.
Published: (2026)

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation
by: Qiu, Longtian, et al.
Published: (2025)

WikiCLIP: An Efficient Contrastive Baseline for Open-domain Visual Entity Recognition
by: Ning, Shan, et al.
Published: (2026)

COHERENCE: Benchmarking Fine-Grained Image-Text Alignment in Interleaved Multimodal Contexts
by: Wang, Bingli, et al.
Published: (2026)

Medical Image Synthesis via Fine-Grained Image-Text Alignment and Anatomy-Pathology Prompting
by: Chen, Wenting, et al.
Published: (2024)

FineLIP: Extending CLIP's Reach via Fine-Grained Alignment with Longer Text Inputs
by: Asokan, Mothilal, et al.
Published: (2025)

Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation
by: Chen, Wenting, et al.
Published: (2023)

ITO: Images and Texts as One via Synergizing Multiple Alignment and Training-Time Fusion
by: Liu, Hanpeng, et al.
Published: (2026)

Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters
by: Mohammad, Mohammed Rahman Sherif Khan, et al.
Published: (2026)

Text-only Synthesis for Image Captioning
by: Zhou, Qing, et al.
Published: (2024)

CLIP Architecture for Abdominal CT Image-Text Alignment and Zero-Shot Learning: Investigating Batch Composition and Data Scaling
by: Shivika, et al.
Published: (2026)

LAGO: Language-Guided Adaptive Object-Region Focus for Zero-Shot Visual-Text Alignment
by: Hu, Junyi, et al.
Published: (2026)

Is Your Text-to-Image Model Robust to Caption Noise?
by: Yu, Weichen, et al.
Published: (2024)

Parrot Captions Teach CLIP to Spot Text
by: Lin, Yiqi, et al.
Published: (2023)

Zero-Residual Concept Erasure via Progressive Alignment in Text-to-Image Model
by: Chen, Hongxu, et al.
Published: (2025)

MiSCHiEF: A Benchmark in Minimal-Pairs of Safety and Culture for Holistic Evaluation of Fine-Grained Image-Caption Alignment
by: Banerjee, Sagarika, et al.
Published: (2026)

How to Train your Text-to-Image Model: Evaluating Design Choices for Synthetic Training Captions
by: Brack, Manuel, et al.
Published: (2025)

Image Captions are Natural Prompts for Text-to-Image Models
by: Lei, Shiye, et al.
Published: (2023)

Unconstrained Open Vocabulary Image Classification: Zero-Shot Transfer from Text to Image via CLIP Inversion
by: Allgeuer, Philipp, et al.
Published: (2024)

Text-Only Data Synthesis for Vision Language Model Training
by: Yu, Xiaomin, et al.
Published: (2025)

TOPA: Extending Large Language Models for Video Understanding via Text-Only Pre-Alignment
by: Li, Wei, et al.
Published: (2024)

RAISE: Requirement-Adaptive Evolutionary Refinement for Training-Free Text-to-Image Alignment
by: Jiang, Liyao, et al.
Published: (2026)

TIMA: Text-Image Mutual Awareness for Balancing Zero-Shot Adversarial Robustness and Generalization Ability
by: Ma, Fengji, et al.
Published: (2024)

Towards Fine-Grained Human Motion Video Captioning
by: Song, Guorui, et al.
Published: (2025)

LACON: Training Text-to-Image Model from Uncurated Data
by: Liang, Zhiyang, et al.
Published: (2026)

Pseudo-label Based Domain Adaptation for Zero-Shot Text Steganalysis
by: Luo, Yufei, et al.
Published: (2024)

VersusDebias: Universal Zero-Shot Debiasing for Text-to-Image Models via SLM-Based Prompt Engineering and Generative Adversary
by: Luo, Hanjun, et al.
Published: (2024)

Sketch and Text Synergy: Fusing Structural Contours and Descriptive Attributes for Fine-Grained Image Retrieval
by: Wang, Siyuan, et al.
Published: (2026)

FB-CLIP: Fine-Grained Zero-Shot Anomaly Detection with Foreground-Background Disentanglement
by: Hu, Ming, et al.
Published: (2026)

Re-Thinking the Automatic Evaluation of Image-Text Alignment in Text-to-Image Models
by: Zhang, Huixuan, et al.
Published: (2025)

EIDT-V: Exploiting Intersections in Diffusion Trajectories for Model-Agnostic, Zero-Shot, Training-Free Text-to-Video Generation
by: Jagpal, Diljeet, et al.
Published: (2025)

Sam-Guided Enhanced Fine-Grained Encoding with Mixed Semantic Learning for Medical Image Captioning
by: Zhang, Zhenyu, et al.
Published: (2023)

Region Prompt Tuning: Fine-grained Scene Text Detection Utilizing Region Text Prompt
by: Lin, Xingtao, et al.
Published: (2024)

DanceCrafter: Fine-Grained Text-Driven Controllable Dance Generation via Choreographic Syntax
by: Yuan, Hang, et al.
Published: (2026)

SpatialReward: Verifiable Spatial Reward Modeling for Fine-Grained Spatial Consistency in Text-to-Image Generation
by: Zhou, Sashuai, et al.
Published: (2026)

VideoComp: Advancing Fine-Grained Compositional and Temporal Alignment in Video-Text Models
by: Kim, Dahun, et al.
Published: (2025)

Personalized Safety Alignment for Text-to-Image Diffusion Models
by: Lei, Yu, et al.
Published: (2025)

Instant Preference Alignment for Text-to-Image Diffusion Models
by: Li, Yang, et al.
Published: (2025)

Safety Alignment Backfires: Preventing the Re-emergence of Suppressed Concepts in Fine-tuned Text-to-Image Diffusion Models
by: Kim, Sanghyun, et al.
Published: (2024)

Attributes Grouping and Mining Hashing for Fine-Grained Image Retrieval
by: Lu, Xin, et al.
Published: (2023)