:: Library Catalog

Bibliographic Details
Main Authors:	Nakada, Hyakka; Kubota, Marika
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2511.22499

Similar Items

Robustness of Structured Data Extraction from Perspectively Distorted Documents
by: Nakada, Hyakka, et al.
Published: (2025)

Centered Masking for Language-Image Pre-Training
by: Liang, Mingliang, et al.
Published: (2024)

BodyShapeGPT: SMPL Body Shape Manipulation with LLMs
by: Árbol, Baldomero R., et al.
Published: (2024)

Model Interpretability and Rationale Extraction by Input Mask Optimization
by: Brinner, Marc, et al.
Published: (2025)

FisherMask: Enhancing Neural Network Labeling Efficiency in Image Classification Using Fisher Information
by: Gul, Shreen, et al.
Published: (2024)

Mostly Text, Smart Visuals: Asymmetric Text-Visual Pruning for Large Vision-Language Models
by: Li, Sijie, et al.
Published: (2026)

Reverse Stable Diffusion: What prompt was used to generate this image?
by: Croitoru, Florinel-Alin, et al.
Published: (2023)

Text-centric Alignment for Multi-Modality Learning
by: Tsai, Yun-Da, et al.
Published: (2024)

Evaluating Numerical Reasoning in Text-to-Image Models
by: Kajić, Ivana, et al.
Published: (2024)

Highlighting What Matters: Promptable Embeddings for Attribute-Focused Image Retrieval
by: Li, Siting, et al.
Published: (2025)

MAGIC: Near-Optimal Data Attribution for Deep Learning
by: Ilyas, Andrew, et al.
Published: (2025)

HATFormer: Historic Handwritten Arabic Text Recognition with Transformers
by: Chan, Adrian, et al.
Published: (2024)

Alt-Text with Context: Improving Accessibility for Images on Twitter
by: Srivatsan, Nikita, et al.
Published: (2023)

Efficient Scaling of Diffusion Transformers for Text-to-Image Generation
by: Li, Hao, et al.
Published: (2024)

Lumos: Empowering Multimodal LLMs with Scene Text Recognition
by: Shenoy, Ashish, et al.
Published: (2024)

Whats in a Video: Factorized Autoregressive Decoding for Online Dense Video Captioning
by: Piergiovanni, AJ, et al.
Published: (2024)

VPO: Aligning Text-to-Video Generation Models with Prompt Optimization
by: Cheng, Jiale, et al.
Published: (2025)

BiasConnect: Investigating Bias Interactions in Text-to-Image Models
by: Shukla, Pushkar, et al.
Published: (2025)

Glyph: Scaling Context Windows via Visual-Text Compression
by: Cheng, Jiale, et al.
Published: (2025)

T-MARS: Improving Visual Representations by Circumventing Text Feature Learning
by: Maini, Pratyush, et al.
Published: (2023)

Text-to-Image Cross-Modal Generation: A Systematic Review
by: Żelaszczyk, Maciej, et al.
Published: (2024)

Semantic Token Reweighting for Interpretable and Controllable Text Embeddings in CLIP
by: Kim, Eunji, et al.
Published: (2024)

BAD: Bidirectional Auto-regressive Diffusion for Text-to-Motion Generation
by: Hosseyni, S. Rohollah, et al.
Published: (2024)

Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
by: Yin, Shukang, et al.
Published: (2024)

Advanced Multimodal Deep Learning Architecture for Image-Text Matching
by: Wang, Jinyin, et al.
Published: (2024)

Text Role Classification in Scientific Charts Using Multimodal Transformers
by: Kim, Hye Jin, et al.
Published: (2024)

DreamReward: Text-to-3D Generation with Human Preference
by: Ye, Junliang, et al.
Published: (2024)

What If the TV Was Off? Examining Counterfactual Reasoning Abilities of Multi-modal Language Models
by: Zhang, Letian, et al.
Published: (2023)

EVE: Efficient Vision-Language Pre-training with Masked Prediction and Modality-Aware MoE
by: Chen, Junyi, et al.
Published: (2023)

FairImagen: Post-Processing for Bias Mitigation in Text-to-Image Models
by: Fu, Zihao, et al.
Published: (2025)

Text-Enhanced Data-free Approach for Federated Class-Incremental Learning
by: Tran, Minh-Tuan, et al.
Published: (2024)

T$^3$Bench: Benchmarking Current Progress in Text-to-3D Generation
by: He, Yuze, et al.
Published: (2023)

Quilt-1M: One Million Image-Text Pairs for Histopathology
by: Ikezogwo, Wisdom Oluchi, et al.
Published: (2023)

MulTaBench: Benchmarking Multimodal Tabular Learning with Text and Image
by: Arazi, Alan, et al.
Published: (2026)

A Framework For Refining Text Classification and Object Recognition from Academic Articles
by: Li, Jinghong, et al.
Published: (2023)

MoTe: Learning Motion-Text Diffusion Model for Multiple Generation Tasks
by: Wu, Yiming, et al.
Published: (2024)

PromptTA: Prompt-driven Text Adapter for Source-free Domain Generalization
by: Zhang, Haoran, et al.
Published: (2024)

BiVLC: Extending Vision-Language Compositionality Evaluation with Text-to-Image Retrieval
by: Miranda, Imanol, et al.
Published: (2024)

Ethical-Lens: Curbing Malicious Usages of Open-Source Text-to-Image Models
by: Cai, Yuzhu, et al.
Published: (2024)

ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
by: Patel, Maitreya, et al.
Published: (2023)