| Main Authors: | Yao, Xu, Kang, Lei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.03580 |
Similar Items
Evaluating Hallucination in Text-to-Image Diffusion Models with Scene-Graph based Question-Answering Agent
by: Qin, Ziyuan, et al.
Published: (2024)
Scene-Text Grounding for Text-Based Video Question Answering
by: Zhou, Sheng, et al.
Published: (2024)
Asking Multimodal Clarifying Questions in Mixed-Initiative Conversational Search
by: Yuan, Yifei, et al.
Published: (2024)
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
by: Zhou, Sheng, et al.
Published: (2025)
LOVA3: Learning to Visual Question Answering, Asking and Assessment
by: Zhao, Henry Hengyuan, et al.
Published: (2024)
VL-Reader: Vision and Language Reconstructor is an Effective Scene Text Recognizer
by: Zhong, Humen, et al.
Published: (2024)
Improving Data Augmentation for Robust Visual Question Answering with Effective Curriculum Learning
by: Zheng, Yuhang, et al.
Published: (2024)
Reading in the Dark: Low-light Scene Text Recognition
by: Fu, Xuanshuo, et al.
Published: (2026)
Text2Traffic: A Text-to-Image Generation and Editing Method for Traffic Scenes
by: Lv, Feng, et al.
Published: (2025)
A Text-Image Fusion Method with Data Augmentation Capabilities for Referring Medical Image Segmentation
by: Chai, Shurong, et al.
Published: (2025)
Select-Mosaic: Data Augmentation Method for Dense Small Object Scenes
by: Zhang, Hao, et al.
Published: (2024)
Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions
by: Acuna, David, et al.
Published: (2025)
The Path to Reconciling Quality and Safety in Text-to-Image Generation: Dataset, Method, and Evaluation
by: Ruan, Shouwei, et al.
Published: (2025)
Ask2Loc: Learning to Locate Instructional Visual Answers by Asking Questions
by: Zong, Chang, et al.
Published: (2025)
ViConsFormer: Constituting Meaningful Phrases of Scene Texts using Transformer-based Method in Vietnamese Text-based Visual Question Answering
by: Nguyen, Nghia Hieu, et al.
Published: (2024)
FLUX-Text: A Simple and Advanced Diffusion Transformer Baseline for Scene Text Editing
by: Lan, Rui, et al.
Published: (2025)
Sample-aware RandAugment: Search-free Automatic Data Augmentation for Effective Image Recognition
by: Xiao, Anqi, et al.
Published: (2025)
Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
by: Chen, Dongping, et al.
Published: (2024)
A Simple Data Augmentation Strategy for Text-in-Image Scientific VQA
by: Shoer, Belal, et al.
Published: (2025)
Dataset and Benchmark for Urdu Natural Scenes Text Detection, Recognition and Visual Question Answering
by: Maryam, Hiba, et al.
Published: (2024)
Relational Contrastive Learning and Masked Image Modeling for Scene Text Recognition
by: Lin, Tiancheng, et al.
Published: (2024)
Scene-Action Prompt Fusion for Coherent Text-to-Video Storytelling
by: Kang, Taewon, et al.
Published: (2025)
On the Effectiveness of Methods and Metrics for Explainable AI in Remote Sensing Image Scene Classification
by: Klotz, Jonas, et al.
Published: (2025)
GenMix: Effective Data Augmentation with Generative Diffusion Model Image Editing
by: Islam, Khawar, et al.
Published: (2024)
Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation
by: Yin, Shukang, et al.
Published: (2024)
TextSSR: Diffusion-based Data Synthesis for Scene Text Recognition
by: Ye, Xingsong, et al.
Published: (2024)
Text-Pass Filter: An Efficient Scene Text Detector
by: Yang, Chuang, et al.
Published: (2026)
PAT3D: Physics-Augmented Text-to-3D Scene Generation
by: Lin, Guying, et al.
Published: (2025)
Semantic Data Augmentation Enhanced Invariant Risk Minimization for Medical Image Domain Generalization
by: Zhu, Yaoyao, et al.
Published: (2025)
TextSculptor: Training and Benchmarking Scene Text Editing
by: Lin, Yiheng, et al.
Published: (2026)
Leveraging Text Localization for Scene Text Removal via Text-aware Masked Image Modeling
by: Wang, Zixiao, et al.
Published: (2024)
OmniScene: Attention-Augmented Multimodal 4D Scene Understanding for Autonomous Driving
by: Liu, Pei, et al.
Published: (2025)
Global-Local Aware Scene Text Editing
by: Yang, Fuxiang, et al.
Published: (2025)
Not Just Pretty Pictures: Toward Interventional Data Augmentation Using Text-to-Image Generators
by: Yuan, Jianhao, et al.
Published: (2022)
Multimodal Large Language Models for Image, Text, and Speech Data Augmentation: A Survey
by: Sapkota, Ranjan, et al.
Published: (2025)
Setting the Stage: Text-Driven Scene-Consistent Image Generation
by: Xie, Cong, et al.
Published: (2025)
Layout Agnostic Scene Text Image Synthesis with Diffusion Models
by: Zhangli, Qilong, et al.
Published: (2024)
LEGO: Self-Supervised Representation Learning for Scene Text Images
by: Ren, Yujin, et al.
Published: (2024)
TextBoost: Boosting Scene Text Fidelity in Ultra-low Bitrate Image Compression
by: Wang, Bingxin, et al.
Published: (2026)
TextVidBench: A Benchmark for Long Video Scene Text Understanding
by: Zhong, Yangyang, et al.
Published: (2025)