:: Library Catalog

Bibliographic Details
Main Authors:	Sushko, Peter; Bharadwaj, Ayana; Lim, Zhi Yang; Ilin, Vasily; Caffee, Ben; Chen, Dongping; Salehi, Mohammadreza; Hsieh, Cheng-Yu; Krishna, Ranjay
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2502.03629

Similar Items

MultiRef: Controllable Image Generation with Multiple Visual References
by: Chen, Ruoxi, et al.
Published: (2025)

VideoNet: A Large-Scale Dataset for Domain-Specific Action Recognition
by: Yadav, Tanush, et al.
Published: (2026)

Score-based deterministic density sampling
by: Ilin, Vasily, et al.
Published: (2025)

Wait, We Don't Need to "Wait"! Removing Thinking Tokens Improves Reasoning Efficiency
by: Wang, Chenlong, et al.
Published: (2025)

Perception Tokens Enhance Visual Reasoning in Multimodal Language Models
by: Bigverdi, Mahtab, et al.
Published: (2024)

The Hard Positive Truth about Vision-Language Compositionality
by: Kamath, Amita, et al.
Published: (2024)

ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition
by: Salehi, Mohammadreza, et al.
Published: (2024)

One Trajectory, One Token: Grounded Video Tokenization via Panoptic Sub-object Trajectory
by: Zheng, Chenhao, et al.
Published: (2025)

The Chronicles of RiDiC: Generating Datasets with Controlled Popularity Distribution for Long-form Factuality Evaluation
by: Braslavski, Pavel, et al.
Published: (2026)

Interleaved Scene Graphs for Interleaved Text-and-Image Generation Assessment
by: Chen, Dongping, et al.
Published: (2024)

Is C4 Dataset Optimal for Pruning? An Investigation of Calibration Data for LLM Pruning
by: Bandari, Abhinav, et al.
Published: (2024)

Lookback Lens: Detecting and Mitigating Contextual Hallucinations in Large Language Models Using Only Attention Maps
by: Chuang, Yung-Sung, et al.
Published: (2024)

The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
by: Geng, Scott, et al.
Published: (2024)

Reinforced Visual Perception with Tools
by: Zhou, Zetong, et al.
Published: (2025)

Seeking and Updating with Live Visual Knowledge
by: Fu, Mingyang, et al.
Published: (2025)

OmniView: An All-Seeing Diffusion Model for 3D and 4D View Synthesis
by: Fan, Xiang, et al.
Published: (2025)

Iterated Learning Improves Compositionality in Large Vision-Language Models
by: Zheng, Chenhao, et al.
Published: (2024)

Semantic and Expressive Variation in Image Captions Across Languages
by: Ye, Andre, et al.
Published: (2023)

Agonistic Image Generation: Unsettling the Hegemony of Intention
by: Shaw, Andrew, et al.
Published: (2025)

FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
by: Hsieh, Cheng-Yu, et al.
Published: (2025)

Ablate-to-Validate: Are Vision-Language Models Really Using Continuous Thought Tokens?
by: Zhang, Tianyi, et al.
Published: (2026)

PulseReddit: A Novel Reddit Dataset for Benchmarking MAS in High-Frequency Cryptocurrency Trading
by: Han, Qiuhan, et al.
Published: (2025)

SPARO: Selective Attention for Robust and Compositional Transformer Encodings for Vision
by: Vani, Ankit, et al.
Published: (2024)

MolmoPoint: Better Pointing for VLMs with Grounding Tokens
by: Clark, Christopher, et al.
Published: (2026)

EditSleuth: A Dataset of Grounded Reasoning Chains for Image-Edit Forensics
by: Nguyen, Van-Loc, et al.
Published: (2026)

TECCI: Tricky Edits of Collected and Curated Images
by: Agrawal, Aishwarya, et al.
Published: (2026)

GalaxyEdit: Large-Scale Image Editing Dataset with Enhanced Diffusion Adapter
by: Bala, Aniruddha, et al.
Published: (2024)

From Reddit to Generative AI: Evaluating Large Language Models for Anxiety Support Fine-tuned on Social Media Data
by: Kursuncu, Ugur, et al.
Published: (2025)

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
by: Fan, Xiang, et al.
Published: (2024)

Dynamic Template Selection for Output Token Generation Optimization: MLP-Based and Transformer Approaches
by: Yadavalli, Bharadwaj
Published: (2025)

m&m's: A Benchmark to Evaluate Tool-Use for multi-step multi-modal Tasks
by: Ma, Zixian, et al.
Published: (2024)

FunEditor: Achieving Complex Image Edits via Function Aggregation with Diffusion Models
by: Samadi, Mohammadreza, et al.
Published: (2024)

ImgEdit: A Unified Image Editing Dataset and Benchmark
by: Ye, Yang, et al.
Published: (2025)

DiT4Edit: Diffusion Transformer for Image Editing
by: Feng, Kunyu, et al.
Published: (2024)

TWeddit: A Dataset of Triggering Stories Predominantly Shared by Women on Reddit
by: Bandela, Shirlene Rose, et al.
Published: (2026)

In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
by: Zhang, Zechuan, et al.
Published: (2025)

Throwaway Accounts and Moderation on Reddit
by: Guo, Cheng, et al.
Published: (2025)

ImageInWords: Unlocking Hyper-Detailed Image Descriptions
by: Garg, Roopal, et al.
Published: (2024)

The Moral Foundations Reddit Corpus
by: Trager, Jackson, et al.
Published: (2022)

GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
by: Kamath, Amita, et al.
Published: (2025)