:: Library Catalog

Bibliographic Details
Main Authors:	Chatterjee, Agneet, Stan, Gabriela Ben Melech, Aflalo, Estelle, Paul, Sayak, Ghosh, Dhruba, Gokhale, Tejas, Schmidt, Ludwig, Hajishirzi, Hannaneh, Lal, Vasudev, Baral, Chitta, Yang, Yezhou
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.01197

Similar Items

On the Robustness of Language Guidance for Low-Level Vision Tasks: Findings from Depth Estimation
by: Chatterjee, Agneet, et al.
Published: (2024)

REVISION: Rendering Tools Enable Spatial Fidelity in Vision-Language Models
by: Chatterjee, Agneet, et al.
Published: (2024)

Learning from Reasoning Failures via Synthetic Data Generation
by: Stan, Gabriela Ben Melech, et al.
Published: (2025)

FastRM: An efficient and automatic explainability framework for multimodal generative models
by: Stan, Gabriela Ben-Melech, et al.
Published: (2024)

FiVL: A Framework for Improved Vision-Language Alignment through the Lens of Training, Evaluation and Explainability
by: Aflalo, Estelle, et al.
Published: (2024)

ConceptBed: Evaluating Concept Learning Abilities of Text-to-Image Diffusion Models
by: Patel, Maitreya, et al.
Published: (2023)

AcT2I: Evaluating and Improving Action Depiction in Text-to-Image Models
by: Malaviya, Vatsal, et al.
Published: (2025)

Investigating VLM Hallucination from a Cognitive Psychology Perspective: A First Step Toward Interpretation with Intriguing Observations
by: Liu, Xiangrui, et al.
Published: (2025)

TextInVision: Text and Prompt Complexity Driven Visual Text Generation Benchmark
by: Fallah, Forouzan, et al.
Published: (2025)

Dual Caption Preference Optimization for Diffusion Models
by: Saeidi, Amir, et al.
Published: (2025)

Chimera: Compositional Image Generation using Part-based Concepting
by: Singh, Shivam, et al.
Published: (2025)

Grounding Stylistic Domain Generalization with Quantitative Domain Shift Measures and Synthetic Scene Images
by: Luo, Yiran, et al.
Published: (2024)

LVLM-Interpret: An Interpretability Tool for Large Vision-Language Models
by: Stan, Gabriela Ben Melech, et al.
Published: (2024)

TripletCLIP: Improving Compositional Reasoning of CLIP via Synthetic Vision-Language Negatives
by: Patel, Maitreya, et al.
Published: (2024)

VOILA: Evaluation of MLLMs For Perceptual Understanding and Analogical Reasoning
by: Yilmaz, Nilay, et al.
Published: (2025)

A Causal World Model Underlying Next Token Prediction: Exploring GPT in a Controlled Environment
by: Rohekar, Raanan Y., et al.
Published: (2024)

Stable Cinemetrics: Structured Taxonomy and Evaluation for Professional Video Generation
by: Chatterjee, Agneet, et al.
Published: (2025)

ActionCOMET: A Zero-shot Approach to Learn Image-specific Commonsense Concepts about Actions
by: Sampat, Shailaja Keyur, et al.
Published: (2024)

Investigating and Addressing Hallucinations of LLMs in Tasks Involving Negation
by: Varshney, Neeraj, et al.
Published: (2024)

λ-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
by: Patel, Maitreya, et al.
Published: (2024)

Debias your Large Multi-Modal Model at Test-Time via Non-Contrastive Visual Attribute Steering
by: Ratzlaff, Neale, et al.
Published: (2024)

Help Me Identify: Is an LLM+VQA System All We Need to Identify Visual Concepts?
by: Sampat, Shailaja Keyur, et al.
Published: (2024)

Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
by: Ghosh, Dhruba, et al.
Published: (2026)

RefEdit: A Benchmark and Method for Improving Instruction-based Image Editing Model on Referring Expressions
by: Pathiraja, Bimsara, et al.
Published: (2025)

L-MAGIC: Language Model Assisted Generation of Images with Coherence
by: Cai, Zhipeng, et al.
Published: (2024)

Rethinking Information Synthesis in Multimodal Question Answering: A Multi-Agent Perspective
by: Rajput, Krishna Singh, et al.
Published: (2025)

Harnessing Synthetic Preference Data for Enhancing Temporal Understanding of Video-LLMs
by: Vani, Sameep, et al.
Published: (2025)

EraseFlow: Learning Concept Erasure Policies via GFlowNet-Driven Alignment
by: Kusumba, Abhiram, et al.
Published: (2025)

APT: Adaptive Pruning and Tuning Pretrained Language Models for Efficient Training and Inference
by: Zhao, Bowen, et al.
Published: (2024)

Lost in Translation? Translation Errors and Challenges for Fair Assessment of Text-to-Image Models on Multilingual Concepts
by: Saxon, Michael, et al.
Published: (2024)

The Perceptual Observatory: Characterizing Robustness and Grounding in MLLMs
by: Anvekar, Tejas, et al.
Published: (2025)

ViTaB-A: Evaluating Multimodal Large Language Models on Visual Table Attribution
by: Alqurnawi, Yahia, et al.
Published: (2026)

Improving Shift Invariance in Convolutional Neural Networks with Translation Invariant Polyphase Sampling
by: Saha, Sourajit, et al.
Published: (2024)

Data or Language Supervision: What Makes CLIP Better than DINO?
by: Liu, Yiming, et al.
Published: (2025)

VL-GLUE: A Suite of Fundamental yet Challenging Visuo-Linguistic Reasoning Tasks
by: Sampat, Shailaja Keyur, et al.
Published: (2024)

The EarlyBird Gets the WORM: Heuristically Accelerating EarlyBird Convergence
by: Vasudev, Adithya
Published: (2024)

ICSVR: Investigating Compositional and Syntactic Understanding in Video Retrieval Models
by: Madasu, Avinash, et al.
Published: (2023)

Critical Batch Size Revisited: A Simple Empirical Approach to Large-Batch Language Model Training
by: Merrill, William, et al.
Published: (2025)

Husky: A Unified, Open-Source Language Agent for Multi-Step Reasoning
by: Kim, Joongwon, et al.
Published: (2024)

ScienceMeter: Tracking Scientific Knowledge Updates in Language Models
by: Wang, Yike, et al.
Published: (2025)