:: Library Catalog

Bibliographic Details
Main Authors:	Dwibedi, Debidatta; Jain, Vidhi; Tompson, Jonathan; Zisserman, Andrew; Aytar, Yusuf
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2403.12026

Similar Items

OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
by: Dwibedi, Debidatta, et al.
Published: (2024)

A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos
by: Dwibedi, Debidatta, et al.
Published: (2024)

Describe Anything: Detailed Localized Image and Video Captioning
by: Lian, Long, et al.
Published: (2025)

Describe Anything Anywhere At Any Moment
by: Gorlo, Nicolas, et al.
Published: (2025)

Learning from One Continuous Video Stream
by: Carreira, João, et al.
Published: (2023)

CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
by: Xing, Long, et al.
Published: (2025)

Describing Images $\textit{Fast and Slow}$: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes
by: Takmaz, Ece, et al.
Published: (2024)

CapsFusion: Rethinking Image-Text Data at Scale
by: Yu, Qiying, et al.
Published: (2023)

PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
by: Ananthram, Amith, et al.
Published: (2025)

ANAVI: Audio Noise Awareness using Visuals of Indoor environments for NAVIgation
by: Jain, Vidhi, et al.
Published: (2024)

ChartCap: Mitigating Hallucination of Dense Chart Captioning
by: Lim, Junyoung, et al.
Published: (2025)

Open-World Object Counting in Videos
by: Amini-Naieni, Niki, et al.
Published: (2025)

CapGeo: A Caption-Assisted Approach to Geometric Reasoning
by: Li, Yuying, et al.
Published: (2025)

Catch Me If You Can Describe Me: Open-Vocabulary Camouflaged Instance Segmentation with Diffusion
by: Vu, Tuan-Anh, et al.
Published: (2023)

RadDiff: Describing Differences in Radiology Image Sets with Natural Language
by: Shen, Xiaoxian, et al.
Published: (2026)

LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
by: Ng, Ho Yin 'Sam', et al.
Published: (2025)

New keypoint-based approach for recognising British Sign Language (BSL) from sequences
by: Deb, Oishi, et al.
Published: (2024)

CLEVRER-Humans: Describing Physical and Causal Events the Human Way
by: Mao, Jiayuan, et al.
Published: (2023)

CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
by: Cheng, Kanzhi, et al.
Published: (2025)

MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
by: Zhang, Jiarui, et al.
Published: (2025)

Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
by: Hsu, Ting-Yao E., et al.
Published: (2025)

Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns
by: Kim, Yunsoo, et al.
Published: (2024)

Bayesian Optimization for Controlled Image Editing via LLMs
by: Cai, Chengkun, et al.
Published: (2025)

LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
by: Hirota, Yusuke, et al.
Published: (2025)

Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers
by: Shihata, Yusuf
Published: (2025)

AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
by: Ahn, Michael, et al.
Published: (2024)

Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
by: Ghosh, Akash, et al.
Published: (2024)

Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
by: Reichman, Benjamin, et al.
Published: (2025)

Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning
by: Huang, Ting-Hao 'Kenneth', et al.
Published: (2025)

Describe Anything in Medical Images
by: Xiao, Xi, et al.
Published: (2025)

Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models
by: Chen, Lifeng, et al.
Published: (2025)

Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
by: Zhao, Xiaohan, et al.
Published: (2026)

DescribeEarth: Describe Anything for Remote Sensing Images
by: Li, Kaiyu, et al.
Published: (2025)

Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
by: Yuan, Qianhao, et al.
Published: (2026)

All in an Aggregated Image for In-Image Learning
by: Wang, Lei, et al.
Published: (2024)

FlexGen: Flexible Multi-View Generation from Text and Image Inputs
by: Xu, Xinli, et al.
Published: (2024)

Segment Anything in Pathology Images with Natural Language
by: Chen, Zhixuan, et al.
Published: (2025)

Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models
by: Wu, Ziyi, et al.
Published: (2024)

M3DR: Towards Universal Multilingual Multimodal Document Retrieval
by: Kolavi, Adithya S, et al.
Published: (2025)

Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation
by: Chen, Wenting, et al.
Published: (2023)