| Main Authors: | Dwibedi, Debidatta, Jain, Vidhi, Tompson, Jonathan, Zisserman, Andrew, Aytar, Yusuf |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.12026 |
Similar Items
OVR: A Dataset for Open Vocabulary Temporal Repetition Counting in Videos
by: Dwibedi, Debidatta, et al.
Published: (2024)
A Short Note on Evaluating RepNet for Temporal Repetition Counting in Videos
by: Dwibedi, Debidatta, et al.
Published: (2024)
Describe Anything: Detailed Localized Image and Video Captioning
by: Lian, Long, et al.
Published: (2025)
Describe Anything Anywhere At Any Moment
by: Gorlo, Nicolas, et al.
Published: (2025)
Learning from One Continuous Video Stream
by: Carreira, João, et al.
Published: (2023)
CapRL: Stimulating Dense Image Caption Capabilities via Reinforcement Learning
by: Xing, Long, et al.
Published: (2025)
Describing Images $\textit{Fast and Slow}$: Quantifying and Predicting the Variation in Human Signals during Visuo-Linguistic Processes
by: Takmaz, Ece, et al.
Published: (2024)
CapsFusion: Rethinking Image-Text Data at Scale
by: Yu, Qiying, et al.
Published: (2023)
PoSh: Using Scene Graphs To Guide LLMs-as-a-Judge For Detailed Image Descriptions
by: Ananthram, Amith, et al.
Published: (2025)
ANAVI: Audio Noise Awareness using Visuals of Indoor environments for NAVIgation
by: Jain, Vidhi, et al.
Published: (2024)
ChartCap: Mitigating Hallucination of Dense Chart Captioning
by: Lim, Junyoung, et al.
Published: (2025)
Open-World Object Counting in Videos
by: Amini-Naieni, Niki, et al.
Published: (2025)
CapGeo: A Caption-Assisted Approach to Geometric Reasoning
by: Li, Yuying, et al.
Published: (2025)
Catch Me If You Can Describe Me: Open-Vocabulary Camouflaged Instance Segmentation with Diffusion
by: Vu, Tuan-Anh, et al.
Published: (2023)
RadDiff: Describing Differences in Radiology Image Sets with Natural Language
by: Shen, Xiaoxian, et al.
Published: (2026)
LaMP-Cap: Personalized Figure Caption Generation With Multimodal Figure Profiles
by: Ng, Ho Yin 'Sam', et al.
Published: (2025)
New keypoint-based approach for recognising British Sign Language (BSL) from sequences
by: Deb, Oishi, et al.
Published: (2024)
CLEVRER-Humans: Describing Physical and Causal Events the Human Way
by: Mao, Jiayuan, et al.
Published: (2023)
CapArena: Benchmarking and Analyzing Detailed Image Captioning in the LLM Era
by: Cheng, Kanzhi, et al.
Published: (2025)
MLLMs Know Where to Look: Training-free Perception of Small Visual Details with Multimodal LLMs
by: Zhang, Jiarui, et al.
Published: (2025)
Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
by: Hsu, Ting-Yao E., et al.
Published: (2025)
Enhancing Human-Computer Interaction in Chest X-ray Analysis using Vision and Language Model with Eye Gaze Patterns
by: Kim, Yunsoo, et al.
Published: (2024)
Bayesian Optimization for Controlled Image Editing via LLMs
by: Cai, Chengkun, et al.
Published: (2025)
LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
by: Hirota, Yusuke, et al.
Published: (2025)
Gated Recursive Fusion: A Stateful Approach to Scalable Multimodal Transformers
by: Shihata, Yusuf
Published: (2025)
AutoRT: Embodied Foundation Models for Large Scale Orchestration of Robotic Agents
by: Ahn, Michael, et al.
Published: (2024)
Exploring the Frontier of Vision-Language Models: A Survey of Current Methodologies and Future Directions
by: Ghosh, Akash, et al.
Published: (2024)
Outside Knowledge Conversational Video (OKCV) Dataset -- Dialoguing over Videos
by: Reichman, Benjamin, et al.
Published: (2025)
Five Years of SciCap: What We Learned and Future Directions for Scientific Figure Captioning
by: Huang, Ting-Hao 'Kenneth', et al.
Published: (2025)
Describe Anything in Medical Images
by: Xiao, Xi, et al.
Published: (2025)
Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models
by: Chen, Lifeng, et al.
Published: (2025)
Pushing the Frontier of Black-Box LVLM Attacks via Fine-Grained Detail Targeting
by: Zhao, Xiaohan, et al.
Published: (2026)
DescribeEarth: Describe Anything for Remote Sensing Images
by: Li, Kaiyu, et al.
Published: (2025)
Vision-OPD: Learning to See Fine Details for Multimodal LLMs via On-Policy Self-Distillation
by: Yuan, Qianhao, et al.
Published: (2026)
All in an Aggregated Image for In-Image Learning
by: Wang, Lei, et al.
Published: (2024)
FlexGen: Flexible Multi-View Generation from Text and Image Inputs
by: Xu, Xinli, et al.
Published: (2024)
Segment Anything in Pathology Images with Natural Language
by: Chen, Zhixuan, et al.
Published: (2025)
Neural Assets: 3D-Aware Multi-Object Scene Synthesis with Image Diffusion Models
by: Wu, Ziyi, et al.
Published: (2024)
M3DR: Towards Universal Multilingual Multimodal Document Retrieval
by: Kolavi, Adithya S, et al.
Published: (2025)
Fine-Grained Image-Text Alignment in Medical Imaging Enables Explainable Cyclic Image-Report Generation
by: Chen, Wenting, et al.
Published: (2023)