:: Library Catalog

Bibliographic Details
Main Authors:	Ataallah, Kirolos; Shen, Xiaoqian; Abdelrahman, Eslam; Sleiman, Essam; Zhu, Deyao; Ding, Jian; Elhoseiny, Mohamed
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2404.03413

Similar Items

Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
by: Ataallah, Kirolos, et al.
Published: (2024)

InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
by: Ataallah, Kirolos, et al.
Published: (2024)

MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis
by: Alkhaldi, Asma, et al.
Published: (2024)

M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
by: Han, Seung Hun, et al.
Published: (2026)

StoryGPT-V: Large Language Models as Consistent Story Visualizers
by: Shen, Xiaoqian, et al.
Published: (2023)

iMotion-LLM: Instruction-Conditioned Trajectory Generation
by: Felemban, Abdulwahab, et al.
Published: (2024)

MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4
by: Azizi, Vahid, et al.
Published: (2024)

MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
by: Zheng, Kaizhi, et al.
Published: (2023)

Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
by: Shen, Xiaoqian, et al.
Published: (2025)

MiniGPT: Rebuilding GPT from First Principles
by: Joseph, Jibin
Published: (2026)

CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
by: Abdelrahman, Eslam, et al.
Published: (2023)

Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description
by: Ahmed, Mahmoud, et al.
Published: (2024)

MiniGPT-Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection
by: Moglia, Andrea, et al.
Published: (2024)

Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
by: Shen, Xiaoqian, et al.
Published: (2025)

VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding
by: Li, Xiang, et al.
Published: (2024)

ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge
by: Abdelrahman, Eslam, et al.
Published: (2023)

Mobile-VideoGPT: Fast and Accurate Model for Mobile Video Understanding
by: Shaker, Abdelrahman, et al.
Published: (2025)

Time Blindness: Why Video-Language Models Can't See What Humans Can?
by: Upadhyay, Ujjwal, et al.
Published: (2025)

STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
by: Jiang, Jindong, et al.
Published: (2025)

Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
by: Haydarov, Kilichbek, et al.
Published: (2023)

FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology
by: Khan, Faizan Farooq, et al.
Published: (2025)

MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors
by: Tang, Yuan, et al.
Published: (2024)

Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)

TapToTab : Video-Based Guitar Tabs Generation using AI and Audio Analysis
by: Ghaleb, Ali, et al.
Published: (2024)

Principles of Visual Tokens for Efficient Video Understanding
by: Hao, Xinyue, et al.
Published: (2024)

VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)

LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
by: Shen, Xiaoqian, et al.
Published: (2024)

Progressive trends in prenatal genetic screening
by: Kirolos Eskandar
Published: (2022)

Liquid biopsy in genitourinary oncology: Current clinical applications and future prospects across prostate, bladder, and renal cancers
by: Kirolos Eskandar
Published: (2025)

Bioimpressão no Transplante de Órgãos: Dos modelos Experimentais às Perspectivas Clínicas
by: Kirolos Eskandar
Published: (2025)

REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
by: Li, Jiaze, et al.
Published: (2025)

How Well Can Vision Language Models See Image Details?
by: Gou, Chenhui, et al.
Published: (2024)

The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation
by: Jung, Hoin, et al.
Published: (2026)

VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding
by: Kim, Younggun, et al.
Published: (2025)

Mixup Helps Understanding Multimodal Video Better
by: Ma, Xiaoyu, et al.
Published: (2025)

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
by: Zhang, Shaolei, et al.
Published: (2025)

MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
by: Chowdhury, Sanjoy, et al.
Published: (2025)

Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)

LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)

Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
by: Kim, Kibum, et al.
Published: (2026)