| Main Authors: | Ataallah, Kirolos, Shen, Xiaoqian, Abdelrahman, Eslam, Sleiman, Essam, Zhu, Deyao, Ding, Jian, Elhoseiny, Mohamed |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2404.03413 |
Similar Items
Goldfish: Vision-Language Understanding of Arbitrarily Long Videos
by: Ataallah, Kirolos, et al.
Published: (2024)
InfiniBench: A Benchmark for Large Multi-Modal Models in Long-Form Movies and TV Shows
by: Ataallah, Kirolos, et al.
Published: (2024)
MiniGPT-Med: Large Language Model as a General Interface for Radiology Diagnosis
by: Alkhaldi, Asma, et al.
Published: (2024)
M-MiniGPT4: Multilingual VLLM Alignment via Translated Data
by: Han, Seung Hun, et al.
Published: (2026)
StoryGPT-V: Large Language Models as Consistent Story Visualizers
by: Shen, Xiaoqian, et al.
Published: (2023)
iMotion-LLM: Instruction-Conditioned Trajectory Generation
by: Felemban, Abdulwahab, et al.
Published: (2024)
MiniGPT-Reverse-Designing: Predicting Image Adjustments Utilizing MiniGPT-4
by: Azizi, Vahid, et al.
Published: (2024)
MiniGPT-5: Interleaved Vision-and-Language Generation via Generative Vokens
by: Zheng, Kaizhi, et al.
Published: (2023)
Vgent: Graph-based Retrieval-Reasoning-Augmented Generation For Long Video Understanding
by: Shen, Xiaoqian, et al.
Published: (2025)
MiniGPT: Rebuilding GPT from First Principles
by: Joseph, Jibin
Published: (2026)
CoT3DRef: Chain-of-Thoughts Data-Efficient 3D Visual Grounding
by: Abdelrahman, Eslam, et al.
Published: (2023)
Kestrel: 3D Multimodal LLM for Part-Aware Grounded Description
by: Ahmed, Mahmoud, et al.
Published: (2024)
MiniGPT-Pancreas: Multimodal Large Language Model for Pancreas Cancer Classification and Detection
by: Moglia, Andrea, et al.
Published: (2024)
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
by: Shen, Xiaoqian, et al.
Published: (2025)
VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding
by: Li, Xiang, et al.
Published: (2024)
ToddlerDiffusion: Interactive Structured Image Generation with Cascaded Schrödinger Bridge
by: Abdelrahman, Eslam, et al.
Published: (2023)
Mobile-VideoGPT: Fast and Accurate Model for Mobile Video Understanding
by: Shaker, Abdelrahman, et al.
Published: (2025)
Time Blindness: Why Video-Language Models Can't See What Humans Can?
by: Upadhyay, Ujjwal, et al.
Published: (2025)
STORM: Token-Efficient Long Video Understanding for Multimodal LLMs
by: Jiang, Jindong, et al.
Published: (2025)
Affective Visual Dialog: A Large-Scale Benchmark for Emotional Reasoning Based on Visually Grounded Conversations
by: Haydarov, Kilichbek, et al.
Published: (2023)
FishNet++: Analyzing the capabilities of Multimodal Large Language Models in marine biology
by: Khan, Faizan Farooq, et al.
Published: (2025)
MiniGPT-3D: Efficiently Aligning 3D Point Clouds with Large Language Models using 2D Priors
by: Tang, Yuan, et al.
Published: (2024)
Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)
TapToTab : Video-Based Guitar Tabs Generation using AI and Audio Analysis
by: Ghaleb, Ali, et al.
Published: (2024)
Principles of Visual Tokens for Efficient Video Understanding
by: Hao, Xinyue, et al.
Published: (2024)
VideoMathQA: Benchmarking Mathematical Reasoning via Multimodal Understanding in Videos
by: Rasheed, Hanoona, et al.
Published: (2025)
LongVU: Spatiotemporal Adaptive Compression for Long Video-Language Understanding
by: Shen, Xiaoqian, et al.
Published: (2024)
Progressive trends in prenatal genetic screening
by: Kirolos Eskandar
Published: (2022)
Liquid biopsy in genitourinary oncology: Current clinical applications and future prospects across prostate, bladder, and renal cancers
by: Kirolos Eskandar
Published: (2025)
Bioimpressão no Transplante de Órgãos: Dos modelos Experimentais às Perspectivas Clínicas
by: Kirolos Eskandar
Published: (2025)
REVISOR: Beyond Textual Reflection, Towards Multimodal Introspective Reasoning in Long-Form Video Understanding
by: Li, Jiaze, et al.
Published: (2025)
How Well Can Vision Language Models See Image Details?
by: Gou, Chenhui, et al.
Published: (2024)
The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation
by: Jung, Hoin, et al.
Published: (2026)
VRU-Accident: A Vision-Language Benchmark for Video Question Answering and Dense Captioning for Accident Scene Understanding
by: Kim, Younggun, et al.
Published: (2025)
Mixup Helps Understanding Multimodal Video Better
by: Ma, Xiaoyu, et al.
Published: (2025)
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token
by: Zhang, Shaolei, et al.
Published: (2025)
MAGNET: A Multi-agent Framework for Finding Audio-Visual Needles by Reasoning over Multi-Video Haystacks
by: Chowdhury, Sanjoy, et al.
Published: (2025)
Benchmarking the Trustworthiness in Multimodal LLMs for Video Understanding
by: Wang, Youze, et al.
Published: (2025)
LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding
by: Wu, Haoning, et al.
Published: (2024)
Sink-Token-Aware Pruning for Fine-Grained Video Understanding in Efficient Video LLMs
by: Kim, Kibum, et al.
Published: (2026)