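| Main Authors: | He, Bo; Li, Hengduo; Jang, Young Kyun; Jia, Menglin; Cao, Xuefei; Shah, Ashish; Shrivastava, Abhinav; Lim, Ser-Nam |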
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition |
| Online Access: | https://arxiv.org/abs/2404.05726 |
| _version_ | 1866913327223930880 |
|---|---|
| author | He, Bo; Li, Hengduo; Jang, Young Kyun; Jia, Menglin; Cao, Xuefei; Shah, Ashish; Shrivastava, Abhinav; Lim, Ser-Nam |
| author_facet | He, Bo; Li, Hengduo; Jang, Young Kyun; Jia, Menglin; Cao, Xuefei; Shah, Ashish; Shrivastava, Abhinav; Lim, Ser-Nam |
| contents | With the success of large language models (LLMs), integrating the vision model into LLMs to build vision-language foundation models has gained much more interest recently. However, existing LLM-based large multimodal models (e.g., Video-LLaMA, VideoChat) can only take in a limited number of frames for short video understanding. In this study, we mainly focus on designing an efficient and effective model for long-term video understanding. Instead of trying to process more frames simultaneously like most existing work, we propose to process videos in an online manner and store past video information in a memory bank. This allows our model to reference historical video content for long-term analysis without exceeding LLMs' context length constraints or GPU memory limits. Our memory bank can be seamlessly integrated into current multimodal LLMs in an off-the-shelf manner. We conduct extensive experiments on various video understanding tasks, such as long-video understanding, video question answering, and video captioning, and our model can achieve state-of-the-art performances across multiple datasets. Code available at https://boheumd.github.io/MA-LMM/. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2404_05726 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | MA-LMM: Memory-Augmented Large Multimodal Model for Long-Term Video Understanding |
| topic | Computer Vision and Pattern Recognition |
| url | https://arxiv.org/abs/2404.05726 |