| Main Authors: | Hu, Yangliu; Song, Zikai; Feng, Na; Luo, Yawei; Yu, Junqing; Chen, Yi-Ping Phoebe; Yang, Wei |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence |
| Online Access: | https://arxiv.org/abs/2504.07745 |
| author | Hu, Yangliu; Song, Zikai; Feng, Na; Luo, Yawei; Yu, Junqing; Chen, Yi-Ping Phoebe; Yang, Wei |
| contents | Video-based Large Language Models (Video-LLMs) have advanced substantially in recent years, propelled by progress in multi-modal LLMs. Although these models are proficient at describing videos overall, they struggle with fine-grained understanding, particularly visual dynamics and queries about video details. To tackle these shortcomings, we find that fine-tuning Video-LLMs on self-supervised fragment tasks greatly improves their fine-grained video understanding abilities. Hence we propose two key contributions: (1) Self-Supervised Fragment Fine-Tuning (SF$^2$T), a novel, effortless fine-tuning method that uses the rich inherent characteristics of videos for training while unlocking more fine-grained understanding ability in Video-LLMs. Moreover, it relieves researchers from labor-intensive annotation and circumvents the limitations of natural language, which often fails to capture the complex spatiotemporal variations in videos; (2) FineVidBench, a novel benchmark dataset for rigorously assessing Video-LLMs' performance at both the scene and fragment levels, offering a comprehensive evaluation of their capabilities. We assessed multiple models and validated the effectiveness of SF$^2$T on them. Experimental results show that our approach improves their ability to capture and interpret spatiotemporal details. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2504_07745 |
| institution | arXiv |
| publishDate | 2025 |
| record_format | arxiv |
| title | SF2T: Self-supervised Fragment Finetuning of Video-LLMs for Fine-Grained Understanding |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence; 68T45; I.4.8; I.5 |
| url | https://arxiv.org/abs/2504.07745 |