| Main Authors: | Prasad, Ashish; Jeevan, Pranav; Sethi, Amit |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning |
| Online Access: | https://arxiv.org/abs/2409.14724 |
| _version_ | 1866929509903630336 |
|---|---|
| author | Prasad, Ashish; Jeevan, Pranav; Sethi, Amit |
| author_facet | Prasad, Ashish; Jeevan, Pranav; Sethi, Amit |
| contents | Current video summarization methods largely rely on transformer-based architectures, which, due to their quadratic complexity, require substantial computational resources. In this work, we address these inefficiencies by enhancing the Detect-to-Summarize Network (DSNet) with more resource-efficient token mixing mechanisms. We show that replacing traditional attention with alternatives such as Fourier transforms, wavelet transforms, and Nyströmformer improves efficiency and performance. Furthermore, we explore various pooling strategies within the Region Proposal Network, including ROI pooling, Fast Fourier Transform pooling, and flat pooling. Our experimental results on the TVSum and SumMe datasets demonstrate that these modifications significantly reduce computational costs while maintaining competitive summarization performance. Thus, our work offers a more scalable solution for video summarization tasks. |
| format | Preprint |
| id | arxiv_https___arxiv_org_abs_2409_14724 |
| institution | arXiv |
| publishDate | 2024 |
| record_format | arxiv |
| title | EDSNet: Efficient-DSNet for Video Summarization |
| topic | Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning; I.4.10; I.4.0; I.4.9; I.2.10 |
| url | https://arxiv.org/abs/2409.14724 |
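
The abstract names Fourier transforms as one replacement for attention-based token mixing. The sketch below illustrates the general idea in the style of FNet: apply a Fourier transform across the feature axis, then across the frame axis, and keep the real part. It is not the authors' implementation, since this record does not describe EDSNet's layers. The function names `fft` and `fourier_token_mixing`, the pure-Python radix-2 FFT, and the requirement that both dimensions be powers of two are all assumptions made for illustration.

```python
import cmath


def fft(values):
    """Recursive radix-2 Cooley-Tukey FFT (illustrative helper).

    Assumption: len(values) is a power of two.
    """
    n = len(values)
    if n == 1:
        return [complex(values[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(values[0::2])
    odd = fft(values[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def fourier_token_mixing(frames):
    """Parameter-free token mixing over a frames x features matrix.

    Hypothetical layout: one row per video frame, one column per feature.
    """
    # Mix within each frame, across its features.
    mixed_features = [fft(row) for row in frames]
    # Mix across frames, one feature column at a time.
    columns = [fft(list(col)) for col in zip(*mixed_features)]
    # Transpose back to frames x features and keep only the real part.
    return [
        [columns[j][i].real for j in range(len(columns))]
        for i in range(len(frames))
    ]


if __name__ == "__main__":
    # Four frames with two features each (toy values).
    toy_frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
    for row in fourier_token_mixing(toy_frames):
        print(row)
```

The mixing step has no learned weights, and with an FFT its cost grows as n log n in the number of frames rather than quadratically, which is the kind of saving the abstract points to. A practical model would use a library FFT on batched tensors rather than this pure-Python helper.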