| Main Authors: | Samel, Karan, Sontakke, Nitish, Essa, Irfan |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2502.17352 |
Similar Items
Exploring Efficient Foundational Multi-modal Models for Video Summarization
by: Samel, Karan, et al.
Published: (2024)
On the Efficacy of Text-Based Input Modalities for Action Anticipation
by: Beedu, Apoorva, et al.
Published: (2024)
HierSum: A Global and Local Attention Mechanism for Video Summarization
by: Beedu, Apoorva, et al.
Published: (2025)
SLAIM: Robust Dense Neural SLAM for Online Tracking and Mapping
by: Cartillier, Vincent, et al.
Published: (2024)
3D Semantic MapNet: Building Maps for Multi-Object Re-Identification in 3D
by: Cartillier, Vincent, et al.
Published: (2024)
Efficient Pre-training for Localized Instruction Generation of Videos
by: Batra, Anil, et al.
Published: (2023)
Why Not Use Your Textbook? Knowledge-Enhanced Procedure Planning of Instructional Videos
by: Nagasinghe, Kumaranage Ravindu Yasas, et al.
Published: (2024)
Open-Event Procedure Planning in Instructional Videos
by: Wu, Yilu, et al.
Published: (2024)
MoCHA: Denoising Caption Supervision for Motion-Text Retrieval
by: Warner, Nikolai, et al.
Published: (2026)
Learning Procedural-aware Video Representations through State-Grounded Hierarchy Unfolding
by: Zhao, Jinghan, et al.
Published: (2025)
ViterbiPlanNet: Injecting Procedural Knowledge via Differentiable Viterbi for Planning in Instructional Videos
by: Seminara, Luigi, et al.
Published: (2026)
UniVid: Unifying Vision Tasks with Pre-trained Video Generation Models
by: Chen, Lan, et al.
Published: (2025)
CamViG: Camera Aware Image-to-Video Generation with Multimodal Transformers
by: Marmon, Andrew, et al.
Published: (2024)
Predicting Implicit Arguments in Procedural Video Instructions
by: Batra, Anil, et al.
Published: (2025)
RECIPE: Procedural Planning via Grounding in Instructional Video
by: Seminara, Luigi, et al.
Published: (2026)
PDPP: Projected Diffusion for Procedure Planning in Instructional Videos
by: Wang, Hanlin, et al.
Published: (2023)
Mamba Fusion: Learning Actions Through Questioning
by: Dong, Zhikang, et al.
Published: (2024)
SG-MIM: Structured Knowledge Guided Efficient Pre-training for Dense Prediction
by: Son, Sumin, et al.
Published: (2024)
Masked Temporal Interpolation Diffusion for Procedure Planning in Instructional Videos
by: Zhou, Yufan, et al.
Published: (2025)
VELVET-Med: Vision and Efficient Language Pre-training for Volumetric Imaging Tasks in Medicine
by: Zhang, Ziyang, et al.
Published: (2025)
Towards Data-Efficient Video Pre-training with Frozen Image Foundation Models
by: Orlova, Svetlana, et al.
Published: (2026)
Understanding Multimodal Procedural Knowledge by Sequencing Multimodal Instructional Manuals
by: Wu, Te-Lin, et al.
Published: (2021)
Contrastive Language Video Time Pre-training
by: Liu, Hengyue, et al.
Published: (2024)
EndoMamba: An Efficient Foundation Model for Endoscopic Videos via Hierarchical Pre-training
by: Tian, Qingyao, et al.
Published: (2025)
Learning Complex Non-Rigid Image Edits from Multimodal Conditioning
by: Warner, Nikolai, et al.
Published: (2024)
Leveraging Pre-trained CNNs for Efficient Feature Extraction in Rice Leaf Disease Classification
by: Sobuj, Md. Shohanur Islam, et al.
Published: (2024)
AugLift: Depth-Aware Input Reparameterization Improves Domain Generalization in 2D-to-3D Pose Lifting
by: Warner, Nikolai, et al.
Published: (2025)
LAP: A Language-Aware Planning Model For Procedure Planning In Instructional Videos
by: Shi, Lei, et al.
Published: (2026)
ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos
by: Shi, Lei, et al.
Published: (2024)
Neuro Symbolic Knowledge Reasoning for Procedural Video Question Answering
by: Fernando, Basura, et al.
Published: (2025)
Demo-ICL: In-Context Learning for Procedural Video Knowledge Acquisition
by: Dong, Yuhao, et al.
Published: (2026)
Mind the Interference: Retaining Pre-trained Knowledge in Parameter Efficient Continual Learning of Vision-Language Models
by: Tang, Longxiang, et al.
Published: (2024)
Generic Knowledge Boosted Pre-training For Remote Sensing Images
by: Huang, Ziyue, et al.
Published: (2024)
SNP-S3: Shared Network Pre-training and Significant Semantic Strengthening for Various Video-Text Tasks
by: Dong, Xingning, et al.
Published: (2024)
Less is More: Label-Guided Summarization of Procedural and Instructional Videos
by: Rajpal, Shreya, et al.
Published: (2026)
PTM-VQA: Efficient Video Quality Assessment Leveraging Diverse PreTrained Models from the Wild
by: Yuan, Kun, et al.
Published: (2024)
Repurposing Pre-trained Video Diffusion Models for Event-based Video Interpolation
by: Chen, Jingxi, et al.
Published: (2024)
ProcObject-10K: Benchmarking Object-Centric Procedural Understanding in Instructional Videos
by: Guo, Wenliang, et al.
Published: (2025)
Temporal-Consistent Video Restoration with Pre-trained Diffusion Models
by: Wang, Hengkang, et al.
Published: (2025)
Large-scale Pre-training for Grounded Video Caption Generation
by: Kazakos, Evangelos, et al.
Published: (2025)