| Main Authors: | Wu, Tz-Ying, Trigui, Tahani, Sridhar, Sharath Nittur, Bodas, Anand, Tripathi, Subarna |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.17050 |
Similar Items
Harnessing Object Grounding for Time-Sensitive Video Understanding
by: Wu, Tz-Ying, et al.
Published: (2025)
EASG-Bench: Video Q&A Benchmark with Egocentric Action Scene Graphs
by: Rodin, Ivan, et al.
Published: (2025)
VC-Inspector: Advancing Reference-free Evaluation of Video Captions with Factual Analysis
by: Dipta, Shubhashis Roy, et al.
Published: (2025)
Ego-VPA: Egocentric Video Understanding with Parameter-efficient Adaptation
by: Wu, Tz-Ying, et al.
Published: (2024)
Search2Motion: Training-Free Object-Level Motion Control via Attention-Consensus Search
by: Liu, Sainan, et al.
Published: (2026)
LLaMA-NAS: Efficient Neural Architecture Search for Large Language Models
by: Sarah, Anthony, et al.
Published: (2024)
LVLM-Compress-Bench: Benchmarking the Broader Impact of Large Vision-Language Model Compression
by: Kundu, Souvik, et al.
Published: (2025)
VideoSAGE: Video Summarization with Graph Representation Learning
by: Chaves, Jose M. Rojas, et al.
Published: (2024)
Improving the Throughput of Diffusion-based Large Language Models via a Training-Free Confidence-Aware Calibration
by: Shen, Jucheng, et al.
Published: (2025)
Contrastive Language Video Time Pre-training
by: Liu, Hengyue, et al.
Published: (2024)
PALADIN : Robust Neural Fingerprinting for Text-to-Image Diffusion Models
by: L, Murthy, et al.
Published: (2025)
SViTT-Ego: A Sparse Video-Text Transformer for Egocentric Video
by: Valdez, Hector A., et al.
Published: (2024)
Graph-Based Multimodal and Multi-view Alignment for Keystep Recognition
by: Romero, Julia Lee, et al.
Published: (2025)
Towards Training-free Multimodal Hate Localisation with Large Language Models
by: Sun, Yueming, et al.
Published: (2026)
TrajPred: Trajectory-Conditioned Joint Embedding Prediction for Surgical Instrument-Tissue Interaction Recognition in Vision-Language Models
by: Cheng, Jiajun, et al.
Published: (2026)
freePruner: A Training-free Approach for Large Multimodal Model Acceleration
by: Xu, Bingxin, et al.
Published: (2024)
ProTeCt: Prompt Tuning for Taxonomic Open Set Classification
by: Wu, Tz-Ying, et al.
Published: (2023)
ByDeWay: Boost Your multimodal LLM with DEpth prompting in a Training-Free Way
by: Roy, Rajarshi, et al.
Published: (2025)
Towards Language-Driven Video Inpainting via Multimodal Large Language Models
by: Wu, Jianzong, et al.
Published: (2024)
Harnessing Large Language Models for Training-free Video Anomaly Detection
by: Zanella, Luca, et al.
Published: (2024)
Keystep Recognition using Graph Neural Networks
by: Romero, Julia Lee, et al.
Published: (2025)
Multimodal Cinematic Video Synthesis Using Text-to-Image and Audio Generation Models
by: S, Sridhar, et al.
Published: (2025)
Multimodal Query-guided Object Localization
by: Tripathi, Aditay, et al.
Published: (2022)
VideoMerge: Towards Training-free Long Video Generation
by: Zhang, Siyang, et al.
Published: (2025)
FlowNar: Scalable Streaming Narration for Long-Form Videos
by: Zhong, Zeyun, et al.
Published: (2026)
SAISA: Towards Multimodal Large Language Models with Both Training and Inference Efficiency
by: Yuan, Qianhao, et al.
Published: (2025)
Towards Effective Long Video Understanding of Multimodal Large Language Models via One-shot Clip Retrieval
by: Chen, Tao, et al.
Published: (2025)
Objects in Generated Videos Are Slower Than They Appear: Models Suffer Sub-Earth Gravity and Don't Know Galileo's Principle...for now
by: Thozhiyoor, Varun Varma, et al.
Published: (2025)
Video-OPD: Efficient Post-Training of Multimodal Large Language Models for Temporal Video Grounding via On-Policy Distillation
by: Li, Jiaze, et al.
Published: (2026)
A Survey on Video Temporal Grounding with Multimodal Large Language Model
by: Wu, Jianlong, et al.
Published: (2025)
MindVL: Towards Efficient and Effective Training of Multimodal Large Language Models on Ascend NPUs
by: Chen, Feilong, et al.
Published: (2025)
Towards Training-free Anomaly Detection with Vision and Language Foundation Models
by: Zhang, Jinjin, et al.
Published: (2025)
VideoAuteur: Towards Long Narrative Video Generation
by: Xiao, Junfei, et al.
Published: (2025)
Video-LMM Post-Training: A Deep Dive into Video Reasoning with Large Multimodal Models
by: Tang, Yolo Y., et al.
Published: (2025)
VADER: Towards Causal Video Anomaly Understanding with Relation-Aware Large Language Models
by: Cheng, Ying, et al.
Published: (2025)
Enhancing Model Performance: Another Approach to Vision-Language Instruction Tuning
by: Vedanshu, et al.
Published: (2024)
Training-free Video Temporal Grounding using Large-scale Pre-trained Models
by: Zheng, Minghang, et al.
Published: (2024)
SCaRL- A Synthetic Multi-Modal Dataset for Autonomous Driving
by: Ramesh, Avinash Nittur, et al.
Published: (2024)
Toward Cognitive Supersensing in Multimodal Large Language Model
by: Li, Boyi, et al.
Published: (2026)
Revolutionizing Mixed Precision Quantization: Towards Training-free Automatic Proxy Discovery via Large Language Models
by: Kang, Haidong, et al.
Published: (2025)