Library Catalog

Bibliographic Details
Main Author:	Nguyen, Thong Thanh
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2602.00683

Similar Items

Multi-Scale Contrastive Learning for Video Temporal Grounding
by: Nguyen, Thong Thanh, et al.
Published: (2024)

Temporal-Oriented Recipe for Transferring Large Vision-Language Model to Video Understanding
by: Nguyen, Thong, et al.
Published: (2025)

Motion-aware Contrastive Learning for Temporal Panoptic Scene Graph Generation
by: Nguyen, Thong Thanh, et al.
Published: (2024)

Encoding and Controlling Global Semantics for Long-form Video Question Answering
by: Nguyen, Thong Thanh, et al.
Published: (2024)

Lightweight Models for Emotional Analysis in Video
by: Nguyen, Quoc-Tien, et al.
Published: (2025)

MOOSE: Pay Attention to Temporal Dynamics for Video Understanding via Optical Flows
by: Nguyen, Hong, et al.
Published: (2025)

Unified Interactive Multimodal Moment Retrieval via Cascaded Embedding-Reranking and Temporal-Aware Score Fusion
by: Thanh, Toan Le Ngo, et al.
Published: (2025)

Tracking the Truth: Object-Centric Spatio-Temporal Monitoring for Video Large Language Models
by: Cao, Tri, et al.
Published: (2026)

DemaFormer: Damped Exponential Moving Average Transformer with Energy-Based Modeling for Temporal Language Grounding
by: Nguyen, Thong, et al.
Published: (2023)

One-Stage Open-Vocabulary Temporal Action Detection Leveraging Temporal Multi-scale and Action Label Features
by: Nguyen, Trung Thanh, et al.
Published: (2024)

MADTempo: An Interactive System for Multi-Event Temporal Video Retrieval with Query Augmentation
by: Vu, Huu-An, et al.
Published: (2025)

Temporal2Seq: A Unified Framework for Temporal Video Understanding Tasks
by: Yang, Min, et al.
Published: (2024)

V-CORE: Temporally Consistent Video Understanding for Video-LLM
by: Kang, Zhengjian, et al.
Published: (2026)

VideoExpert: Augmented LLM for Temporal-Sensitive Video Understanding
by: Zhao, Henghao, et al.
Published: (2025)

BTS-rPPG: Orthogonal Butterfly Temporal Shifting for Remote Photoplethysmography
by: Nguyen, Ba-Thinh, et al.
Published: (2026)

LensWalk: Agentic Video Understanding by Planning How You See in Videos
by: Li, Keliang, et al.
Published: (2026)

Seeing Through the Tool: A Controlled Benchmark for Occlusion Robustness in Foundation Segmentation Models
by: Ho, Nhan, et al.
Published: (2026)

READ: Recurrent Adapter with Partial Video-Language Alignment for Parameter-Efficient Transfer Learning in Low-Resource Video-Language Modeling
by: Nguyen, Thong, et al.
Published: (2023)

VideoLoom: A Video Large Language Model for Joint Spatial-Temporal Understanding
by: Shi, Jiapeng, et al.
Published: (2026)

Incentivizing Temporal-Awareness in Egocentric Video Understanding Models
by: Xu, Zhiyang, et al.
Published: (2026)

Multimodal Contextualized Support for Enhancing Video Retrieval System
by: Nguyen-Le, Quoc-Bao, et al.
Published: (2024)

Understanding Machine Unlearning Through the Lens of Mode Connectivity
by: Cheng, Jiali, et al.
Published: (2025)

Video-QTR: Query-Driven Temporal Reasoning Framework for Lightweight Video Understanding
by: Zhao, Xinkui, et al.
Published: (2025)

TimeLens: Rethinking Video Temporal Grounding with Multimodal LLMs
by: Zhang, Jun, et al.
Published: (2025)

SoccerLens: Grounded Soccer Video Understanding Beyond Accuracy
by: Elsharkawi, Ismael, et al.
Published: (2026)

Insect-Foundation: A Foundation Model and Large Multimodal Dataset for Vision-Language Insect Understanding
by: Truong, Thanh-Dat, et al.
Published: (2025)

TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes
by: Zhou, Xingcheng, et al.
Published: (2025)

Enhancing Temporal Understanding in Video-LLMs through Stacked Temporal Attention in Vision Encoders
by: Rasekh, Ali, et al.
Published: (2025)

Med-StepBench: A Hierarchical Reasoning Framework for Evaluating Hallucinations in Medical Vision-Language Models
by: Nguyen, Minh Khoi, et al.
Published: (2026)

EgoGraph: Temporal Knowledge Graph for Egocentric Video Understanding
by: Sun, Shitong, et al.
Published: (2026)

Insect-Foundation: A Foundation Model and Large-scale 1M Dataset for Visual Insect Understanding
by: Nguyen, Hoang-Quan, et al.
Published: (2023)

T*: Re-thinking Temporal Search for Long-Form Video Understanding
by: Ye, Jinhui, et al.
Published: (2025)

STOP: Integrated Spatial-Temporal Dynamic Prompting for Video Understanding
by: Liu, Zichen, et al.
Published: (2025)

Test-Time Temporal Sampling for Efficient MLLM Video Understanding
by: Wang, Kaibin, et al.
Published: (2025)

HIG: Hierarchical Interlacement Graph Approach to Scene Graph Generation in Video Understanding
by: Nguyen, Trong-Thuan, et al.
Published: (2023)

Q-Adapter: Visual Query Adapter for Extracting Textually-related Features in Video Captioning
by: Chen, Junan, et al.
Published: (2025)

TI-JEPA: An Innovative Energy-based Joint Embedding Strategy for Text-Image Multimodal Systems
by: Vo, Khang H. N., et al.
Published: (2025)

FiLA-Video: Spatio-Temporal Compression for Fine-Grained Long Video Understanding
by: Guo, Yanan, et al.
Published: (2025)

SVBench: A Benchmark with Temporal Multi-Turn Dialogues for Streaming Video Understanding
by: Yang, Zhenyu, et al.
Published: (2025)

FrameDiT: Diffusion Transformer with Matrix Attention for Efficient Video Generation
by: Le, Minh Khoa, et al.
Published: (2026)