:: Library Catalog

Bibliographic Details
Main Authors:	Li, Zinuo; Guo, Yongxin; Liu, Jun; Zhan, Jiawei; Jiang, Xi; Wang, Chengjie; Bennamoun, Mohammed; Boussaid, Farid; Zheng, Feng; Ke, Qiuhong
Format:	Preprint
Published:	2026
Subjects:	Computation and Language
Online Access:	https://arxiv.org/abs/2604.04415

Similar Items

Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM
by: Li, Zinuo, et al.
Published: (2025)

LatentMove: Towards Complex Human Movement Video Generation
by: Taghipour, Ashkan, et al.
Published: (2025)

Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades
by: Taghipour, Ashkan, et al.
Published: (2026)

Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions
by: Taghipour, Ashkan, et al.
Published: (2024)

AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
by: Zhang, Xian, et al.
Published: (2025)

DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition
by: Zhu, Jingmin, et al.
Published: (2025)

3D Brain and Heart Volume Generative Models: A Survey
by: Liu, Yanbin, et al.
Published: (2022)

Generalized Closed-form Formulae for Feature-based Subpixel Alignment in Patch-based Matching
by: Jospin, Laurent Valentin, et al.
Published: (2021)

Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering
by: Li, Haopeng, et al.
Published: (2024)

Admitting Ignorance Helps the Video Question Answering Models to Answer
by: Li, Haopeng, et al.
Published: (2025)

SVR-GS: Spatially Variant Regularization for Probabilistic Masks in 3D Gaussian Splatting
by: Taghipour, Ashkan, et al.
Published: (2025)

Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
by: Nizamani, Awais, et al.
Published: (2025)

Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation
by: Lyu, Yiheng, et al.
Published: (2025)

Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation
by: Xu, Lian, et al.
Published: (2024)

Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
by: Taghipour, Ashkan, et al.
Published: (2024)

A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
by: Khanam, Tahmina, et al.
Published: (2024)

A Riemannian Framework for the Elastic Analysis of the Spatiotemporal Variability in the Shape and Structure of Tree-like 4D Objects
by: Khanam, Tahmina, et al.
Published: (2025)

Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports
by: Li, Haopeng, et al.
Published: (2024)

Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation
by: Zhu, Jingmin, et al.
Published: (2025)

Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection
by: Sagar, A S M Sharifuzzaman, et al.
Published: (2026)

SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition
by: Wang, Ning, et al.
Published: (2026)

TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition
by: Liu, Yanan, et al.
Published: (2025)

UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation
by: Zhang, Chengyuan, et al.
Published: (2024)

PISTO: Proximal Inference for Stochastic Trajectory Optimization
by: Yu, Hongzhe, et al.
Published: (2026)

STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models
by: Raman, Narun, et al.
Published: (2025)

Language Model Guided Interpretable Video Action Reasoning
by: Wang, Ning, et al.
Published: (2024)

Efficient Iterative Proximal Variational Inference Motion Planning
by: Chang, Zinuo, et al.
Published: (2024)

LongDiff: Training-Free Long Video Generation in One Go
by: Li, Zhuoling, et al.
Published: (2025)

Enhancing Long Video Understanding via Hierarchical Event-Based Memory
by: Cheng, Dingxin, et al.
Published: (2024)

TRACE: Temporal Grounding Video LLM via Causal Event Modeling
by: Guo, Yongxin, et al.
Published: (2024)

EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction
by: Ge, Chengjie, et al.
Published: (2025)

Temporally Consistent Referring Video Object Segmentation with Hybrid Memory
by: Miao, Bo, et al.
Published: (2024)

UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers
by: Peng, Duo, et al.
Published: (2024)

Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels
by: Marrium, Maria, et al.
Published: (2024)

Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)

Deep Learning-based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey
by: Rajapaksha, Uchitha, et al.
Published: (2024)

STEER: Assessing the Economic Rationality of Large Language Models
by: Raman, Narun, et al.
Published: (2024)

Video-KTR: Reinforcing Video Reasoning via Key Token Attribution
by: Wang, Ziyue, et al.
Published: (2026)

Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning
by: Li, Ke, et al.
Published: (2024)

STEER: Flexible Robotic Manipulation via Dense Language Grounding
by: Smith, Laura, et al.
Published: (2024)