| Main Authors: | Li, Zinuo, Guo, Yongxin, Liu, Jun, Zhan, Jiawei, Jiang, Xi, Wang, Chengjie, Bennamoun, Mohammed, Boussaid, Farid, Zheng, Feng, Ke, Qiuhong |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2604.04415 |
Similar Items
Watch and Listen: Understanding Audio-Visual-Speech Moments with Multimodal LLM
by: Li, Zinuo, et al.
Published: (2025)
LatentMove: Towards Complex Human Movement Video Generation
by: Taghipour, Ashkan, et al.
Published: (2025)
Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades
by: Taghipour, Ashkan, et al.
Published: (2026)
Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions
by: Taghipour, Ashkan, et al.
Published: (2024)
AdaRD-key: Adaptive Relevance-Diversity Keyframe Sampling for Long-form Video understanding
by: Zhang, Xian, et al.
Published: (2025)
DynaPURLS: Dynamic Refinement of Part-Aware Representations for Skeleton-Based Zero-Shot Action Recognition
by: Zhu, Jingmin, et al.
Published: (2025)
3D Brain and Heart Volume Generative Models: A Survey
by: Liu, Yanbin, et al.
Published: (2022)
Generalized Closed-form Formulae for Feature-based Subpixel Alignment in Patch-based Matching
by: Jospin, Laurent Valentin, et al.
Published: (2021)
Answering from Sure to Uncertain: Uncertainty-Aware Curriculum Learning for Video Question Answering
by: Li, Haopeng, et al.
Published: (2024)
Admitting Ignorance Helps the Video Question Answering Models to Answer
by: Li, Haopeng, et al.
Published: (2025)
SVR-GS: Spatially Variant Regularization for Probabilistic Masks in 3D Gaussian Splatting
by: Taghipour, Ashkan, et al.
Published: (2025)
Dynamic Neural Surfaces for Elastic 4D Shape Representation and Analysis
by: Nizamani, Awais, et al.
Published: (2025)
Hybrid Transformer-Mamba Architecture for Weakly Supervised Volumetric Medical Segmentation
by: Lyu, Yiheng, et al.
Published: (2025)
Auxiliary Tasks Enhanced Dual-affinity Learning for Weakly Supervised Semantic Segmentation
by: Xu, Lian, et al.
Published: (2024)
Box It to Bind It: Unified Layout Control and Attribute Binding in T2I Diffusion Models
by: Taghipour, Ashkan, et al.
Published: (2024)
A Riemannian Approach for Spatiotemporal Analysis and Generation of 4D Tree-shaped Structures
by: Khanam, Tahmina, et al.
Published: (2024)
A Riemannian Framework for the Elastic Analysis of the Spatiotemporal Variability in the Shape and Structure of Tree-like 4D Objects
by: Khanam, Tahmina, et al.
Published: (2025)
Sports-QA: A Large-Scale Video Question Answering Benchmark for Complex and Professional Sports
by: Li, Haopeng, et al.
Published: (2024)
Boosting Skeleton-based Zero-Shot Action Recognition with Training-Free Test-Time Adaptation
by: Zhu, Jingmin, et al.
Published: (2025)
Fact or Fake? Assessing the Role of Deepfake Detectors in Multimodal Misinformation Detection
by: Sagar, A S M Sharifuzzaman, et al.
Published: (2026)
SkeletonContext: Skeleton-side Context Prompt Learning for Zero-Shot Skeleton-based Action Recognition
by: Wang, Ning, et al.
Published: (2026)
TSkel-Mamba: Temporal Dynamic Modeling via State Space Model for Human Skeleton-based Action Recognition
by: Liu, Yanan, et al.
Published: (2025)
UIFormer: A Unified Transformer-based Framework for Incremental Few-Shot Object Detection and Instance Segmentation
by: Zhang, Chengyuan, et al.
Published: (2024)
PISTO: Proximal Inference for Stochastic Trajectory Optimization
by: Yu, Hongzhe, et al.
Published: (2026)
STEER-ME: Assessing the Microeconomic Reasoning of Large Language Models
by: Raman, Narun, et al.
Published: (2025)
Language Model Guided Interpretable Video Action Reasoning
by: Wang, Ning, et al.
Published: (2024)
Efficient Iterative Proximal Variational Inference Motion Planning
by: Chang, Zinuo, et al.
Published: (2024)
LongDiff: Training-Free Long Video Generation in One Go
by: Li, Zhuoling, et al.
Published: (2025)
Enhancing Long Video Understanding via Hierarchical Event-Based Memory
by: Cheng, Dingxin, et al.
Published: (2024)
TRACE: Temporal Grounding Video LLM via Causal Event Modeling
by: Guo, Yongxin, et al.
Published: (2024)
EventMamba: Enhancing Spatio-Temporal Locality with State Space Models for Event-Based Video Reconstruction
by: Ge, Chengjie, et al.
Published: (2025)
Temporally Consistent Referring Video Object Segmentation with Hybrid Memory
by: Miao, Bo, et al.
Published: (2024)
UPAM: Unified Prompt Attack in Text-to-Image Generation Models Against Both Textual Filters and Visual Checkers
by: Peng, Duo, et al.
Published: (2024)
Implicit to Explicit Entropy Regularization: Benchmarking ViT Fine-tuning under Noisy Labels
by: Marrium, Maria, et al.
Published: (2024)
Omni2Sound: Towards Unified Video-Text-to-Audio Generation
by: Dai, Yusheng, et al.
Published: (2026)
Deep Learning-based Depth Estimation Methods from Monocular Image and Videos: A Comprehensive Survey
by: Rajapaksha, Uchitha, et al.
Published: (2024)
STEER: Assessing the Economic Rationality of Large Language Models
by: Raman, Narun, et al.
Published: (2024)
Video-KTR: Reinforcing Video Reasoning via Key Token Attribution
by: Wang, Ziyue, et al.
Published: (2026)
Human-in-the-Loop Policy Optimization for Preference-Based Multi-Objective Reinforcement Learning
by: Li, Ke, et al.
Published: (2024)
STEER: Flexible Robotic Manipulation via Dense Language Grounding
by: Smith, Laura, et al.
Published: (2024)