:: Library Catalog

Bibliographic Details
Main Authors:	Shi, Huafeng; Liang, Jianzhong; Xie, Rongchang; Wu, Xian; Chen, Cheng; Liu, Chang
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2505.10584

Similar Items

MUSE-VL: Modeling Unified VLM through Semantic Discrete Encoding
by: Xie, Rongchang, et al.
Published: (2024)

Degradation-Robust Fusion: An Efficient Degradation-Aware Diffusion Framework for Multimodal Image Fusion in Arbitrary Degradation Scenarios
by: Shi, Yu, et al.
Published: (2026)

Boosting Segment Anything Model to Generalize Visually Non-Salient Scenarios
by: Guo, Guangqian, et al.
Published: (2026)

DiTPainter: Efficient Video Inpainting with Diffusion Transformers
by: Wu, Xian, et al.
Published: (2025)

VideoMAC: Video Masked Autoencoders Meet ConvNets
by: Pei, Gensheng, et al.
Published: (2024)

TrajLoom: Dense Future Trajectory Generation from Video
by: Zhang, Zewei, et al.
Published: (2026)

Plenoptic Video Generation
by: Fu, Xiao, et al.
Published: (2026)

Video Streaming Thinking: VideoLLMs Can Watch and Think Simultaneously
by: Guan, Yiran, et al.
Published: (2026)

Dual-task Mutual Reinforcing Embedded Joint Video Paragraph Retrieval and Grounding
by: Wang, Mengzhao, et al.
Published: (2024)

ThinkOmni: Lifting Textual Reasoning to Omni-modal Scenarios via Guidance Decoding
by: Guan, Yiran, et al.
Published: (2026)

MME-VideoOCR: Evaluating OCR-Based Capabilities of Multimodal LLMs in Video Scenarios
by: Shi, Yang, et al.
Published: (2025)

DNA Family: Boosting Weight-Sharing NAS with Block-Wise Supervisions
by: Wang, Guangrun, et al.
Published: (2024)

SWinMamba: Serpentine Window State Space Model for Vascular Segmentation
by: Zhao, Rongchang, et al.
Published: (2025)

IndustryEQA: Pushing the Frontiers of Embodied Question Answering in Industrial Scenarios
by: Li, Yifan, et al.
Published: (2025)

SViMo: Synchronized Diffusion for Video and Motion Generation in Hand-object Interaction Scenarios
by: Dang, Lingwei, et al.
Published: (2025)

E.T. Bench: Towards Open-Ended Event-Level Video-Language Understanding
by: Liu, Ye, et al.
Published: (2024)

ALL-PET: A Low-resource and Low-shot PET Foundation Model in Projection Domain
by: Huang, Bin, et al.
Published: (2025)

NegVSR: Augmenting Negatives for Generalized Noise Modeling in Real-World Video Super-Resolution
by: Song, Yexing, et al.
Published: (2023)

Event-Causal RAG: A Retrieval-Augmented Generation Framework for Long Video Reasoning in Complex Scenarios
by: Yan, Peizheng, et al.
Published: (2026)

PresentAgent: Multimodal Agent for Presentation Video Generation
by: Shi, Jingwei, et al.
Published: (2025)

GenDDS: Generating Diverse Driving Video Scenarios with Prompt-to-Video Generative Model
by: Fu, Yongjie, et al.
Published: (2024)

StreamGVE: Training-Free Video Editing via Few-Step Streaming Video Generation
by: Jiao, Guanlong, et al.
Published: (2026)

TiViBench: Benchmarking Think-in-Video Reasoning for Video Generative Models
by: Chen, Harold Haodong, et al.
Published: (2025)

Generative Scenario Rollouts for End-to-End Autonomous Driving
by: Yasarla, Rajeev, et al.
Published: (2026)

Investigating Memorization in Video Diffusion Models
by: Chen, Chen, et al.
Published: (2024)

InstanceV: Instance-Level Video Generation
by: Chen, Yuheng, et al.
Published: (2025)

Exploring Spatiotemporal Feature Propagation for Video-Level Compressive Spectral Reconstruction: Dataset, Model and Benchmark
by: Cai, Lijing, et al.
Published: (2026)

[CLS] is Not Enough: Multi-Label Recognition via Patch-Level Inference and Adaptive Aggregation
by: Wang, Akang, et al.
Published: (2026)

Advancing Video Self-Supervised Learning via Image Foundation Models
by: Wu, Jingwei, et al.
Published: (2025)

Video-as-Answer: Predict and Generate Next Video Event with Joint-GRPO
by: Cheng, Junhao, et al.
Published: (2025)

SmoothVideo: Smooth Video Synthesis with Noise Constraints on Diffusion Models for One-shot Video Tuning
by: Peng, Liang, et al.
Published: (2023)

MSVBench: Towards Human-Level Evaluation of Multi-Shot Video Generation
by: Shi, Haoyuan, et al.
Published: (2026)

Exploring Few-Shot Defect Segmentation in General Industrial Scenarios with Metric Learning and Vision Foundation Models
by: Liu, Tongkun, et al.
Published: (2025)

VistaGEN: Consistent Driving Video Generation with Fine-Grained Control Using Multiview Visual-Language Reasoning
by: Chen, Li-Heng, et al.
Published: (2026)

MambaTrans: Multimodal Fusion Image Translation via Large Language Model Priors for Downstream Visual Tasks
by: Xu, Yushen, et al.
Published: (2025)

MCAM: Multimodal Causal Analysis Model for Ego-Vehicle-Level Driving Video Understanding
by: Cheng, Tongtong, et al.
Published: (2025)

PanFlow: Decoupled Motion Control for Panoramic Video Generation
by: Zhang, Cheng, et al.
Published: (2025)

Enhancing Physical Plausibility in Video Generation by Reasoning the Implausibility
by: Hao, Yutong, et al.
Published: (2025)

CityLLaVA: Efficient Fine-Tuning for VLMs in City Scenario
by: Duan, Zhizhao, et al.
Published: (2024)

VideoMAP: Toward Scalable Mamba-based Video Autoregressive Pretraining
by: Liu, Yunze, et al.
Published: (2025)