| Main Authors: | Lu, Sicheng, Xiao, Zikai, Wei, Jianhui, Sun, Danyu, Lu, Qi, Hu, Keli, Feng, Yang, Wu, Jian, Yang, Zongxin, Liu, Zuozhu |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.29966 |
Similar Items
SurgBench: A Unified Large-Scale Benchmark for Surgical Video Analysis
by: Wei, Jianhui, et al.
Published: (2025)
PX2Tooth: Reconstructing the 3D Point Cloud Teeth from a Single Panoramic X-ray
by: Ma, Wen, et al.
Published: (2024)
Rotation-free Online Handwritten Character Recognition Using Linear Recurrent Units
by: Ling, Zhe, et al.
Published: (2026)
UniVBench: Towards Unified Evaluation for Video Foundation Models
by: Wei, Jianhui, et al.
Published: (2026)
HICT: High-precision 3D CBCT reconstruction from a single X-ray
by: Ma, Wen, et al.
Published: (2026)
In-Context Edit: Enabling Instructional Image Editing with In-Context Generation in Large Scale Diffusion Transformer
by: Zhang, Zechuan, et al.
Published: (2025)
PBE-UNet: A light weight Progressive Boundary-Enhanced U-Net with Scale-Aware Aggregation for Ultrasound Image Segmentation
by: Wang, Chen, et al.
Published: (2026)
Are Image-to-Video Models Good Zero-Shot Image Editors?
by: Zhang, Zechuan, et al.
Published: (2025)
TRACE: High-Fidelity 3D Scene Editing via Tangible Reconstruction and Geometry-Aligned Contextual Video Masking
by: Hu, Jiyuan, et al.
Published: (2026)
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
by: Gong, Sitong, et al.
Published: (2025)
DreamRenderer: Taming Multi-Instance Attribute Control in Large-Scale Text-to-Image Models
by: Zhou, Dewei, et al.
Published: (2025)
From Pretrain to Pain: Adversarial Vulnerability of Video Foundation Models Without Task Knowledge
by: Lu, Hui, et al.
Published: (2025)
Streaming Video Understanding and Multi-round Interaction with Memory-enhanced Knowledge
by: Xiong, Haomiao, et al.
Published: (2025)
Med-R2: An Adversarial Benchmark for Evidence-Grounded Reasoning in Medical VLMs
by: Ma, Wen, et al.
Published: (2026)
Scalable Video Object Segmentation with Identification Mechanism
by: Yang, Zongxin, et al.
Published: (2022)
IDPro: Flexible Interactive Video Object Segmentation by ID-queried Concurrent Propagation
by: Li, Kexin, et al.
Published: (2024)
Origin Identification for Text-Guided Image-to-Image Diffusion Models
by: Wang, Wenhao, et al.
Published: (2025)
GD^2-NeRF: Generative Detail Compensation via GAN and Diffusion for One-shot Generalizable Neural Radiance Fields
by: Pan, Xiao, et al.
Published: (2024)
Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation
by: Liang, Chen, et al.
Published: (2021)
Visual Instruction Pretraining for Domain-Specific Foundation Models
by: Li, Yuxuan, et al.
Published: (2025)
MindCine: Multimodal EEG-to-Video Reconstruction with Large-Scale Pretrained Models
by: Zhou, Tian-Yi, et al.
Published: (2026)
SurgMotion: A Video-Native Foundation Model for Universal Understanding of Surgical Videos
by: Wu, Jinlin, et al.
Published: (2026)
VAEVQ: Enhancing Discrete Visual Tokenization through Variational Modeling
by: Yang, Sicheng, et al.
Published: (2025)
DoraemonGPT: Toward Understanding Dynamic Scenes with Large Language Models (Exemplified as A Video Agent)
by: Yang, Zongxin, et al.
Published: (2024)
MedThink: Explaining Medical Visual Question Answering via Multimodal Decision-Making Rationale
by: Gai, Xiaotang, et al.
Published: (2024)
Scaling Dense Event-Stream Pretraining from Visual Foundation Models
by: Chen, Zhiwen, et al.
Published: (2026)
SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction
by: Zhang, Zechuan, et al.
Published: (2023)
3D Object Manipulation in a Single Image using Generative Models
by: Zhao, Ruisi, et al.
Published: (2025)
Video Anomaly Detection with Motion and Appearance Guided Patch Diffusion Model
by: Zhou, Hang, et al.
Published: (2024)
How Far Are Video Models from True Multimodal Reasoning?
by: Zhang, Xiaotian, et al.
Published: (2026)
Replication in Visual Diffusion Models: A Survey and Outlook
by: Wang, Wenhao, et al.
Published: (2024)
Learning to Drive is a Free Gift: Large-Scale Label-Free Autonomy Pretraining from Unposed In-The-Wild Videos
by: Strong, Matthew, et al.
Published: (2026)
Opening the Black Box: Preliminary Insights into Affective Modeling in Multimodal Foundation Models
by: Zhang, Zhen, et al.
Published: (2026)
Streaming Video Diffusion: Online Video Editing with Diffusion Models
by: Chen, Feng, et al.
Published: (2024)
Harvest Video Foundation Models via Efficient Post-Pretraining
by: Li, Yizhuo, et al.
Published: (2023)
HSCR: Hierarchical Self-Contrastive Rewarding for Aligning Medical Vision Language Models
by: Jiang, Songtao, et al.
Published: (2025)
Systematic Evaluation and Guidelines for Segment Anything Model in Surgical Video Analysis
by: Yuan, Cheng, et al.
Published: (2024)
Realistic Surgical Simulation from Monocular Videos
by: Wang, Kailing, et al.
Published: (2024)
OphCLIP: Hierarchical Retrieval-Augmented Learning for Ophthalmic Surgical Video-Language Pretraining
by: Hu, Ming, et al.
Published: (2024)
Seaweed-7B: Cost-Effective Training of Video Generation Foundation Model
by: Seawead, Team, et al.
Published: (2025)