Bibliographic Details
Main Authors:	Zhou, Yikang; Zhang, Tao; Gong, Dengxian; Wu, Yuanzheng; Tian, Ye; Wang, Haochen; Yuan, Haobo; Wang, Jiacong; Qi, Lu; Fei, Hao; Wang, Anran; Wang, Zhuochen; Wang, Yujing; Chen, Cheng; Ji, Shunping; Li, Xiangtai
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2601.16093

Similar Items

SaSaSaSa2VA: 2nd Place of the 5th PVUW MeViS-Text Track
by: Gong, Dengxian, et al.
Published: (2026)

The 1st Solution for 7th LSVOS RVOS Track: SaSaSa2VA
by: Niu, Quanzhu, et al.
Published: (2025)

Grasp Any Region: Towards Precise, Contextual Pixel Understanding for Multimodal LLMs
by: Wang, Haochen, et al.
Published: (2025)

DeH4R: A Decoupled and Hybrid Method for Road Network Graph Extraction
by: Gong, Dengxian, et al.
Published: (2025)

PairUni: Pairwise Training for Unified Multimodal Language Models
by: Zheng, Jiani, et al.
Published: (2025)

Open-o3-Video: Grounded Video Reasoning with Explicit Spatio-Temporal Evidence
by: Meng, Jiahao, et al.
Published: (2025)

Traceable Evidence Enhanced Visual Grounded Reasoning: Evaluation and Methodology
by: Wang, Haochen, et al.
Published: (2025)

DVIS-DAQ: Improving Video Segmentation via Dynamic Anchor Queries
by: Zhou, Yikang, et al.
Published: (2024)

Dense360: Dense Understanding from Omnidirectional Panoramas
by: Zhou, Yikang, et al.
Published: (2025)

MMaDA-Parallel: Multimodal Large Diffusion Language Models for Thinking-Aware Editing and Generation
by: Tian, Ye, et al.
Published: (2025)

Point Cloud Mamba: Point Cloud Learning via State Space Model
by: Zhang, Tao, et al.
Published: (2024)

OMG-LLaVA: Bridging Image-level, Object-level, Pixel-level Reasoning and Understanding
by: Zhang, Tao, et al.
Published: (2024)

The Scalability of Simplicity: Empirical Analysis of Vision-Language Learning with a Single Transformer
by: Lei, Weixian, et al.
Published: (2025)

Are They the Same? Exploring Visual Correspondence Shortcomings of Multimodal LLMs
by: Zhou, Yikang, et al.
Published: (2025)

Beyond Appearance: Geometric Cues for Robust Video Instance Segmentation
by: Niu, Quanzhu, et al.
Published: (2025)

Visual Reasoning Tracer: Object-Level Grounded Reasoning Benchmark
by: Yuan, Haobo, et al.
Published: (2025)

P2PFormer: A Primitive-to-polygon Method for Regular Building Contour Extraction from Remote Sensing Images
by: Zhang, Tao, et al.
Published: (2024)

Innovative methods of using information technology in teaching stringed instruments in college
by: Wang, Yuanzheng
Published: (2025)

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
by: Yuan, Haobo, et al.
Published: (2025)

Efficiently matching random inhomogeneous graphs via degree profiles
by: Ding, Jian, et al.
Published: (2023)

Conditional Panoramic Image Generation via Masked Autoregressive Modeling
by: Wang, Chaoyang, et al.
Published: (2025)

AMCEN: An Attention Masking-based Contrastive Event Network for Two-stage Temporal Knowledge Graph Reasoning
by: Yang, Jing, et al.
Published: (2024)

A Novel Shape Guided Transformer Network for Instance Segmentation in Remote Sensing Images
by: Yu, Dawen, et al.
Published: (2024)

DriveFine: Refining-Augmented Masked Diffusion VLA for Precise and Robust Driving
by: Dang, Chenxu, et al.
Published: (2026)

Segment Any 4D Gaussians
by: Ji, Shengxiang, et al.
Published: (2024)

Pixel-SAIL: Single Transformer For Pixel-Grounded Understanding
by: Zhang, Tao, et al.
Published: (2025)

Chinese ModernBERT with Whole-Word Masking
by: Zhao, Zeyu, et al.
Published: (2025)

Lysine Acetyltransferase 6 in Health and Disease
by: Yujing Tan, et al.
Published: (2025)

DreamSwapV: Mask-guided Subject Swapping for Any Customized Video Editing
by: Wang, Weitao, et al.
Published: (2025)

Towards Cross-Table Masked Pretraining for Web Data Mining
by: Ye, Chao, et al.
Published: (2023)

Preparation, Characterization and Antioxidant Effects on Processed Sausages of Ultrafine Green Tea Powder Emulsions
by: Xin Tao, et al.
Published: (2026)

Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation
by: Wu, Shengqiong, et al.
Published: (2025)

You Can't Ignore Either: Unifying Structure and Feature Denoising for Robust Graph Learning
by: Yang, Tianmeng, et al.
Published: (2024)

A note about why deep learning is deep: A discontinuous approximation perspective
by: Yongxin Li, et al.
Published: (2024)

SPICE: Leveraging Soft Probabilistic Causal Intervention for Breast Ultrasound Tumor Segmentation
by: Haobo Chen, et al.
Published: (2026)

Any Labor Union Can Represent Any Unit
Published: (2024)

Deliberative Reasoning Network: An Uncertainty-Driven Paradigm for Belief-Tracked Inference with Pretrained Language Models
by: Xu, Anran, et al.
Published: (2025)

MAGREF: Masked Guidance for Any-Reference Video Generation with Subject Disentanglement
by: Deng, Yufan, et al.
Published: (2025)

Any Model, Any Place, Any Time: Get Remote Sensing Foundation Model Embeddings On Demand
by: Ye, Dingqi, et al.
Published: (2026)

Sharp asymptotics of disconnection time of large cylinders by simple and biased random walks
by: Li, Xinyi, et al.
Published: (2024)