| Main Authors: | Wang, Jiahao, Xu, Weiye, Yang, Aijun, Zhou, Wengang, Lu, Lewei, Li, Houqiang, Wang, Xiaohua, Zhu, Jinguo |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2511.10648 |
Similar Items
VisuLogic: A Benchmark for Evaluating Visual Reasoning in Multi-modal Large Language Models
by: Xu, Weiye, et al.
Published: (2025)
MotionRL: Align Text-to-Motion Generation to Human Preferences with Multi-Reward Reinforcement Learning
by: Liu, Xiaoyang, et al.
Published: (2024)
AdaptVision: Dynamic Input Scaling in MLLMs for Versatile Scene Understanding
by: Wang, Yonghui, et al.
Published: (2024)
Power-LLaVA: Large Language and Vision Assistant for Power Transmission Line Inspection
by: Wang, Jiahao, et al.
Published: (2024)
Self-Supervised Representation Learning with Spatial-Temporal Consistency for Sign Language Recognition
by: Zhao, Weichao, et al.
Published: (2024)
P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task
by: Xu, Weiye, et al.
Published: (2024)
Spatial Preference Rewarding for MLLMs Spatial Understanding
by: Qiu, Han, et al.
Published: (2025)
Cross-Modal Consistency Learning for Sign Language Recognition
by: Wu, Kepeng, et al.
Published: (2025)
GaussNav: Gaussian Splatting for Visual Navigation
by: Lei, Xiaohan, et al.
Published: (2024)
Revisiting Shadow Detection from a Vision-Language Perspective
by: Wang, Yonghui, et al.
Published: (2026)
StepVAR: Structure-Texture Guided Pruning for Visual Autoregressive Models
by: Liu, Keli, et al.
Published: (2026)
Image2Sentence based Asymmetrical Zero-shot Composed Image Retrieval
by: Du, Yongchao, et al.
Published: (2024)
Visual Jigsaw Post-Training Improves MLLMs
by: Wu, Penghao, et al.
Published: (2025)
LaneTCA: Enhancing Video Lane Detection with Temporal Context Aggregation
by: Zhou, Keyi, et al.
Published: (2024)
TextCoT: Zoom In for Enhanced Multimodal Text-Rich Image Understanding
by: Luan, Bozhi, et al.
Published: (2024)
SEDS: Semantically Enhanced Dual-Stream Encoder for Sign Language Retrieval
by: Jiang, Longtao, et al.
Published: (2024)
Self-Classification Enhancement and Correction for Weakly Supervised Object Detection
by: Yin, Yufei, et al.
Published: (2025)
Video-based Sign Language Recognition without Temporal Segmentation
by: Huang, Jie, et al.
Published: (2018)
Learning Generalizable Human Motion Generator with Reinforcement Learning
by: Mao, Yunyao, et al.
Published: (2024)
StreetSurfGS: Scalable Urban Street Surface Reconstruction with Planar-based Gaussian Splatting
by: Cui, Xiao, et al.
Published: (2024)
SwinShadow: Shifted Window for Ambiguous Adjacent Shadow Detection
by: Wang, Yonghui, et al.
Published: (2024)
Progressive Multi-modal Conditional Prompt Tuning
by: Qiu, Xiaoyu, et al.
Published: (2024)
Motion-aware 3D Gaussian Splatting for Efficient Dynamic Scene Reconstruction
by: Guo, Zhiyang, et al.
Published: (2024)
Structural Action Transformer for 3D Dexterous Manipulation
by: Lei, Xiaohan, et al.
Published: (2026)
Instance-aware Exploration-Verification-Exploitation for Instance ImageGoal Navigation
by: Lei, Xiaohan, et al.
Published: (2024)
ScaleWeaver: Weaving Efficient Controllable T2I Generation with Multi-Scale Reference Attention
by: Liu, Keli, et al.
Published: (2025)
SegmentDreamer: Towards High-fidelity Text-to-3D Synthesis with Segmented Consistency Trajectory Distillation
by: Zhu, Jiahao, et al.
Published: (2025)
Exploiting GPT-4 Vision for Zero-shot Point Cloud Understanding
by: Sun, Qi, et al.
Published: (2024)
Reinforcing Consistency in Video MLLMs with Structured Rewards
by: Quan, Yihao, et al.
Published: (2026)
Robust Multimodal Large Language Models Against Modality Conflict
by: Zhang, Zongmeng, et al.
Published: (2025)
SAIL-RL: Guiding MLLMs in When and How to Think via Dual-Reward RL Tuning
by: Shu, Fangxun, et al.
Published: (2025)
ROOT: VLM based System for Indoor Scene Understanding and Beyond
by: Wang, Yonghui, et al.
Published: (2024)
Forest2Seq: Revitalizing Order Prior for Sequential Indoor Scene Synthesis
by: Sun, Qi, et al.
Published: (2024)
Semantic Image Synthesis via Diffusion Models
by: Zhou, Wengang, et al.
Published: (2022)
Multi-Cue Adaptive Visual Token Pruning for Large Vision-Language Models
by: Luan, Bozhi, et al.
Published: (2025)
MASA: Motion-aware Masked Autoencoder with Semantic Alignment for Sign Language Recognition
by: Zhao, Weichao, et al.
Published: (2024)
DesignDiffusion: High-Quality Text-to-Design Image Generation with Diffusion Models
by: Wang, Zhendong, et al.
Published: (2025)
DeepEraser: Deep Iterative Context Mining for Generic Text Eraser
by: Feng, Hao, et al.
Published: (2024)
Exploiting Spatial-Temporal Context for Interacting Hand Reconstruction on Monocular RGB Video
by: Zhao, Weichao, et al.
Published: (2023)
RoFIR: Robust Fisheye Image Rectification Framework Impervious to Optical Center Deviation
by: Liao, Zhaokang, et al.
Published: (2024)