| Main Authors: | Ji, Yuyang, Shen, Yixuan, Zhu, Shengjie, Kong, Yu, Liu, Feng |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2603.26938 |
Similar Items
BioGait-VLM: A Tri-Modal Vision-Language-Biomechanics Framework for Interpretable Clinical Gait Assessment
by: Chen, Erdong, et al.
Published: (2026)
Non-Colliding Biometric Identities for Digital Entities: Geometry, Capacity, and Million-Scale Virtual Identity Provisioning
by: Ji, Yuyang, et al.
Published: (2026)
IDSelect: A RL-Based Cost-Aware Selection Agent for Video-based Multi-Modal Person Recognition
by: Ji, Yuyang, et al.
Published: (2026)
RePose: A Real-Time 3D Human Pose Estimation and Biomechanical Analysis Framework for Rehabilitation
by: Xue, Junxiao, et al.
Published: (2026)
Grounded 3D-Aware Spatial Vision-Language Modeling
by: Cheng, An-Chieh, et al.
Published: (2026)
VisionCoach: Reinforcing Grounded Video Reasoning via Visual-Perception Prompting
by: Lee, Daeun, et al.
Published: (2026)
From Pixels to Prose: Advancing Multi-Modal Language Models for Remote Sensing
by: Sun, Xintian, et al.
Published: (2024)
Zero-Shot 3D Visual Grounding from Vision-Language Models
by: Li, Rong, et al.
Published: (2025)
AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation
by: Zhu, Yuhan, et al.
Published: (2024)
From Panels to Prose: Generating Literary Narratives from Comics
by: Sachdeva, Ragav, et al.
Published: (2025)
GenCape: Structure-Inductive Generative Modeling for Category-Agnostic Pose Estimation
by: Rao, Jiyong, et al.
Published: (2026)
BioPose: Biomechanically-accurate 3D Pose Estimation from Monocular Videos
by: Koleini, Farnoosh, et al.
Published: (2025)
MS-MANO: Enabling Hand Pose Tracking with Biomechanical Constraints
by: Xie, Pengfei, et al.
Published: (2024)
ChatPose: Chatting about 3D Human Pose
by: Feng, Yao, et al.
Published: (2023)
SKEL-CF: Coarse-to-Fine Biomechanical Skeleton and Surface Mesh Recovery
by: Li, Da, et al.
Published: (2025)
R2G: Reasoning to Ground in 3D Scenes
by: Li, Yixuan, et al.
Published: (2024)
CLIPose: Category-Level Object Pose Estimation with Pre-trained Vision-Language Knowledge
by: Lin, Xiao, et al.
Published: (2024)
Probablistic Restoration with Adaptive Noise Sampling for 3D Human Pose Estimation
by: Zeng, Xianzhou, et al.
Published: (2024)
AirZoo: A Unified Large-Scale Dataset for Grounding Aerial Geometric 3D Vision
by: Cheng, Xiaoya, et al.
Published: (2026)
Rethinking Pose Refinement in 3D Gaussian Splatting under Pose Prior and Geometric Uncertainty
by: Kong, Mangyu, et al.
Published: (2026)
N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
by: Wang, Yuxin, et al.
Published: (2025)
Grounding Foundational Vision Models with 3D Human Poses for Robust Action Recognition
by: Babey, Nicholas, et al.
Published: (2025)
DiffCLIP: Leveraging Stable Diffusion for Language Grounded 3D Classification
by: Shen, Sitian, et al.
Published: (2023)
Marginalized Bundle Adjustment: Multi-View Camera Pose from Monocular Depth Estimates
by: Zhu, Shengjie, et al.
Published: (2026)
From Words to Poses: Enhancing Novel Object Pose Estimation with Vision Language Models
by: Pulli, Tessa, et al.
Published: (2024)
From 2D CAD Drawings to 3D Parametric Models: A Vision-Language Approach
by: Wang, Xilin, et al.
Published: (2024)
OpenCapBench: A Benchmark to Bridge Pose Estimation and Biomechanics
by: Gozlan, Yoni, et al.
Published: (2024)
From Skin to Skeleton: Towards Biomechanically Accurate 3D Digital Humans
by: Keller, Marilyn, et al.
Published: (2025)
HeRO: Hierarchical 3D Semantic Representation for Pose-aware Object Manipulation
by: Xu, Chongyang, et al.
Published: (2026)
HPE-CogVLM: Advancing Vision Language Models with a Head Pose Grounding Task
by: Tian, Yu, et al.
Published: (2024)
TechCoach: Towards Technical-Point-Aware Descriptive Action Coaching
by: Li, Yuan-Ming, et al.
Published: (2024)
PoseFix: Correcting 3D Human Poses with Natural Language
by: Delmas, Ginger, et al.
Published: (2023)
PoseScript: Linking 3D Human Poses and Natural Language
by: Delmas, Ginger, et al.
Published: (2022)
Learning Consistent Temporal Grounding between Related Tasks in Sports Coaching
by: Rai, Arushi, et al.
Published: (2026)
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-Centric 3D Visual Grounding
by: Zheng, Henry, et al.
Published: (2025)
HiSpatial: Taming Hierarchical 3D Spatial Understanding in Vision-Language Models
by: Liang, Huizhi, et al.
Published: (2026)
CLIP-GS: Unifying Vision-Language Representation with 3D Gaussian Splatting
by: Jiao, Siyu, et al.
Published: (2024)
See4D: Pose-Free 4D Generation via Auto-Regressive Video Inpainting
by: Lu, Dongyue, et al.
Published: (2025)
Revisit Self-supervised Depth Estimation with Local Structure-from-Motion
by: Zhu, Shengjie, et al.
Published: (2024)
From Pixels to Prose: A Large Dataset of Dense Image Captions
by: Singla, Vasu, et al.
Published: (2024)