| Main Authors: | Yang, Chiao-An, Hachiuma, Ryo, Liu, Sifei, Radhakrishnan, Subhashree, Yeh, Raymond A., Wang, Yu-Chiang Frank, Chen, Min-Hung |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2512.17012 |
Similar Items
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
by: Heo, Miran, et al.
Published: (2025)
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
by: Shen, Xiaoqian, et al.
Published: (2025)
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
by: Lee, Byung-Kwan, et al.
Published: (2025)
SpatialRGPT: Grounded Spatial Reasoning in Vision Language Models
by: Cheng, An-Chieh, et al.
Published: (2024)
Multi-Object 3D Grounding with Dynamic Modules and Language-Informed Spatial Attention
by: Zhang, Haomeng, et al.
Published: (2024)
SANER: Annotation-free Societal Attribute Neutralizer for Debiasing CLIP
by: Hirota, Yusuke, et al.
Published: (2024)
3D Aware Region Prompted Vision Language Model
by: Cheng, An-Chieh, et al.
Published: (2025)
3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
by: Zhen, Haoyu, et al.
Published: (2026)
RealTraj: Towards Real-World Pedestrian Trajectory Forecasting
by: Fujii, Ryo, et al.
Published: (2024)
VIOLA: Towards Video In-Context Learning with Minimal Annotations
by: Fujii, Ryo, et al.
Published: (2026)
Towards Predicting Any Human Trajectory In Context
by: Fujii, Ryo, et al.
Published: (2025)
Heatmap Regression without Soft-Argmax for Facial Landmark Detection
by: Yang, Chiao-An, et al.
Published: (2025)
Autoregressive Universal Video Segmentation Model
by: Heo, Miran, et al.
Published: (2025)
Toward Long-Tailed Online Anomaly Detection through Class-Agnostic Concepts
by: Yang, Chiao-An, et al.
Published: (2025)
EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos
by: Hatano, Masashi, et al.
Published: (2024)
V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models
by: Chiu, Hsu-kuang, et al.
Published: (2025)
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
by: Huang, De-An, et al.
Published: (2025)
Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation
by: Ishikawa, Reina, et al.
Published: (2025)
Deep Nets with Subsampling Layers Unwittingly Discard Useful Activations at Test-Time
by: Yang, Chiao-An, et al.
Published: (2024)
Weakly Semi-supervised Tool Detection in Minimally Invasive Surgery Videos
by: Fujii, Ryo, et al.
Published: (2024)
CrowdMAC: Masked Crowd Density Completion for Robust Crowd Density Forecasting
by: Fujii, Ryo, et al.
Published: (2024)
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
Unified Reinforcement and Imitation Learning for Vision-Language Models
by: Lee, Byung-Kwan, et al.
Published: (2025)
Learning to Obstruct Few-Shot Image Classification over Restricted Classes
by: Zheng, Amber Yijia, et al.
Published: (2024)
RegionGPT: Towards Region Understanding Vision Language Model
by: Guo, Qiushan, et al.
Published: (2024)
Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition
by: Hatano, Masashi, et al.
Published: (2024)
ShapeGen4D: Towards High Quality 4D Shape Generation from Videos
by: Yenphraphai, Jiraphon, et al.
Published: (2025)
Temporal Prompting Matters: Rethinking Referring Video Object Segmentation
by: Lin, Ci-Siang, et al.
Published: (2025)
GroPrompt: Efficient Grounded Prompting and Adaptation for Referring Video Object Segmentation
by: Lin, Ci-Siang, et al.
Published: (2024)
Helix4D: Complex 4D Mesh Generation
by: Yenphraphai, Jiraphon, et al.
Published: (2026)
PartDistill: 3D Shape Part Segmentation by Vision-Language Model Distillation
by: Umam, Ardian, et al.
Published: (2023)
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
by: Nagano, Koshiro, et al.
Published: (2026)
From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment
by: Hirota, Yusuke, et al.
Published: (2024)
RGBD Objects in the Wild: Scaling Real-World 3D Object Learning from RGB-D Videos
by: Xia, Hongchi, et al.
Published: (2024)
DHQA-4D: Perceptual Quality Assessment of Dynamic 4D Digital Human
by: Li, Yunhao, et al.
Published: (2025)
Predict-Optimize-Distill: A Self-Improving Cycle for 4D Object Understanding
by: Wu, Mingxuan, et al.
Published: (2025)
GazeNLQ @ Ego4D Natural Language Queries Challenge 2025
by: Lin, Wei-Cheng, et al.
Published: (2025)
Interpretable Debiasing of Vision-Language Models for Social Fairness
by: An, Na Min, et al.
Published: (2026)
Self-Improving 4D Perception via Self-Distillation
by: Huang, Nan, et al.
Published: (2026)
Toward Scene Graph and Layout Guided Complex 3D Scene Generation
by: Huang, Yu-Hsiang, et al.
Published: (2024)