| Main Authors: | Lee, Byung-Kwan, Hachiuma, Ryo, Ro, Yong Man, Wang, Yu-Chiang Frank, Wu, Yueh-Hua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2510.19307 |
Similar Items
VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models
by: Lee, Byung-Kwan, et al.
Published: (2025)
GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
by: Lee, Byung-Kwan, et al.
Published: (2025)
SPARK: Multi-Vision Sensor Perception and Reasoning Benchmark for Large-scale Vision-Language Models
by: Yu, Youngjoon, et al.
Published: (2024)
Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
MoAI: Mixture of All Intelligence for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
Phantom of Latent for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
CoLLaVO: Crayon Large Language and Vision mOdel
by: Lee, Byung-Kwan, et al.
Published: (2024)
TroL: Traversal of Layers for Large Language and Vision Models
by: Lee, Byung-Kwan, et al.
Published: (2024)
Causal Unsupervised Semantic Segmentation
by: Kim, Junho, et al.
Published: (2023)
Zoom-Zero: Reinforced Coarse-to-Fine Video Understanding via Temporal Zoom-in
by: Shen, Xiaoqian, et al.
Published: (2025)
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning
by: Huang, Chi-Pin, et al.
Published: (2025)
V2V-LLM: Vehicle-to-Vehicle Cooperative Autonomous Driving with Multimodal Large Language Models
by: Chiu, Hsu-kuang, et al.
Published: (2025)
VIOLA: Towards Video In-Context Learning with Minimal Annotations
by: Fujii, Ryo, et al.
Published: (2026)
Omni-RGPT: Unifying Image and Video Region-level Understanding via Token Marks
by: Heo, Miran, et al.
Published: (2025)
RealTraj: Towards Real-World Pedestrian Trajectory Forecasting
by: Fujii, Ryo, et al.
Published: (2024)
Multimodal Cross-Domain Few-Shot Learning for Egocentric Action Recognition
by: Hatano, Masashi, et al.
Published: (2024)
SANER: Annotation-free Societal Attribute Neutralizer for Debiasing CLIP
by: Hirota, Yusuke, et al.
Published: (2024)
Interpretable Debiasing of Vision-Language Models for Social Fairness
by: An, Na Min, et al.
Published: (2026)
Autoregressive Universal Video Segmentation Model
by: Heo, Miran, et al.
Published: (2025)
Enhanced Vision-Language Models for Diverse Sensor Understanding: Cost-Efficient Optimization and Benchmarking
by: Chung, Sangyun, et al.
Published: (2024)
Remote Sensing Large Vision-Language Model: Semantic-augmented Multi-level Alignment and Semantic-aware Expert Modeling
by: Park, Sungjune, et al.
Published: (2025)
Weakly Semi-supervised Tool Detection in Minimally Invasive Surgery Videos
by: Fujii, Ryo, et al.
Published: (2024)
CrowdMAC: Masked Crowd Density Completion for Robust Crowd Density Forecasting
by: Fujii, Ryo, et al.
Published: (2024)
Human Preference-Aligned Concept Customization Benchmark via Decomposed Evaluation
by: Ishikawa, Reina, et al.
Published: (2025)
EMAG: Ego-motion Aware and Generalizable 2D Hand Forecasting from Egocentric Videos
by: Hatano, Masashi, et al.
Published: (2024)
4D-RGPT: Toward Region-level 4D Understanding via Perceptual Distillation
by: Yang, Chiao-An, et al.
Published: (2025)
ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding
by: Lee, Hosu, et al.
Published: (2025)
Integrating Language-Derived Appearance Elements with Visual Cues in Pedestrian Detection
by: Park, Sungjune, et al.
Published: (2023)
Language-guided Learning for Object Detection Tackling Multiple Variations in Aerial Images
by: Park, Sungjune, et al.
Published: (2025)
Towards Predicting Any Human Trajectory In Context
by: Fujii, Ryo, et al.
Published: (2025)
LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
by: Hirota, Yusuke, et al.
Published: (2025)
Revisiting Misalignment in Multispectral Pedestrian Detection: A Language-Driven Approach for Cross-modal Alignment Fusion
by: Kim, Taeheon, et al.
Published: (2024)
QuarterMap: Efficient Post-Training Token Pruning for Visual State Space Models
by: Chi, Tien-Yu, et al.
Published: (2025)
Robust Egocentric Visual Attention Prediction Through Language-guided Scene Context-aware Learning
by: Park, Sungjune, et al.
Published: (2026)
Histopathology Image Report Generation by Vision Language Model with Multimodal In-Context Learning
by: Liu, Shih-Wen, et al.
Published: (2025)
Robust Pedestrian Detection via Constructing Versatile Pedestrian Knowledge Bank
by: Park, Sungjune, et al.
Published: (2024)
From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment
by: Hirota, Yusuke, et al.
Published: (2024)
Learning from Synthetic Data via Provenance-Based Input Gradient Guidance
by: Nagano, Koshiro, et al.
Published: (2026)
SALOVA: Segment-Augmented Long Video Assistant for Targeted Retrieval and Routing in Long-Form Video Analysis
by: Kim, Junho, et al.
Published: (2024)