| Main Authors: | Kang, Caixin, Yan, Tianyu, Gong, Sitong, Zhang, Mingfang, Ouyang, Liangyang, Liu, Ruicong, Zheng, Bo, Lu, Huchuan, Zhang, Kaipeng, Sato, Yoichi, Huang, Yifei |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2605.22109 |
Similar Items
Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
by: Kang, Caixin, et al.
Published: (2025)
Can MLLMs Read the Room? A Multimodal Benchmark for Verifying Truthfulness in Multi-Party Social Interactions
by: Kang, Caixin, et al.
Published: (2025)
SocialDirector: Training-Free Social Interaction Control for Multi-Person Video Generation
by: Ouyang, Liangyang, et al.
Published: (2026)
SFHand: Learning Embodied Manipulation by Streaming Egocentric 3D Hand Forecasting
by: Liu, Ruicong, et al.
Published: (2025)
Multi-speaker Attention Alignment for Multimodal Social Interaction
by: Ouyang, Liangyang, et al.
Published: (2025)
Egocentric Action-aware Inertial Localization in Point Clouds with Vision-Language Guidance
by: Zhang, Mingfang, et al.
Published: (2025)
Living the Novel: A System for Generating Self-Training Timeline-Aware Conversational Agents from Novels
by: Huang, Yifei, et al.
Published: (2025)
ActionVOS: Actions as Prompts for Video Object Segmentation
by: Ouyang, Liangyang, et al.
Published: (2024)
Masked Video and Body-worn IMU Autoencoder for Egocentric Action Recognition
by: Zhang, Mingfang, et al.
Published: (2024)
Single-to-Dual-View Adaptation for Egocentric 3D Hand Pose Estimation
by: Liu, Ruicong, et al.
Published: (2024)
Leveraging RGB Images for Pre-Training of Event-Based Hand Pose Estimation
by: Liu, Ruicong, et al.
Published: (2025)
CaST-Bench: Benchmarking Causal Chain-Grounded Spatio-Temporal Reasoning for Video Question Answering
by: Zhang, Mingfang, et al.
Published: (2026)
Leadership Assessment in Pediatric Intensive Care Unit Team Training
by: Ouyang, Liangyang, et al.
Published: (2025)
Pre-Training for 3D Hand Pose Estimation with Contrastive Learning on Large-Scale Hand Images in the Wild
by: Lin, Nie, et al.
Published: (2024)
LORE: Latent Optimization for Precise Semantic Control in Rectified Flow-based Image Editing
by: Ouyang, Liangyang, et al.
Published: (2025)
Complementary and Contrastive Learning for Audio-Visual Segmentation
by: Gong, Sitong, et al.
Published: (2025)
Towards Interactive Intelligence for Digital Humans
by: Cai, Yiyi, et al.
Published: (2025)
AssemblyHands-X: Modeling 3D Hand-Body Coordination for Understanding Bimanual Human Activities
by: Banno, Tatsuro, et al.
Published: (2025)
SiMHand: Mining Similar Hands for Large-Scale 3D Hand Pose Pre-training
by: Lin, Nie, et al.
Published: (2025)
The N-Body Problem: Parallel Execution from Single-Person Egocentric Video
by: Zhu, Zhifan, et al.
Published: (2025)
Parameter Aware Mamba Model for Multi-task Dense Prediction
by: Yu, Xinzhuo, et al.
Published: (2025)
Reinforcing Video Reasoning Segmentation to Think Before It Segments
by: Gong, Sitong, et al.
Published: (2025)
The Devil is in Temporal Token: High Quality Video Reasoning Segmentation
by: Gong, Sitong, et al.
Published: (2025)
AVS-Mamba: Exploring Temporal and Multi-modal Mamba for Audio-Visual Segmentation
by: Gong, Sitong, et al.
Published: (2025)
Enhancing Impression Change Prediction in Speed Dating Simulations Based on Speakers' Personalities
by: Matsuo, Kazuya, et al.
Published: (2025)
Can MLLMs Understand the Deep Implication Behind Chinese Images?
by: Zhang, Chenhao, et al.
Published: (2024)
MLLMs-Augmented Visual-Language Representation Learning
by: Liu, Yanqing, et al.
Published: (2023)
Enhancing Representation Learning of EEG Data with Masked Autoencoders
by: Zhou, Yifei, et al.
Published: (2024)
Prompt and Prejudice
by: Berlincioni, Lorenzo, et al.
Published: (2024)
Linking Perception, Confidence and Accuracy in MLLMs
by: Du, Yuetian, et al.
Published: (2026)
Subjective Face Transform using Human First Impressions
by: Roygaga, Chaitanya, et al.
Published: (2023)
Beyond First Impressions: Integrating Joint Multi-modal Cues for Comprehensive 3D Representation
by: Wang, Haowei, et al.
Published: (2023)
Fantastic Animals and Where to Find Them: Segment Any Marine Animal with Dual SAM
by: Zhang, Pingping, et al.
Published: (2024)
Multi-Scale and Detail-Enhanced Segment Anything Model for Salient Object Detection
by: Gao, Shixuan, et al.
Published: (2024)
Human-Aligned Bench: Fine-Grained Assessment of Reasoning Ability in MLLMs vs. Humans
by: Qiu, Yansheng, et al.
Published: (2025)
Can Impressions of Music be Extracted from Thumbnail Images?
by: Harada, Takashi, et al.
Published: (2025)
Can MLLMs Perform Text-to-Image In-Context Learning?
by: Zeng, Yuchen, et al.
Published: (2024)
LATex: Leveraging Attribute-based Text Knowledge for Aerial-Ground Person Re-Identification
by: Zhang, Pingping, et al.
Published: (2025)
X-ReID: Multi-granularity Information Interaction for Video-Based Visible-Infrared Person Re-Identification
by: Yu, Chenyang, et al.
Published: (2025)
Coarse-to-Fine Personalized LLM Impressions for Streamlined Radiology Reports
by: Sun, Chengbo, et al.
Published: (2025)