| Main Authors: | Lee, Sangmin, Lai, Bolin, Ryan, Fiona, Boote, Bikram, Rehg, James M. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2403.02090 |
Similar Items
GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
by: Kim, Junho, et al.
Published: (2026)
SocialGesture: Delving into Multi-person Gesture Understanding
by: Cao, Xu, et al.
Published: (2025)
MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models
by: Thai, Anh, et al.
Published: (2025)
In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation
by: Lai, Bolin, et al.
Published: (2022)
Leveraging Object Priors for Point Tracking
by: Boote, Bikram, et al.
Published: (2024)
Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
by: Lai, Bolin, et al.
Published: (2023)
Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning
by: Lai, Bolin, et al.
Published: (2025)
Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
by: Ryan, Fiona, et al.
Published: (2024)
Towards Online Multi-Modal Social Interaction Understanding
by: Li, Xinpeng, et al.
Published: (2025)
Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
by: Li, Xinpeng, et al.
Published: (2026)
Towards Social AI: A Survey on Understanding Social Interactions
by: Lee, Sangmin, et al.
Published: (2024)
Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
by: Lai, Bolin, et al.
Published: (2024)
Learning Predictive Visuomotor Coordination
by: Jia, Wenqi, et al.
Published: (2025)
LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
by: Lai, Bolin, et al.
Published: (2023)
$λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
by: Patel, Maitreya, et al.
Published: (2024)
AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
by: Wu, Yuhang, et al.
Published: (2024)
Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
by: Lai, Bolin, et al.
Published: (2025)
Bilingual Text-to-Motion Generation: A New Benchmark and Baselines
by: Weng, Wanjiang, et al.
Published: (2026)
MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
by: Xia, Yinan, et al.
Published: (2025)
MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs
by: Ye, Wenqian, et al.
Published: (2024)
Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
by: Li, Zhuowan, et al.
Published: (2022)
Improving Personalized Search with Regularized Low-Rank Parameter Updates
by: Ryan, Fiona, et al.
Published: (2025)
EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
by: Tan, Zhiyu, et al.
Published: (2024)
Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models
by: Yuxuan, Cao, et al.
Published: (2025)
Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
by: Kang, Caixin, et al.
Published: (2025)
Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
by: Shen, Yifan, et al.
Published: (2025)
SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes
by: Wang, Chuhan, et al.
Published: (2026)
Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
by: Chen, Boqi, et al.
Published: (2026)
GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning
by: Lee, Taeryung, et al.
Published: (2025)
MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
by: Ruan, Jiacheng, et al.
Published: (2025)
Human Semantic Representations of Social Interactions from Moving Shapes
by: Yun, Yiling, et al.
Published: (2025)
Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
by: Hua, Jiacheng, et al.
Published: (2026)
Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
by: Ryan, Yuriel, et al.
Published: (2025)
From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs
by: Toschi, Federico, et al.
Published: (2026)
See It All: Contextualized Late Aggregation for 3D Dense Captioning
by: Kim, Minjung, et al.
Published: (2024)
PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
by: Huang, Kui, et al.
Published: (2025)
Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
by: Woo, Sangmin, et al.
Published: (2025)
Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
by: Hsu, Ting-Yao E., et al.
Published: (2025)
SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection
by: Zhou, Jingxuan, et al.
Published: (2025)
Toward Interactive Regional Understanding in Vision-Large Language Models
by: Lee, Jungbeom, et al.
Published: (2024)