:: Library Catalog

Bibliographic Details
Main Authors:	Lee, Sangmin; Lai, Bolin; Ryan, Fiona; Boote, Bikram; Rehg, James M.
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Computation and Language; Machine Learning
Online Access:	https://arxiv.org/abs/2403.02090

Similar Items

GRASP: Learning to Ground Social Reasoning in Multi-Person Non-Verbal Interactions
by: Kim, Junho, et al.
Published: (2026)

SocialGesture: Delving into Multi-person Gesture Understanding
by: Cao, Xu, et al.
Published: (2025)

MEBench: A Novel Benchmark for Understanding Mutual Exclusivity Bias in Vision-Language Models
by: Thai, Anh, et al.
Published: (2025)

In the Eye of Transformer: Global-Local Correlation for Egocentric Gaze Estimation
by: Lai, Bolin, et al.
Published: (2022)

Leveraging Object Priors for Point Tracking
by: Boote, Bikram, et al.
Published: (2024)

Listen to Look into the Future: Audio-Visual Egocentric Gaze Anticipation
by: Lai, Bolin, et al.
Published: (2023)

Unified Text-Image-to-Video Generation: A Training-Free Approach to Flexible Visual Conditioning
by: Lai, Bolin, et al.
Published: (2025)

Gaze-LLE: Gaze Target Estimation via Large-Scale Learned Encoders
by: Ryan, Fiona, et al.
Published: (2024)

Towards Online Multi-Modal Social Interaction Understanding
by: Li, Xinpeng, et al.
Published: (2025)

Omni-MMSI: Toward Identity-attributed Social Interaction Understanding
by: Li, Xinpeng, et al.
Published: (2026)

Towards Social AI: A Survey on Understanding Social Interactions
by: Lee, Sangmin, et al.
Published: (2024)

Unleashing In-context Learning of Autoregressive Models for Few-shot Image Manipulation
by: Lai, Bolin, et al.
Published: (2024)

Learning Predictive Visuomotor Coordination
by: Jia, Wenqi, et al.
Published: (2025)

LEGO: Learning EGOcentric Action Frame Generation via Visual Instruction Tuning
by: Lai, Bolin, et al.
Published: (2023)

$λ$-ECLIPSE: Multi-Concept Personalized Text-to-Image Diffusion Models by Leveraging CLIP Latent Space
by: Patel, Maitreya, et al.
Published: (2024)

AlignMMBench: Evaluating Chinese Multimodal Alignment in Large Vision-Language Models
by: Wu, Yuhang, et al.
Published: (2024)

Toward Diffusible High-Dimensional Latent Spaces: A Frequency Perspective
by: Lai, Bolin, et al.
Published: (2025)

Bilingual Text-to-Motion Generation: A New Benchmark and Baselines
by: Weng, Wanjiang, et al.
Published: (2026)

MSR-Align: Policy-Grounded Multimodal Alignment for Safety-Aware Reasoning in Vision-Language Models
by: Xia, Yinan, et al.
Published: (2025)

MM-SpuBench: Towards Better Understanding of Spurious Biases in Multimodal LLMs
by: Ye, Wenqian, et al.
Published: (2024)

Localization vs. Semantics: Visual Representations in Unimodal and Multimodal Models
by: Li, Zhuowan, et al.
Published: (2022)

Improving Personalized Search with Regularized Low-Rank Parameter Updates
by: Ryan, Fiona, et al.
Published: (2025)

EVALALIGN: Supervised Fine-Tuning Multimodal LLMs with Human-Aligned Data for Evaluating Text-to-Image Models
by: Tan, Zhiyu, et al.
Published: (2024)

Detecting Offensive Memes with Social Biases in Singapore Context Using Multimodal Large Language Models
by: Yuxuan, Cao, et al.
Published: (2025)

Can MLLMs Read the Room? A Multimodal Benchmark for Assessing Deception in Multi-Party Social Interactions
by: Kang, Caixin, et al.
Published: (2025)

Fine-Grained Preference Optimization Improves Spatial Reasoning in VLMs
by: Shen, Yifan, et al.
Published: (2025)

SceneAlign: Aligning Multimodal Reasoning to Scene Graphs in Complex Visual Scenes
by: Wang, Chuhan, et al.
Published: (2026)

Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
by: Chen, Boqi, et al.
Published: (2026)

GLOS: Sign Language Generation with Temporally Aligned Gloss-Level Conditioning
by: Lee, Taeryung, et al.
Published: (2025)

MME-SCI: A Comprehensive and Challenging Science Benchmark for Multimodal Large Language Models
by: Ruan, Jiacheng, et al.
Published: (2025)

Human Semantic Representations of Social Interactions from Moving Shapes
by: Yun, Yiling, et al.
Published: (2025)

Unleashing Spatial Reasoning in Multimodal Large Language Models via Textual Representation Guided Reasoning
by: Hua, Jiacheng, et al.
Published: (2026)

Humor in Pixels: Benchmarking Large Multimodal Models Understanding of Online Comics
by: Ryan, Yuriel, et al.
Published: (2025)

From Instructions to Assistance: a Dataset Aligning Instruction Manuals with Assembly Videos for Evaluating Multimodal LLMs
by: Toschi, Federico, et al.
Published: (2026)

See It All: Contextualized Late Aggregation for 3D Dense Captioning
by: Kim, Minjung, et al.
Published: (2024)

PP-DocBee2: Improved Baselines with Efficient Data for Multimodal Document Understanding
by: Huang, Kui, et al.
Published: (2025)

Black-Box Visual Prompt Engineering for Mitigating Object Hallucination in Large Vision Language Models
by: Woo, Sangmin, et al.
Published: (2025)

Do Large Multimodal Models Solve Caption Generation for Scientific Figures? Lessons Learned from SciCap Challenge 2023
by: Hsu, Ting-Yao E., et al.
Published: (2025)

SemIRNet: A Semantic Irony Recognition Network for Multimodal Sarcasm Detection
by: Zhou, Jingxuan, et al.
Published: (2025)

Toward Interactive Regional Understanding in Vision-Large Language Models
by: Lee, Jungbeom, et al.
Published: (2024)