:: Library Catalog

Bibliographic Details
Main Authors:	Liu, Rex; Liu, Xin
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Multimedia
Online Access:	https://arxiv.org/abs/2408.04243

Similar Items

CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
by: Araujo, Edson, et al.
Published: (2025)

PAME: Self-Supervised Masked Autoencoder for No-Reference Point Cloud Quality Assessment
by: Shan, Ziyu, et al.
Published: (2024)

FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
by: Cai, Lingling, et al.
Published: (2024)

MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
by: Fernandez-Lopez, Adriana, et al.
Published: (2024)

Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning
by: Tang, Hao, et al.
Published: (2025)

One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning
by: Sun, Hao, et al.
Published: (2024)

Detached and Interactive Multimodal Learning
by: Fan, Yunfeng, et al.
Published: (2024)

Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
by: Zeng, Donghuo, et al.
Published: (2026)

Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion
by: Chen, Sen, et al.
Published: (2022)

SMC++: Masked Learning of Unsupervised Video Semantic Compression
by: Tian, Yuan, et al.
Published: (2024)

Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
by: Li, Yingxuan, et al.
Published: (2024)

Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
by: Cai, Haonan, et al.
Published: (2026)

XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark
by: Liu, Shuai, et al.
Published: (2025)

Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment
by: Liu, Yongxu, et al.
Published: (2024)

CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization
by: Le, Anh-Duy, et al.
Published: (2026)

Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
by: He, Liu, et al.
Published: (2024)

MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals
by: Yu, Lei, et al.
Published: (2024)

FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation
by: Tao, Ziyuan, et al.
Published: (2025)

LinMU: Multimodal Understanding Made Linear
by: Wang, Hongjie, et al.
Published: (2026)

Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
by: Liao, Liwei, et al.
Published: (2025)

Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
by: Lin, Haoqiang, et al.
Published: (2025)

Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition
by: Zhang, Xiang, et al.
Published: (2026)

MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer
by: Wang, Yilin, et al.
Published: (2025)

Can Multimodal Large Language Models Understand Spatial Relations?
by: Liu, Jingping, et al.
Published: (2025)

OneDiff: A Generalist Model for Image Difference Captioning
by: Hu, Erdong, et al.
Published: (2024)

A Dual-Module Denoising Approach with Curriculum Learning for Enhancing Multimodal Aspect-Based Sentiment Analysis
by: Van Doan, Nguyen, et al.
Published: (2024)

Principled Multimodal Representation Learning
by: Liu, Xiaohao, et al.
Published: (2025)

DreamArtist++: Controllable One-Shot Text-to-Image Generation via Positive-Negative Adapter
by: Dong, Ziyi, et al.
Published: (2022)

VCoME: Verbal Video Composition with Multimodal Editing Effects
by: Gong, Weibo, et al.
Published: (2024)

Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
by: He, Xin, et al.
Published: (2024)

Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)

Advancing Unsupervised Low-light Image Enhancement: Noise Estimation, Illumination Interpolation, and Self-Regulation
by: Liu, Xiaofeng, et al.
Published: (2023)

RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
by: Huang, Zhijian, et al.
Published: (2024)

Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment
by: Wang, Kangsheng, et al.
Published: (2025)

LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
by: Xu, Zitong, et al.
Published: (2025)

Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots
by: Zheng, Guangting, et al.
Published: (2025)

DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
by: Du, Chenpeng, et al.
Published: (2023)

MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding
by: Liu, Chang, et al.
Published: (2025)

M2ORT: Many-To-One Regression Transformer for Spatial Transcriptomics Prediction from Histopathology Images
by: Wang, Hongyi, et al.
Published: (2024)

QPT V2: Masked Image Modeling Advances Visual Scoring
by: Xie, Qizhi, et al.
Published: (2024)