| Main Authors: | Liu, Rex, Liu, Xin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2408.04243 |
Similar Items
CAV-MAE Sync: Improving Contrastive Audio-Visual Mask Autoencoders via Fine-Grained Alignment
by: Araujo, Edson, et al.
Published: (2025)
PAME: Self-Supervised Masked Autoencoder for No-Reference Point Cloud Quality Assessment
by: Shan, Ziyu, et al.
Published: (2024)
FreeMask: Rethinking the Importance of Attention Masks for Zero-Shot Video Editing
by: Cai, Lingling, et al.
Published: (2024)
MSRS: Training Multimodal Speech Recognition Models from Scratch with Sparse Mask Optimization
by: Fernandez-Lopez, Adriana, et al.
Published: (2024)
Connecting Giants: Synergistic Knowledge Transfer of Large Multimodal Models for Few-Shot Learning
by: Tang, Hao, et al.
Published: (2025)
One Framework to Rule Them All: Unifying Multimodal Tasks with LLM Neural-Tuning
by: Sun, Hao, et al.
Published: (2024)
Detached and Interactive Multimodal Learning
by: Fan, Yunfeng, et al.
Published: (2024)
Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
by: Zeng, Donghuo, et al.
Published: (2026)
Talking Head Generation Driven by Speech-Related Facial Action Units and Audio- Based on Multimodal Representation Fusion
by: Chen, Sen, et al.
Published: (2022)
SMC++: Masked Learning of Unsupervised Video Semantic Compression
by: Tian, Yuan, et al.
Published: (2024)
Zero-Shot Character Identification and Speaker Prediction in Comics via Iterative Multimodal Fusion
by: Li, Yingxuan, et al.
Published: (2024)
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
by: Cai, Haonan, et al.
Published: (2026)
XY-Cut++: Advanced Layout Ordering via Hierarchical Mask Mechanism on a Novel Benchmark
by: Liu, Shuai, et al.
Published: (2025)
Scaling and Masking: A New Paradigm of Data Sampling for Image and Video Quality Assessment
by: Liu, Yongxu, et al.
Published: (2024)
CONSTANT: Towards High-Quality One-Shot Handwriting Generation with Patch Contrastive Enhancement and Style-Aware Quantization
by: Le, Anh-Duy, et al.
Published: (2026)
Kubrick: Multimodal Agent Collaborations for Synthetic Video Generation
by: He, Liu, et al.
Published: (2024)
MHAD: Multimodal Home Activity Dataset with Multi-Angle Videos and Synchronized Physiological Signals
by: Yu, Lei, et al.
Published: (2024)
FedVideoMAE: Efficient Privacy-Preserving Federated Video Moderation
by: Tao, Ziyuan, et al.
Published: (2025)
LinMU: Multimodal Understanding Made Linear
by: Wang, Hongjie, et al.
Published: (2026)
Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
by: Liao, Liwei, et al.
Published: (2025)
Fine-grained Textual Inversion Network for Zero-Shot Composed Image Retrieval
by: Lin, Haoqiang, et al.
Published: (2025)
Generalizable Deepfake Detection Based on Forgery-aware Layer Masking and Multi-artifact Subspace Decomposition
by: Zhang, Xiang, et al.
Published: (2026)
MotionDreamer: One-to-Many Motion Synthesis with Localized Generative Masked Transformer
by: Wang, Yilin, et al.
Published: (2025)
Can Multimodal Large Language Models Understand Spatial Relations?
by: Liu, Jingping, et al.
Published: (2025)
OneDiff: A Generalist Model for Image Difference Captioning
by: Hu, Erdong, et al.
Published: (2024)
A Dual-Module Denoising Approach with Curriculum Learning for Enhancing Multimodal Aspect-Based Sentiment Analysis
by: Van Doan, Nguyen, et al.
Published: (2024)
Principled Multimodal Representation Learning
by: Liu, Xiaohao, et al.
Published: (2025)
DreamArtist++: Controllable One-Shot Text-to-Image Generation via Positive-Negative Adapter
by: Dong, Ziyi, et al.
Published: (2022)
VCoME: Verbal Video Composition with Multimodal Editing Effects
by: Gong, Weibo, et al.
Published: (2024)
Incorporating Visual Experts to Resolve the Information Loss in Multimodal Large Language Models
by: He, Xin, et al.
Published: (2024)
Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)
Advancing Unsupervised Low-light Image Enhancement: Noise Estimation, Illumination Interpolation, and Self-Regulation
by: Liu, Xiaofeng, et al.
Published: (2023)
RoboTron-Drive: All-in-One Large Multimodal Model for Autonomous Driving
by: Huang, Zhijian, et al.
Published: (2024)
Graph-Driven Multimodal Feature Learning Framework for Apparent Personality Assessment
by: Wang, Kangsheng, et al.
Published: (2025)
LMM4Edit: Benchmarking and Evaluating Multimodal Image Editing with LMMs
by: Xu, Zitong, et al.
Published: (2025)
Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots
by: Zheng, Guangting, et al.
Published: (2025)
DAE-Talker: High Fidelity Speech-Driven Talking Face Generation with Diffusion Autoencoder
by: Du, Chenpeng, et al.
Published: (2023)
MEDTalk: Multimodal Controlled 3D Facial Animation with Dynamic Emotions by Disentangled Embedding
by: Liu, Chang, et al.
Published: (2025)
M2ORT: Many-To-One Regression Transformer for Spatial Transcriptomics Prediction from Histopathology Images
by: Wang, Hongyi, et al.
Published: (2024)
QPT V2: Masked Image Modeling Advances Visual Scoring
by: Xie, Qizhi, et al.
Published: (2024)