| Main Authors: | Castrejon, Lluis, Mensink, Thomas, Zhou, Howard, Ferrari, Vittorio, Araujo, Andre, Uijlings, Jasper |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.05465 |
Similar Items
HAVIR: HierArchical Vision to Image Reconstruction using CLIP-Guided Versatile Diffusion
by: Zhang, Shiyi, et al.
Published: (2025)
VQA Training Sets are Self-play Environments for Generating Few-shot Pools
by: Misiunas, Tautvydas, et al.
Published: (2024)
MultiModal Action Conditioned Video Generation
by: Li, Yichen, et al.
Published: (2025)
MultiModal Fine-tuning with Synthetic Captions
by: Enomoto, Shohei, et al.
Published: (2026)
MMA-Diffusion: MultiModal Attack on Diffusion Models
by: Yang, Yijun, et al.
Published: (2023)
M3: 3D-Spatial MultiModal Memory
by: Zou, Xueyan, et al.
Published: (2025)
Re-M3Dr: Rebalanced MultiModal Mean Deviation Regression
by: Yin, Haojie, et al.
Published: (2026)
ControlEdit: A MultiModal Local Clothing Image Editing Method
by: Cheng, Di, et al.
Published: (2024)
CapS-Adapter: Caption-based MultiModal Adapter in Zero-Shot Classification
by: Wang, Qijie, et al.
Published: (2024)
MMCORE: MultiModal COnnection with Representation Aligned Latent Embeddings
by: Li, Zijie, et al.
Published: (2026)
VoCap: Video Object Captioning and Segmentation from Any Prompt
by: Uijlings, Jasper, et al.
Published: (2025)
CharGen: High Accurate Character-Level Visual Text Generation Model with MultiModal Encoder
by: Ma, Lichen, et al.
Published: (2024)
ProM3E: Probabilistic Masked MultiModal Embedding Model for Ecology
by: Sastry, Srikumar, et al.
Published: (2025)
Can MLLMs Reason in Multimodality? EMMA: An Enhanced MultiModal ReAsoning Benchmark
by: Hao, Yunzhuo, et al.
Published: (2025)
MMRPT: MultiModal Reinforcement Pre-Training via Masked Vision-Dependent Reasoning
by: Zheng, Xuhui, et al.
Published: (2025)
TeamPath: Building MultiModal Pathology Experts with Reasoning AI Copilots
by: Liu, Tianyu, et al.
Published: (2025)
LongHalQA: Long-Context Hallucination Evaluation for MultiModal Large Language Models
by: Qiu, Han, et al.
Published: (2024)
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation
by: Chen, Yuheng, et al.
Published: (2026)
Visual-Oriented Fine-Grained Knowledge Editing for MultiModal Large Language Models
by: Zeng, Zhen, et al.
Published: (2024)
M3FAS: An Accurate and Robust MultiModal Mobile Face Anti-Spoofing System
by: Kong, Chenqi, et al.
Published: (2023)
M3R: Localized Rainfall Nowcasting with Meteorology-Informed MultiModal Attention
by: Panta, Sanjeev, et al.
Published: (2026)
MMA-DFER: MultiModal Adaptation of unimodal models for Dynamic Facial Expression Recognition in-the-wild
by: Chumachenko, Kateryna, et al.
Published: (2024)
HAC: Parameter-Efficient Hyperbolic Adaptation of CLIP for Zero-Shot VQA
by: Dibitonto, Francesco, et al.
Published: (2026)
MM-MoralBench: A MultiModal Moral Evaluation Benchmark for Large Vision-Language Models
by: Yan, Bei, et al.
Published: (2024)
From Seconds to Hours: Reviewing MultiModal Large Language Models on Comprehensive Long Video Understanding
by: Zou, Heqing, et al.
Published: (2024)
Unified Latents (UL): How to train your latents
by: Heek, Jonathan, et al.
Published: (2026)
M$^2$CD: A Unified MultiModal Framework for Optical-SAR Change Detection with Mixture of Experts and Self-Distillation
by: Liu, Ziyuan, et al.
Published: (2025)
MM-R5: MultiModal Reasoning-Enhanced ReRanker via Reinforcement Learning for Document Retrieval
by: Xu, Mingjun, et al.
Published: (2025)
UDON: Universal Dynamic Online distillatioN for generic image representations
by: Ypsilantis, Nikolaos-Antonios, et al.
Published: (2024)
Beyond Single Tokens: Distilling Discrete Diffusion Models via Discrete MMD
by: Hoogeboom, Emiel, et al.
Published: (2026)
Multistep Distillation of Diffusion Models via Moment Matching
by: Salimans, Tim, et al.
Published: (2024)
Simpler Diffusion (SiD2): 1.5 FID on ImageNet512 with pixel-space diffusion
by: Hoogeboom, Emiel, et al.
Published: (2024)
Dual-Rate Diffusion: Accelerating diffusion models with an interleaved heavy-light network
by: Bartosh, Grigory, et al.
Published: (2026)
LFM-3D: Learnable Feature Matching Across Wide Baselines Using 3D Signals
by: Karpur, Arjun, et al.
Published: (2023)
HierLoc: Hyperbolic Entity Embeddings for Hierarchical Visual Geolocation
by: Gadi, Hari Krishna, et al.
Published: (2026)
HierEdit: Region-Aware Hierarchical Diffusion for Efficient High-Resolution Editing
by: Zhang, Yuyao, et al.
Published: (2026)
Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives
by: Peirone, Simone Alberto, et al.
Published: (2025)
Ready-to-React: Online Reaction Policy for Two-Character Interaction Generation
by: Cen, Zhi, et al.
Published: (2025)
GRAM: Global Reasoning for Multi-Page VQA
by: Blau, Tsachi, et al.
Published: (2024)