| Main Authors: | Li, Peiyu, Huang, Xiaobao, Tian, Yijun, Chawla, Nitesh V. |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online Access: | https://arxiv.org/abs/2409.12010 |
Similar Items
Do Multimodal Large Language Models Understand Welding?
by: Khvatskii, Grigorii, et al.
Published: (2025)
FoodMLLM-JP: Leveraging Multimodal Large Language Models for Japanese Recipe Generation
by: Imajuku, Yuki, et al.
Published: (2024)
RecipeGen: A Benchmark for Real-World Recipe Image Generation
by: Zhang, Ruoxuan, et al.
Published: (2025)
RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation
by: Zhang, Ruoxuan, et al.
Published: (2025)
AgentDrug: Utilizing Large Language Models in An Agentic Workflow for Zero-Shot Molecular Editing
by: Le, Khiem, et al.
Published: (2024)
VisualChef: Generating Visual Aids in Cooking via Mask Inpainting
by: Kuzyk, Oleh, et al.
Published: (2025)
MolX: Enhancing Large Language Models for Molecular Understanding With A Multi-Modal Extension
by: Le, Khiem, et al.
Published: (2024)
LayerFusion: Harmonized Multi-Layer Text-to-Image Generation with Generative Priors
by: Dalva, Yusuf, et al.
Published: (2024)
Towards Unbiased Cross-Modal Representation Learning for Food Image-to-Recipe Retrieval
by: Wang, Qing, et al.
Published: (2025)
Generating Multimodal Images with GAN: Integrating Text, Image, and Style
by: Tan, Chaoyi, et al.
Published: (2025)
OpenSearch-VL: An Open Recipe for Frontier Multimodal Search Agents
by: Chen, Shuang, et al.
Published: (2026)
Retrieval Augmented Recipe Generation
by: Liu, Guoshan, et al.
Published: (2024)
FakeScope: Large Multimodal Expert Model for Transparent AI-Generated Image Forensics
by: Li, Yixuan, et al.
Published: (2025)
Task-Generalized Adaptive Cross-Domain Learning for Multimodal Image Fusion
by: Wang, Mengyu, et al.
Published: (2025)
FlexID: Training-Free Flexible Identity Injection via Intent-Aware Modulation for Text-to-Image Generation
by: Li, Guandong, et al.
Published: (2026)
Plug-and-Play Logit Fusion for Heterogeneous Pathology Foundation Models
by: Huang, Gexin, et al.
Published: (2026)
MTA-Agent: An Open Recipe for Multimodal Deep Search Agents
by: Peng, Xiangyu, et al.
Published: (2026)
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models
by: Zhu, Jinguo, et al.
Published: (2025)
GraphAvatar: Compact Head Avatars with GNN-Generated 3D Gaussians
by: Wei, Xiaobao, et al.
Published: (2024)
Object-Conditioned Energy-Based Attention Map Alignment in Text-to-Image Diffusion Models
by: Zhang, Yasi, et al.
Published: (2024)
Uni-Fusion: Universal Continuous Mapping
by: Yuan, Yijun, et al.
Published: (2023)
Fusion of Foundation and Vision Transformer Model Features for Dermatoscopic Image Classification
by: Mahbod, Amirreza, et al.
Published: (2025)
LIT: Large Language Model Driven Intention Tracking for Proactive Human-Robot Collaboration -- A Robot Sous-Chef Application
by: Huang, Zhe, et al.
Published: (2024)
Rank-Aware Agglomeration of Foundation Models for Immunohistochemistry Image Cell Counting
by: Huang, Zuqi, et al.
Published: (2025)
AdaFusion: Prompt-Guided Inference with Adaptive Fusion of Pathology Foundation Models
by: Xiao, Yuxiang, et al.
Published: (2025)
FastV-RAG: Towards Fast and Fine-Grained Video QA with Retrieval-Augmented Generation
by: Li, Gen, et al.
Published: (2026)
Reverse Prompt: Cracking the Recipe Inside Text-to-Image Generation
by: Ren, Zhiyao, et al.
Published: (2025)
Empirical Recipes for Efficient and Compact Vision-Language Models
by: Huang, Jiabo, et al.
Published: (2026)
Foundation Model Embeddings Meet Blended Emotions: A Multimodal Fusion Approach for the BLEMORE Challenge
by: Chapariniya, Masoumeh, et al.
Published: (2026)
Integrating Reinforcement Learning with Visual Generative Models: Foundations and Advances
by: Liang, Yuanzhi, et al.
Published: (2025)
A Generative Foundation Model for Multimodal Histopathology
by: Xiang, Jinxi, et al.
Published: (2026)
Warm Diffusion: Recipe for Blur-Noise Mixture Diffusion Models
by: Hsueh, Hao-Chien, et al.
Published: (2025)
Deep Image-to-Recipe Translation
by: Ma, Jiangqin, et al.
Published: (2024)
SegMamba: Long-range Sequential Modeling Mamba For 3D Medical Image Segmentation
by: Xing, Zhaohu, et al.
Published: (2024)
Rethinking Normalization Strategies and Convolutional Kernels for Multimodal Image Fusion
by: He, Dan, et al.
Published: (2024)
FusionMamba: Dynamic Feature Enhancement for Multimodal Image Fusion with Mamba
by: Xie, Xinyu, et al.
Published: (2024)
Z-Image: An Efficient Image Generation Foundation Model with Single-Stream Diffusion Transformer
by: Image Team, et al.
Published: (2025)
CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation
by: Zhang, Ruoxuan, et al.
Published: (2025)
Rethinking Early-Fusion Strategies for Improved Multimodal Image Segmentation
by: Shen, Zhengwen, et al.
Published: (2025)
RSATalker: Realistic Socially-Aware Talking Head Generation for Multi-Turn Conversation
by: Chen, Peng, et al.
Published: (2026)