| Main Authors: | Wang, Xiao; Wang, Shiao; Ding, Yuhe; Li, Yuehang; Wu, Wentao; Rong, Yao; Kong, Weizhe; Huang, Ju; Li, Shihao; Yang, Haoxiang; Wang, Ziwen; Jiang, Bo; Li, Chenglong; Wang, Yaowei; Tian, Yonghong; Tang, Jin |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.09516 |
Similar Items
Large-scale Multi-Modal Pre-trained Models: A Comprehensive Survey
by: Wang, Xiao, et al.
Published: (2023)
SSTFormer: Bridging Spiking Neural Network and Memory Support Transformer for Frame-Event based Recognition
by: Wang, Xiao, et al.
Published: (2023)
T2I-VeRW: Part-level Fine-grained Perception for Text-to-Image Vehicle Retrieval
by: Wang, Xiao, et al.
Published: (2026)
Continual Panoptic Perception: Towards Multi-modal Incremental Interpretation of Remote Sensing Images
by: Yuan, Bo, et al.
Published: (2024)
CoPRS: Learning Positional Prior from Chain-of-Thought for Reasoning Segmentation
by: Lu, Zhenyu, et al.
Published: (2025)
Imagine Before Concentration: Diffusion-Guided Registers Enhance Partially Relevant Video Retrieval
by: Li, Jun, et al.
Published: (2026)
HDiffTG: A Lightweight Hybrid Diffusion-Transformer-GCN Architecture for 3D Human Pose Estimation
by: Fu, Yajie, et al.
Published: (2025)
AutoSSVH: Exploring Automated Frame Sampling for Efficient Self-Supervised Video Hashing
by: Lian, Niu, et al.
Published: (2025)
Distilling Implicit Multimodal Knowledge into Large Language Models for Zero-Resource Dialogue Generation
by: Zhang, Bo, et al.
Published: (2024)
SequencePAR: Understanding Pedestrian Attributes via A Sequence Generation Paradigm
by: Jin, Jiandong, et al.
Published: (2023)
BlobCtrl: Taming Controllable Blob for Element-level Image Editing
by: Li, Yaowei, et al.
Published: (2025)
Order Is Not Layout: Order-to-Space Bias in Image Generation
by: Zhang, Yongkang, et al.
Published: (2026)
Image Conductor: Precision Control for Interactive Video Synthesis
by: Li, Yaowei, et al.
Published: (2024)
Text2Sign Diffusion: A Generative Approach for Gloss-Free Sign Language Production
by: Feng, Liqian, et al.
Published: (2025)
HSVLT: Hierarchical Scale-Aware Vision-Language Transformer for Multi-Label Image Classification
by: Ouyang, Shuyi, et al.
Published: (2024)
Remember Past, Anticipate Future: Learning Continual Multimodal Misinformation Detectors
by: Wang, Bing, et al.
Published: (2025)
CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception
by: Li, Liupeng, et al.
Published: (2026)
PixelThink: Towards Efficient Chain-of-Pixel Reasoning
by: Wang, Song, et al.
Published: (2025)
LaF-GRPO: In-Situ Navigation Instruction Generation for the Visually Impaired via GRPO with LLM-as-Follower Reward
by: Zhao, Yi, et al.
Published: (2025)
Revisiting Uncertainty: On Evidential Learning for Partially Relevant Video Retrieval
by: Li, Jun, et al.
Published: (2026)
HLFormer: Enhancing Partially Relevant Video Retrieval with Hyperbolic Learning
by: Li, Jun, et al.
Published: (2025)
HR-INR: Continuous Space-Time Video Super-Resolution via Event Camera
by: Lu, Yunfan, et al.
Published: (2024)
A Survey of Multimodal Large Language Model from A Data-centric Perspective
by: Bai, Tianyi, et al.
Published: (2024)
Look, Compare and Draw: Differential Query Transformer for Automatic Oil Painting
by: Liu, Lingyu, et al.
Published: (2026)
Towards Generalizable Deepfake Detection via Forgery-aware Audio-Visual Adaptation: A Variational Bayesian Approach
by: Nie, Fan, et al.
Published: (2025)
Deep Shape-Texture Statistics for Completely Blind Image Quality Evaluation
by: Li, Yixuan, et al.
Published: (2024)
Enhancing Multimodal Misinformation Detection by Replaying the Whole Story from Image Modality Perspective
by: Wang, Bing, et al.
Published: (2025)
SegCompass: Exploring Interpretable Alignment with Sparse Autoencoders for Enhanced Reasoning Segmentation
by: Lu, Zhenyu, et al.
Published: (2026)
Harmfully Manipulated Images Matter in Multimodal Misinformation Detection
by: Wang, Bing, et al.
Published: (2024)
Sign Language Translation using Frame and Event Stream: Benchmark Dataset and Algorithms
by: Wang, Xiao, et al.
Published: (2025)
A Survey of Information Disorder on Video-Sharing Platforms
by: Li, Meiyu, et al.
Published: (2025)
M3FAS: An Accurate and Robust MultiModal Mobile Face Anti-Spoofing System
by: Kong, Chenqi, et al.
Published: (2023)
CartoAgent: a multimodal large language model-powered multi-agent cartographic framework for map style transfer and evaluation
by: Wang, Chenglong, et al.
Published: (2025)
L3TC: Leveraging RWKV for Learned Lossless Low-Complexity Text Compression
by: Zhang, Junxuan, et al.
Published: (2024)
Retrieval-Augmented Multimodal Model for Fake News Detection
by: Li, Yiheng, et al.
Published: (2026)
Generalized Face Forgery Detection via Adaptive Learning for Pre-trained Vision Transformer
by: Luo, Anwei, et al.
Published: (2023)
Listening to the Unspoken: Exploring "365" Aspects of Multimodal Interview Performance Assessment
by: Li, Jia, et al.
Published: (2025)
A Novel Approach to Industrial Defect Generation through Blended Latent Diffusion Model with Online Adaptation
by: Li, Hanxi, et al.
Published: (2024)
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
by: Zhang, Zhenxing, et al.
Published: (2024)
A Light-weight Transformer-based Self-supervised Matching Network for Heterogeneous Images
by: Zhang, Wang, et al.
Published: (2024)