| Main Authors: | Yu, Shiyao, Wang, Zi-An, Yin, Kangning, Tian, Zheng, Zhang, Mingyuan, Si, Weixin, Zou, Shihao |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2507.23188 |
Similar Items
Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
by: Yin, Kangning, et al.
Published: (2024)
Semantics-Aware Human Motion Generation from Audio Instructions
by: Wang, Zi-An, et al.
Published: (2025)
RACon: Retrieval-Augmented Simulated Character Locomotion Control
by: Mu, Yuxuan, et al.
Published: (2024)
Highly Efficient 3D Human Pose Tracking from Events with Spiking Spatiotemporal Transformer
by: Zou, Shihao, et al.
Published: (2023)
TempDiffReg: Temporal Diffusion Model for Non-Rigid 2D-3D Vascular Registration
by: Liu, Zehua, et al.
Published: (2026)
Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
by: Zeng, Yangchen, et al.
Published: (2026)
NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification
by: Li, Shihao, et al.
Published: (2025)
Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces
by: Galoaa, Bishoy, et al.
Published: (2025)
Large Motion Model for Unified Multi-Modal Motion Generation
by: Zhang, Mingyuan, et al.
Published: (2024)
CR-JEPA: Cross-Modal Joint-Embedding Predictive Learning for Remote Sensing Image Retrieval
by: Hossain, Md Aminur, et al.
Published: (2026)
Toward Real-Time Surgical Scene Segmentation via a Spike-Driven Video Transformer with Spike-Informed Pretraining
by: Zou, Shihao, et al.
Published: (2025)
AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning
by: Xue, Junxiao, et al.
Published: (2026)
Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
by: Zhang, Yao, et al.
Published: (2026)
WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval
by: Ren, Junlong, et al.
Published: (2025)
Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition
by: Tang, Yin, et al.
Published: (2025)
Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
by: Liang, Weixin, et al.
Published: (2025)
Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning
by: Chen, Hanmo, et al.
Published: (2026)
Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval
by: Wang, Shijie, et al.
Published: (2026)
Learning Parallax for Stereo Event-based Motion Deblurring
by: Lin, Mingyuan, et al.
Published: (2023)
Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
by: Huang, Jincai, et al.
Published: (2026)
VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
by: Zhou, Junjie, et al.
Published: (2024)
Fine-Grained Scene Image Classification with Modality-Agnostic Adapter
by: Wang, Yiqun, et al.
Published: (2024)
PAS-Mamba: Phase-Amplitude-Spatial State Space Model for MRI Reconstruction
by: Kui, Xiaoyan, et al.
Published: (2026)
Language-driven Fine-grained Retrieval
by: Wang, Shijie, et al.
Published: (2025)
A-JEPA: Joint-Embedding Predictive Architecture Can Listen
by: Fei, Zhengcong, et al.
Published: (2023)
Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
by: Wang, Wei, et al.
Published: (2024)
Multi-Modal Generative Embedding Model
by: Ma, Feipeng, et al.
Published: (2024)
Multi-entity Video Transformers for Fine-Grained Video Representation Learning
by: Walmer, Matthew, et al.
Published: (2023)
Semantics-Oriented Multitask Learning for DeepFake Detection: A Joint Embedding Approach
by: Zou, Mian, et al.
Published: (2024)
Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
by: Zhang, Zhengxuan, et al.
Published: (2025)
Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
by: Zhu, Minghao, et al.
Published: (2023)
MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning
by: Ma, Hongxu, et al.
Published: (2025)
MotionCharacter: Fine-Grained Motion Controllable Human Video Generation
by: Fang, Haopeng, et al.
Published: (2024)
MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
by: Du, Yipeng, et al.
Published: (2025)
Retrieval Robust to Object Motion Blur
by: Zou, Rong, et al.
Published: (2024)
Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification
by: Lan, Tian, et al.
Published: (2025)
FineXtrol: Controllable Motion Generation via Fine-Grained Text
by: Shen, Keming, et al.
Published: (2025)
Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration
by: Mok, Tony C. W., et al.
Published: (2024)
Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
by: Trinh, Quoc-Huy
Published: (2025)
Personalizing Retrieval using Joint Embeddings or "the Return of Fluffy"
by: Korbar, Bruno, et al.
Published: (2025)