Library Catalog

Bibliographic Details
Main Authors:	Yu, Shiyao; Wang, Zi-An; Yin, Kangning; Tian, Zheng; Zhang, Mingyuan; Si, Weixin; Zou, Shihao
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2507.23188

Similar Items

Tri-Modal Motion Retrieval by Learning a Joint Embedding Space
by: Yin, Kangning, et al.
Published: (2024)

Semantics-Aware Human Motion Generation from Audio Instructions
by: Wang, Zi-An, et al.
Published: (2025)

RACon: Retrieval-Augmented Simulated Character Locomotion Control
by: Mu, Yuxuan, et al.
Published: (2024)

Highly Efficient 3D Human Pose Tracking from Events with Spiking Spatiotemporal Transformer
by: Zou, Shihao, et al.
Published: (2023)

TempDiffReg: Temporal Diffusion Model for Non-Rigid 2D-3D Vascular Registration
by: Liu, Zehua, et al.
Published: (2026)

Learning Where to Embed: Noise-Aware Positional Embedding for Query Retrieval in Small-Object Detection
by: Zeng, Yangchen, et al.
Published: (2026)

NEXT: Multi-Grained Mixture of Experts via Text-Modulation for Multi-Modal Object Re-Identification
by: Li, Shihao, et al.
Published: (2025)

Lang2Motion: Bridging Language and Motion through Joint Embedding Spaces
by: Galoaa, Bishoy, et al.
Published: (2025)

Large Motion Model for Unified Multi-Modal Motion Generation
by: Zhang, Mingyuan, et al.
Published: (2024)

CR-JEPA: Cross-Modal Joint-Embedding Predictive Learning for Remote Sensing Image Retrieval
by: Hossain, Md Aminur, et al.
Published: (2026)

Toward Real-Time Surgical Scene Segmentation via a Spike-Driven Video Transformer with Spike-Informed Pretraining
by: Zou, Shihao, et al.
Published: (2025)

AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning
by: Xue, Junxiao, et al.
Published: (2026)

Fine-grained Motion Retrieval via Joint-Angle Motion Images and Token-Patch Late Interaction
by: Zhang, Yao, et al.
Published: (2026)

WaMo: Wavelet-Enhanced Multi-Frequency Trajectory Analysis for Fine-Grained Text-Motion Retrieval
by: Ren, Junlong, et al.
Published: (2025)

Multi-Grained Compositional Visual Clue Learning for Image Intent Recognition
by: Tang, Yin, et al.
Published: (2025)

Mixture-of-Mamba: Enhancing Multi-Modal State-Space Models with Modality-Aware Sparsity
by: Liang, Weixin, et al.
Published: (2025)

Beyond Global Alignment: Fine-Grained Motion-Language Retrieval via Pyramidal Shapley-Taylor Learning
by: Chen, Hanmo, et al.
Published: (2026)

Learning to Align Generative Appearance Priors for Fine-grained Image Retrieval
by: Wang, Shijie, et al.
Published: (2026)

Learning Parallax for Stereo Event-based Motion Deblurring
by: Lin, Mingyuan, et al.
Published: (2023)

Towards Unified Surgical Scene Understanding: Bridging Reasoning and Grounding via MLLMs
by: Huang, Jincai, et al.
Published: (2026)

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval
by: Zhou, Junjie, et al.
Published: (2024)

Fine-Grained Scene Image Classification with Modality-Agnostic Adapter
by: Wang, Yiqun, et al.
Published: (2024)

PAS-Mamba: Phase-Amplitude-Spatial State Space Model for MRI Reconstruction
by: Kui, Xiaoyan, et al.
Published: (2026)

Language-driven Fine-grained Retrieval
by: Wang, Shijie, et al.
Published: (2025)

A-JEPA: Joint-Embedding Predictive Architecture Can Listen
by: Fei, Zhengcong, et al.
Published: (2023)

Advancing Fine-Grained Visual Understanding with Multi-Scale Alignment in Multi-Modal Models
by: Wang, Wei, et al.
Published: (2024)

Multi-Modal Generative Embedding Model
by: Ma, Feipeng, et al.
Published: (2024)

Multi-entity Video Transformers for Fine-Grained Video Representation Learning
by: Walmer, Matthew, et al.
Published: (2023)

Semantics-Oriented Multitask Learning for DeepFake Detection: A Joint Embedding Approach
by: Zou, Mian, et al.
Published: (2024)

Fine-Grained Knowledge Structuring and Retrieval for Visual Question Answering
by: Zhang, Zhengxuan, et al.
Published: (2025)

Fine-Grained Spatiotemporal Motion Alignment for Contrastive Video Representation Learning
by: Zhu, Minghao, et al.
Published: (2023)

MS-DETR: Towards Effective Video Moment Retrieval and Highlight Detection by Joint Motion-Semantic Learning
by: Ma, Hongxu, et al.
Published: (2025)

MotionCharacter: Fine-Grained Motion Controllable Human Video Generation
by: Fang, Haopeng, et al.
Published: (2024)

MotionSight: Boosting Fine-Grained Motion Understanding in Multimodal LLMs
by: Du, Yipeng, et al.
Published: (2025)

Retrieval Robust to Object Motion Blur
by: Zou, Rong, et al.
Published: (2024)

Diffusion-Based Cross-Modal Feature Extraction for Multi-Label Classification
by: Lan, Tian, et al.
Published: (2025)

FineXtrol: Controllable Motion Generation via Fine-Grained Text
by: Shen, Keming, et al.
Published: (2025)

Modality-Agnostic Structural Image Representation Learning for Deformable Multi-Modality Medical Image Registration
by: Mok, Tony C. W., et al.
Published: (2024)

Viper-F1: Fast and Fine-Grained Multimodal Understanding with Cross-Modal State-Space Modulation
by: Trinh, Quoc-Huy
Published: (2025)

Personalizing Retrieval using Joint Embeddings or "the Return of Fluffy"
by: Korbar, Bruno, et al.
Published: (2025)