| Main Authors: | Zheng, Jiawen; Jia, Haonan; Li, Ming; Zheng, Yuhui; Zeng, Yufeng; Gao, Yang; Liang, Chen |
|---|---|
| Format: | Preprint |
| Published: | 2026 |
| Online Access: | https://arxiv.org/abs/2602.07495 |
Similar Items
HMPE:HeatMap Embedding for Efficient Transformer-Based Small Object Detection
by: Zeng, YangChen
Published: (2025)
Hierarchical Semantic Correlation-Aware Masked Autoencoder for Unsupervised Audio-Visual Representation Learning
by: Zeng, Donghuo, et al.
Published: (2026)
Agent Journey Beyond RGB: Hierarchical Semantic-Spatial Representation Enrichment for Vision-and-Language Navigation
by: Zhang, Xuesong, et al.
Published: (2024)
Noise-Tolerant Learning for Audio-Visual Action Recognition
by: Han, Haochen, et al.
Published: (2022)
GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning
by: Diao, Haiwen, et al.
Published: (2024)
Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
by: Liao, Liwei, et al.
Published: (2025)
Wills Aligner: Multi-Subject Collaborative Brain Visual Decoding
by: Bao, Guangyin, et al.
Published: (2024)
PathVLM-R1: A Reinforcement Learning-Driven Reasoning Model for Pathology Visual-Language Tasks
by: Wu, Jianyu, et al.
Published: (2025)
ROI-Guided Point Cloud Geometry Compression Towards Human and Machine Vision
by: Liang, Xie, et al.
Published: (2025)
Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots
by: Zheng, Guangting, et al.
Published: (2025)
Hierarchical Sub-action Tree for Continuous Sign Language Recognition
by: Yang, Dejie, et al.
Published: (2025)
Visual Semantic Description Generation with MLLMs for Image-Text Matching
by: Chen, Junyu, et al.
Published: (2025)
MM-MovieDubber: Towards Multi-Modal Learning for Multi-Modal Movie Dubbing
by: Zheng, Junjie, et al.
Published: (2025)
PersonaGest: Personalized Co-Speech Gesture Generation with Semantic-Guided Hierarchical Motion Representation
by: Zhao, Junchuan, et al.
Published: (2026)
Towards Universal Modal Tracking with Online Dense Temporal Token Learning
by: Zheng, Yaozong, et al.
Published: (2025)
HCNQA: Enhancing 3D VQA with Hierarchical Concentration Narrowing Supervision
by: Zhou, Shengli, et al.
Published: (2025)
Towards Flexible Evaluation for Generative Visual Question Answering
by: Ji, Huishan, et al.
Published: (2024)
OS-HGAdapter: Open Semantic Hypergraph Adapter for Large Language Models Assisted Entropy-Enhanced Image-Text Alignment
by: Chen, Rongjun, et al.
Published: (2025)
Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization
by: Liu, Ke, et al.
Published: (2025)
Learning Video Context as Interleaved Multimodal Sequences
by: Lin, Kevin Qinghong, et al.
Published: (2024)
SonoWorld: From One Image to a 3D Audio-Visual Scene
by: Jin, Derong, et al.
Published: (2026)
Teacher-Guided Pseudo Supervision and Cross-Modal Alignment for Audio-Visual Video Parsing
by: Chen, Yaru, et al.
Published: (2025)
DIRECT: Video Mashup Creation via Hierarchical Multi-Agent Planning and Intent-Guided Editing
by: Li, Ke, et al.
Published: (2026)
Delving Deeper: Hierarchical Visual Perception for Robust Video-Text Retrieval
by: Xie, Zequn, et al.
Published: (2026)
Deep Reversible Consistency Learning for Cross-modal Retrieval
by: Pu, Ruitao, et al.
Published: (2025)
EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing
by: Li, Huilai, et al.
Published: (2026)
Hierarchical Knowledge Graphs for Story Understanding in Visual Narratives
by: Chen, Yi-Chun
Published: (2025)
Beyond Walking: A Large-Scale Image-Text Benchmark for Text-based Person Anomaly Search
by: Yang, Shuyu, et al.
Published: (2024)
Attributes-aware Visual Emotion Representation Learning
by: Maharjan, Rahul Singh, et al.
Published: (2025)
Factorized Learning for Temporally Grounded Video-Language Models
by: Zeng, Wenzheng, et al.
Published: (2025)
ASR-enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
by: Zhao, Ruixiang, et al.
Published: (2024)
Beyond Patches: Global-aware Autoregressive Model for Multimodal Few-Shot Font Generation
by: Cai, Haonan, et al.
Published: (2026)
Hypergraph Tversky-Aware Domain Incremental Learning for Brain Tumor Segmentation with Missing Modalities
by: Wang, Junze, et al.
Published: (2025)
VCap: Hypergeometric Rewards for Weak-to-Strong Visual Captioning
by: Lu, Xingyu, et al.
Published: (2026)
Causal Debiasing for Visual Commonsense Reasoning
by: Zou, Jiayi, et al.
Published: (2025)
CAMeL: Cross-modality Adaptive Meta-Learning for Text-based Person Retrieval
by: Yu, Hang, et al.
Published: (2025)
Loc4Plan: Locating Before Planning for Outdoor Vision and Language Navigation
by: Tian, Huilin, et al.
Published: (2024)
Language-Guided Diffusion Model for Visual Grounding
by: Chen, Sijia, et al.
Published: (2023)
LinVT: Empower Your Image-level Large Language Model to Understand Videos
by: Gao, Lishuai, et al.
Published: (2024)
CustomContrast: A Multilevel Contrastive Perspective For Subject-Driven Text-to-Image Customization
by: Chen, Nan, et al.
Published: (2024)