| Main Authors: | Liu, Yang; Chen, Binglin; Zheng, Yongsen; Cheng, Lechao; Li, Guanbin; Lin, Liang |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.15734 |
Similar Items
Cross-modal Causal Relation Alignment for Video Question Grounding
by: Chen, Weixing, et al.
Published: (2025)
Beyond the Destination: A Novel Benchmark for Exploration-Aware Embodied Question Answering
by: Jiang, Kaixuan, et al.
Published: (2025)
MarvelOVD: Marrying Object Recognition and Vision-Language Models for Robust Open-Vocabulary Object Detection
by: Wang, Kuo, et al.
Published: (2024)
Dual-domain Adaptation Networks for Realistic Image Super-resolution
by: Fang, Chaowei, et al.
Published: (2025)
GUIDED: Granular Understanding via Identification, Detection, and Discrimination for Fine-Grained Open-Vocabulary Object Detection
by: Li, Jiaming, et al.
Published: (2026)
Decoupled Training with Local Reinforcement Fine-Tuning in Federated Learning
by: Ma, Yuting, et al.
Published: (2026)
DDP-WM: Disentangled Dynamics Prediction for Efficient World Models
by: Yin, Shicheng, et al.
Published: (2026)
Cross-Modal Causal Intervention for Medical Report Generation
by: Chen, Weixing, et al.
Published: (2023)
Towards Long-Horizon Vision-Language Navigation: Platform, Benchmark and Method
by: Song, Xinshuai, et al.
Published: (2024)
evMLP: An Efficient Event-Driven MLP Architecture for Vision
by: Zheng, Zhentan
Published: (2025)
TadML: A fast temporal action detection with Mechanics-MLP
by: Deng, Bowen, et al.
Published: (2022)
Towards Fine-Grained Emotion Understanding via Skeleton-Based Micro-Gesture Recognition
by: Xu, Hao, et al.
Published: (2025)
3DAffordSplat: Efficient Affordance Reasoning with 3D Gaussians
by: Wei, Zeming, et al.
Published: (2025)
Credible Teacher for Semi-Supervised Object Detection in Open Scene
by: Zhuang, Jingyu, et al.
Published: (2024)
MLP Can Be A Good Transformer Learner
by: Lin, Sihao, et al.
Published: (2024)
Self-Prophetic Decoding to Unlock Visual Search in LVLMs
by: He, Zhendong, et al.
Published: (2026)
Efficient Vision Language Model Fine-tuning for Text-based Person Anomaly Search
by: He, Jiayi, et al.
Published: (2025)
DSPNet: Dual-vision Scene Perception for Robust 3D Question Answering
by: Luo, Jingzhou, et al.
Published: (2025)
One Model for All: Unified Try-On and Try-Off in Any Pose via LLM-Inspired Bidirectional Tweedie Diffusion
by: Liu, Jinxi, et al.
Published: (2025)
DenoiseGS: Gaussian Reconstruction Model for Burst Denoising
by: Cheng, Yongsen, et al.
Published: (2025)
FUSU: A Multi-temporal-source Land Use Change Segmentation Dataset for Fine-grained Urban Semantic Understanding
by: Yuan, Shuai, et al.
Published: (2024)
MEIA: Multimodal Embodied Perception and Interaction in Unknown Environments
by: Liu, Yang, et al.
Published: (2024)
Novel Class Discovery for Ultra-Fine-Grained Visual Categorization
by: Liu, Yu, et al.
Published: (2024)
Sim-DETR: Unlock DETR for Temporal Sentence Grounding
by: Tang, Jiajin, et al.
Published: (2025)
StgcDiff: Spatial-Temporal Graph Condition Diffusion for Sign Language Transition Generation
by: He, Jiashu, et al.
Published: (2025)
Are VLMs Lost Between Sky and Space? LinkS$^2$Bench for UAV-Satellite Dynamic Cross-View Spatial Intelligence
by: Liu, Dian, et al.
Published: (2026)
LookasideVLN: Direction-Aware Aerial Vision-and-Language Navigation
by: Ning, Yuwei, et al.
Published: (2026)
Fine-grained Dynamic Network for Generic Event Boundary Detection
by: Zheng, Ziwei, et al.
Published: (2024)
High-fidelity and Lip-synced Talking Face Synthesis via Landmark-based Diffusion Model
by: Zhong, Weizhi, et al.
Published: (2024)
Focus Anywhere for Fine-grained Multi-page Document Understanding
by: Liu, Chenglong, et al.
Published: (2024)
ReVSI: Rebuilding Visual Spatial Intelligence Evaluation for Accurate Assessment of VLM 3D Reasoning
by: Zhang, Yiming, et al.
Published: (2026)
SpiralMLP: A Lightweight Vision MLP Architecture
by: Mu, Haojie, et al.
Published: (2024)
SpatialLM: Training Large Language Models for Structured Indoor Modeling
by: Mao, Yongsen, et al.
Published: (2025)
Learning Background Prompts to Discover Implicit Knowledge for Open Vocabulary Object Detection
by: Li, Jiaming, et al.
Published: (2024)
DAGSM: Disentangled Avatar Generation with GS-enhanced Mesh
by: Zhuang, Jingyu, et al.
Published: (2024)
WildVidFit: Video Virtual Try-On in the Wild via Image-Based Controlled Diffusion Models
by: He, Zijian, et al.
Published: (2024)
Modality Alignment Meets Federated Broadcasting
by: Ma, Yuting, et al.
Published: (2024)
TDEdit: A Unified Diffusion Framework for Text-Drag Guided Image Manipulation
by: Wang, Qihang, et al.
Published: (2025)
Recovering Origin Destination Flows from Bus CCTV: Early Results from Nairobi and Kigali
by: Kyatha, Nthenya, et al.
Published: (2025)
FineParser: A Fine-grained Spatio-temporal Action Parser for Human-centric Action Quality Assessment
by: Xu, Jinglin, et al.
Published: (2024)