| Main Authors: | Zhou, Shengli, Zhu, Jianuo, Huang, Qilin, Wang, Fangjing, Zhang, Yanfu, Zheng, Feng |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2507.01800 |
Similar Items
Scalable Object Relation Encoding for Better 3D Spatial Reasoning in Large Language Models
by: Zhou, Shengli, et al.
Published: (2026)
A Hierarchical Compression Technique for 3D Gaussian Splatting Compression
by: Huang, He, et al.
Published: (2024)
GaussianForest: Hierarchical-Hybrid 3D Gaussian Splatting for Compressed Scene Modeling
by: Zhang, Fengyi, et al.
Published: (2024)
EgoTextVQA: Towards Egocentric Scene-Text Aware Video Question Answering
by: Zhou, Sheng, et al.
Published: (2025)
Modularized Zero-shot VQA with Pre-trained Models
by: Cao, Rui, et al.
Published: (2023)
DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment
by: Li, Xinyue, et al.
Published: (2026)
UniAV: Unified Audio-Visual Perception for Multi-Task Video Event Localization
by: Geng, Tiantian, et al.
Published: (2024)
Context-aware TFL: A Universal Context-aware Contrastive Learning Framework for Temporal Forgery Localization
by: Yin, Qilin, et al.
Published: (2025)
Hierarchical Action Recognition: A Contrastive Video-Language Approach with Hierarchical Interactions
by: Zhang, Rui, et al.
Published: (2024)
Zero-Shot Visual Grounding in 3D Gaussians via View Retrieval
by: Liao, Liwei, et al.
Published: (2025)
Hierarchical Refinement of Universal Multimodal Attacks on Vision-Language Models
by: Zhang, Peng-Fei, et al.
Published: (2026)
Towards Signboard-Oriented Visual Question Answering: ViSignVQA Dataset, Method and Benchmark
by: Nguyen, Hieu Minh, et al.
Published: (2025)
Adaptive 3D Gaussian Splatting Video Streaming
by: Gong, Han, et al.
Published: (2025)
Learning Brain Representation with Hierarchical Visual Embeddings
by: Zheng, Jiawen, et al.
Published: (2026)
Querying Autonomous Vehicle Point Clouds: Enhanced by 3D Object Counting with CounterNet
by: Zhang, Xiaoyu, et al.
Published: (2025)
ReLaX-VQA: Residual Fragment and Layer Stack Extraction for Enhancing Video Quality Assessment
by: Wang, Xinyi, et al.
Published: (2024)
EAR: Enhancing Uni-Modal Representations for Weakly Supervised Audio-Visual Video Parsing
by: Li, Huilai, et al.
Published: (2026)
Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots
by: Zheng, Guangting, et al.
Published: (2025)
Advancing Weakly-Supervised Audio-Visual Video Parsing via Segment-wise Pseudo Labeling
by: Zhou, Jinxing, et al.
Published: (2024)
Enhancing Vision-Language Tracking by Effectively Converting Textual Cues into Visual Cues
by: Feng, X., et al.
Published: (2024)
Hierarchical Sub-action Tree for Continuous Sign Language Recognition
by: Yang, Dejie, et al.
Published: (2025)
Enhancing 3D Gaussian Splatting Compression via Spatial Condition-based Prediction
by: Ma, Jingui, et al.
Published: (2025)
DanceCamera3D: 3D Camera Movement Synthesis with Music and Dance
by: Wang, Zixuan, et al.
Published: (2024)
Optimizing Multimodal LLMs for Egocentric Video Understanding: A Solution for the HD-EPIC VQA Challenge
by: Yang, Sicheng, et al.
Published: (2026)
VideoForest: Person-Anchored Hierarchical Reasoning for Cross-Video Question Answering
by: Meng, Yiran, et al.
Published: (2025)
Hierarchical Textual Knowledge for Enhanced Image Clustering
by: Zhong, Yijie, et al.
Published: (2026)
Mono3DVG-EnSD: Enhanced Spatial-aware and Dimension-decoupled Text Encoding for Monocular 3D Visual Grounding
by: Li, Yuzhen, et al.
Published: (2025)
HiScene: Creating Hierarchical 3D Scenes with Isometric View Generation
by: Dong, Wenqi, et al.
Published: (2025)
DIVA-VQA: Detecting Inter-frame Variations in UGC Video Quality
by: Wang, Xinyi, et al.
Published: (2025)
SCENEFORGE: Enhancing 3D-text alignment with Structured Scene Compositions
by: Sbrolli, Cristian, et al.
Published: (2025)
MVBIND: Self-Supervised Music Recommendation For Videos Via Embedding Space Binding
by: Teng, Jiajie, et al.
Published: (2024)
Learn 3D VQA Better with Active Selection and Reannotation
by: Zhou, Shengli, et al.
Published: (2025)
Learning Generalizable and Efficient Image Watermarking via Hierarchical Two-Stage Optimization
by: Liu, Ke, et al.
Published: (2025)
Anchoring Emotions in Text: Robust Multimodal Fusion for Mimicry Intensity Estimation
by: Zhu, Lingsi, et al.
Published: (2026)
MM-Point: Multi-View Information-Enhanced Multi-Modal Self-Supervised 3D Point Cloud Understanding
by: Yu, Hai-Tao, et al.
Published: (2024)
Decompose and Transfer: CoT-Prompting Enhanced Alignment for Open-Vocabulary Temporal Action Detection
by: Zhu, Sa, et al.
Published: (2026)
Enhancing Interactive Image Retrieval With Query Rewriting Using Large Language Models and Vision Language Models
by: Zhu, Hongyi, et al.
Published: (2024)
InstructHumans: Editing Animated 3D Human Textures with Instructions
by: Zhu, Jiayin, et al.
Published: (2024)
Ego3DT: Tracking Every 3D Object in Ego-centric Videos
by: Hao, Shengyu, et al.
Published: (2024)
ASAP: Advancing Semantic Alignment Promotes Multi-Modal Manipulation Detecting and Grounding
by: Zhang, Zhenxing, et al.
Published: (2024)