:: Library Catalog

Bibliographic Details
Main Authors:	Ding, Xi; Wang, Lei
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2412.13845

Similar Items

Learning Time in Static Classifiers
by: Ding, Xi, et al.
Published: (2025)

Quo Vadis, Anomaly Detection? LLMs and VLMs in the Spotlight
by: Ding, Xi, et al.
Published: (2024)

Subspace Kernel Learning on Tensor Sequences
by: Wang, Lei, et al.
Published: (2026)

Graph Your Own Prompt
by: Ding, Xi, et al.
Published: (2025)

Trust-Aware Joint Feature-Prediction Discrepancy for Robust Domain Adaptation
by: Ding, Xi, et al.
Published: (2026)

Optimization-Free Test-Time Adaptation for Cross-Person Activity Recognition
by: Wang, Shuoyuan, et al.
Published: (2023)

Do Understanding and Generation Fight? A Diagnostic Study of DPO for Unified Multimodal Models
by: Rao, Abinav, et al.
Published: (2026)

Composition Vision-Language Understanding via Segment and Depth Anything Model
by: Huo, Mingxiao, et al.
Published: (2024)

PSA: Pyramid Sparse Attention for Efficient Video Understanding and Generation
by: Li, Xiaolong, et al.
Published: (2025)

Video Understanding by Design: How Datasets Shape Architectures and Insights
by: Wang, Lei, et al.
Published: (2025)

Language-Image Models with 3D Understanding
by: Cho, Jang Hyun, et al.
Published: (2024)

See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models
by: Nguyen, Le Thien Phuc, et al.
Published: (2025)

Effortless Active Labeling for Long-Term Test-Time Adaptation
by: Wang, Guowei, et al.
Published: (2025)

Do Transformers Understand Ancient Roman Coin Motifs Better than CNNs?
by: Reid, David, et al.
Published: (2026)

Understanding Retrieval-Augmented Task Adaptation for Vision-Language Models
by: Ming, Yifei, et al.
Published: (2024)

HyDRA: Hierarchical and Dynamic Rank Adaptation for Mobile Vision Language Model
by: Xi, Yuanhao, et al.
Published: (2025)

The Underappreciated Power of Vision Models for Graph Structural Understanding
by: Zhao, Xinjian, et al.
Published: (2025)

VGBench: Evaluating Large Language Models on Vector Graphics Understanding and Generation
by: Zou, Bocheng, et al.
Published: (2024)

Adaptive Keyframe Sampling for Long Video Understanding
by: Tang, Xi, et al.
Published: (2025)

Doubly Debiased Test-Time Prompt Tuning for Vision-Language Models
by: Song, Fei, et al.
Published: (2025)

ModelGrow: Continual Text-to-Video Pre-training with Model Expansion and Language Understanding Enhancement
by: Rao, Zhefan, et al.
Published: (2024)

Circuit Tracing in Vision-Language Models: Understanding the Internal Mechanisms of Multimodal Thinking
by: Yang, Jingcheng, et al.
Published: (2026)

Tree of Attributes Prompt Learning for Vision-Language Models
by: Ding, Tong, et al.
Published: (2024)

About Time: Advances, Challenges, and Outlooks of Action Understanding
by: Stergiou, Alexandros, et al.
Published: (2024)

Towards Generalisable Time Series Understanding Across Domains
by: Turgut, Özgün, et al.
Published: (2024)

AdaNeg: Adaptive Negative Proxy Guided OOD Detection with Vision-Language Models
by: Zhang, Yabin, et al.
Published: (2024)

Just Shift It: Test-Time Prototype Shifting for Zero-Shot Generalization with Vision-Language Models
by: Sui, Elaine, et al.
Published: (2024)

Understanding the Effects of Distractors on Reasoning Vision-Language Models
by: Bae, Jiyun, et al.
Published: (2025)

OpenSUN3D: 1st Workshop Challenge on Open-Vocabulary 3D Scene Understanding
by: Engelmann, Francis, et al.
Published: (2024)

Harnessing Vision-Language Models for Time Series Anomaly Detection
by: He, Zelin, et al.
Published: (2025)

RGB-Th-Bench: A Dense benchmark for Visual-Thermal Understanding of Vision Language Models
by: Moshtaghi, Mehdi, et al.
Published: (2025)

Leveraging Large Language Models for Scalable Vector Graphics-Driven Image Understanding
by: Cai, Mu, et al.
Published: (2023)

Feedback-based Modal Mutual Search for Attacking Vision-Language Pre-training Models
by: Ding, Renhua, et al.
Published: (2024)

HourVideo: 1-Hour Video-Language Understanding
by: Chandrasegaran, Keshigeyan, et al.
Published: (2024)

Understanding the Fine-Grained Knowledge Capabilities of Vision-Language Models
by: Ghosh, Dhruba, et al.
Published: (2026)

Argus Inspection: Do Multimodal Large Language Models Possess the Eye of Panoptes?
by: Yao, Yang, et al.
Published: (2025)

VLM See, Robot Do: Human Demo Video to Robot Action Plan via Vision Language Model
by: Wang, Beichen, et al.
Published: (2024)

CaTS-Bench: Can Language Models Describe Time Series?
by: Zhou, Luca, et al.
Published: (2025)

How Do Vision-Language Models Process Conflicting Information Across Modalities?
by: Hua, Tianze, et al.
Published: (2025)

Can Large Language Models Understand Symbolic Graphics Programs?
by: Qiu, Zeju, et al.
Published: (2024)