:: Library Catalog

Bibliographic Details
Main Authors:	Li, Zhiqi; Chen, Guo; Liu, Shilong; Wang, Shihao; VS, Vibashan; Ji, Yishen; Lan, Shiyi; Zhang, Hao; Zhao, Yilin; Radhakrishnan, Subhashree; Chang, Nadine; Sapra, Karan; Deshmukh, Amala Sanjay; Rintamaki, Tuomas; Le, Matthieu; Karmanov, Ilia; Voegtle, Lukas; Fischer, Philipp; Huang, De-An; Roman, Timo; Lu, Tong; Alvarez, Jose M.; Catanzaro, Bryan; Kautz, Jan; Tao, Andrew; Liu, Guilin; Yu, Zhiding
Format:	Preprint
Published:	2025
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence; Machine Learning
Online Access:	https://arxiv.org/abs/2501.14818

Similar Items

Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
by: Shi, Min, et al.
Published: (2024)

Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents
by: Karmanov, Ilia, et al.
Published: (2025)

Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
by: Chen, Guo, et al.
Published: (2025)

Stateful Token Reduction for Long-Video Hybrid VLMs
by: Jiang, Jindong, et al.
Published: (2026)

FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
by: Huang, De-An, et al.
Published: (2025)

LITA: Language Instructed Temporal-Localization Assistant
by: Huang, De-An, et al.
Published: (2024)

Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
by: Li, Zhenxin, et al.
Published: (2024)

FaceXBench: Evaluating Multimodal LLMs on Face Understanding
by: Narayan, Kartik, et al.
Published: (2025)

Certainty and Uncertainty Guided Active Domain Adaptation
by: Safaei, Bardia, et al.
Published: (2025)

SegFace: Face Segmentation of Long-Tail Classes
by: Narayan, Kartik, et al.
Published: (2024)

AIDE: Agentically Improve Visual Language Model with Domain Experts
by: Chiu, Ming-Chang, et al.
Published: (2025)

LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
by: Wang, Shihao, et al.
Published: (2026)

What is Point Supervision Worth in Video Instance Segmentation?
by: Huang, Shuaiyi, et al.
Published: (2024)

Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
by: Li, Zhiqi, et al.
Published: (2023)

FaceXFormer: A Unified Transformer for Facial Analysis
by: Narayan, Kartik, et al.
Published: (2024)

Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
by: Zhang, Shaokun, et al.
Published: (2025)

OMCAT: Omni Context Aware Transformer
by: Goel, Arushi, et al.
Published: (2024)

PhyCritic: Multimodal Critic Models for Physical AI
by: Xiong, Tianyi, et al.
Published: (2026)

StreamChat: Chatting with Streaming Video
by: Liu, Jihao, et al.
Published: (2024)

ImagineMap: Enhanced HD Map Construction with SD Maps
by: Ji, Yishen, et al.
Published: (2024)

VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
by: Wang, Shihao, et al.
Published: (2025)

Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
by: Man, Yunze, et al.
Published: (2025)

Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models
by: Ranasinghe, Yasiru, et al.
Published: (2025)

Mensch - Maske - Tier. Zu den Entstehungsbedingungen der Karikatur
by: Simone Voegtle
Published: (2017)

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
by: Wang, Shihao, et al.
Published: (2025)

OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
by: Wang, Shihao, et al.
Published: (2024)

NVIDIA Nemotron Parse 1.1
by: Chumachenko, Kateryna, et al.
Published: (2025)

PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
by: Schmalfuss, Jenny, et al.
Published: (2025)

Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
by: Chen, Guo, et al.
Published: (2026)

Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
by: Li, Zhenxin, et al.
Published: (2025)

NVLM: Open Frontier-Class Multimodal LLMs
by: Dai, Wenliang, et al.
Published: (2024)

PosSAM: Panoptic Open-vocabulary Segment Anything
by: VS, Vibashan, et al.
Published: (2024)

LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
by: Man, Yunze, et al.
Published: (2025)

3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
by: Zhen, Haoyu, et al.
Published: (2026)

StereoDETR: Stereo-based Transformer for 3D Object Detection
by: Mu, Shiyi, et al.
Published: (2025)

Old age, high risk medication, polypharmacy: a ‘trilogy’ of risks in older patients with atrial fibrillation
by: Yishen WANG
Published: (2016)

Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
by: Liu, Yishen
Published: (2025)

Developing an Ontology for AI Act Fundamental Rights Impact Assessments
by: Rintamaki, Tytti, et al.
Published: (2024)

Towards An Automated AI Act FRIA Tool That Can Reuse GDPR's DPIA
by: Rintamaki, Tytti, et al.
Published: (2024)

Slow-Fast Architecture for Video Multi-Modal Large Language Models
by: Shi, Min, et al.
Published: (2025)