| Main Authors: | Li, Zhiqi, Chen, Guo, Liu, Shilong, Wang, Shihao, VS, Vibashan, Ji, Yishen, Lan, Shiyi, Zhang, Hao, Zhao, Yilin, Radhakrishnan, Subhashree, Chang, Nadine, Sapra, Karan, Deshmukh, Amala Sanjay, Rintamaki, Tuomas, Le, Matthieu, Karmanov, Ilia, Voegtle, Lukas, Fischer, Philipp, Huang, De-An, Roman, Timo, Lu, Tong, Alvarez, Jose M., Catanzaro, Bryan, Kautz, Jan, Tao, Andrew, Liu, Guilin, Yu, Zhiding |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2501.14818 |
Similar Items
Eagle: Exploring The Design Space for Multimodal LLMs with Mixture of Encoders
by: Shi, Min, et al.
Published: (2024)
Éclair -- Extracting Content and Layout with Integrated Reading Order for Documents
by: Karmanov, Ilia, et al.
Published: (2025)
Eagle 2.5: Boosting Long-Context Post-Training for Frontier Vision-Language Models
by: Chen, Guo, et al.
Published: (2025)
Stateful Token Reduction for Long-Video Hybrid VLMs
by: Jiang, Jindong, et al.
Published: (2026)
FRAG: Frame Selection Augmented Generation for Long Video and Long Document Understanding
by: Huang, De-An, et al.
Published: (2025)
LITA: Language Instructed Temporal-Localization Assistant
by: Huang, De-An, et al.
Published: (2024)
Hydra-MDP: End-to-end Multimodal Planning with Multi-target Hydra-Distillation
by: Li, Zhenxin, et al.
Published: (2024)
FaceXBench: Evaluating Multimodal LLMs on Face Understanding
by: Narayan, Kartik, et al.
Published: (2025)
Certainty and Uncertainty Guided Active Domain Adaptation
by: Safaei, Bardia, et al.
Published: (2025)
SegFace: Face Segmentation of Long-Tail Classes
by: Narayan, Kartik, et al.
Published: (2024)
AIDE: Agentically Improve Visual Language Model with Domain Experts
by: Chiu, Ming-Chang, et al.
Published: (2025)
LocateAnything: Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding
by: Wang, Shihao, et al.
Published: (2026)
What is Point Supervision Worth in Video Instance Segmentation?
by: Huang, Shuaiyi, et al.
Published: (2024)
Is Ego Status All You Need for Open-Loop End-to-End Autonomous Driving?
by: Li, Zhiqi, et al.
Published: (2023)
FaceXFormer: A Unified Transformer for Facial Analysis
by: Narayan, Kartik, et al.
Published: (2024)
Nemotron-Research-Tool-N1: Exploring Tool-Using Language Models with Reinforced Reasoning
by: Zhang, Shaokun, et al.
Published: (2025)
OMCAT: Omni Context Aware Transformer
by: Goel, Arushi, et al.
Published: (2024)
PhyCritic: Multimodal Critic Models for Physical AI
by: Xiong, Tianyi, et al.
Published: (2026)
StreamChat: Chatting with Streaming Video
by: Liu, Jihao, et al.
Published: (2024)
ImagineMap: Enhanced HD Map Construction with SD Maps
by: Ji, Yishen, et al.
Published: (2024)
VideoITG: Multimodal Video Understanding with Instructed Temporal Grounding
by: Wang, Shihao, et al.
Published: (2025)
Argus: Vision-Centric Reasoning with Grounded Chain-of-Thought
by: Man, Yunze, et al.
Published: (2025)
Zero-Shot Scene Understanding for Automatic Target Recognition Using Large Vision-Language Models
by: Ranasinghe, Yasiru, et al.
Published: (2025)
Mensch - Maske - Tier. Zu den Entstehungsbedingungen der Karikatur
by: Simone Voegtle
Published: (2017)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
by: Wang, Shihao, et al.
Published: (2025)
OmniDrive: A Holistic Vision-Language Dataset for Autonomous Driving with Counterfactual Reasoning
by: Wang, Shihao, et al.
Published: (2024)
NVIDIA Nemotron Parse 1.1
by: Chumachenko, Kateryna, et al.
Published: (2025)
PARC: A Quantitative Framework Uncovering the Symmetries within Vision Language Models
by: Schmalfuss, Jenny, et al.
Published: (2025)
Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
by: Chen, Guo, et al.
Published: (2026)
Hydra-NeXt: Robust Closed-Loop Driving with Open-Loop Training
by: Li, Zhenxin, et al.
Published: (2025)
NVLM: Open Frontier-Class Multimodal LLMs
by: Dai, Wenliang, et al.
Published: (2024)
PosSAM: Panoptic Open-vocabulary Segment Anything
by: VS, Vibashan, et al.
Published: (2024)
LocateAnything3D: Vision-Language 3D Detection with Chain-of-Sight
by: Man, Yunze, et al.
Published: (2025)
3D-Layout-R1: Structured Reasoning for Language-Instructed Spatial Editing
by: Zhen, Haoyu, et al.
Published: (2026)
StereoDETR: Stereo-based Transformer for 3D Object Detection
by: Mu, Shiyi, et al.
Published: (2025)
Old age, high risk medication, polypharmacy: a ‘trilogy’ of risks in older patients with atrial fibrillation
by: Yishen WANG
Published: (2016)
Image-to-Text for Medical Reports Using Adaptive Co-Attention and Triple-LSTM Module
by: Liu, Yishen
Published: (2025)
Developing an Ontology for AI Act Fundamental Rights Impact Assessments
by: Rintamaki, Tytti, et al.
Published: (2024)
Towards An Automated AI Act FRIA Tool That Can Reuse GDPR's DPIA
by: Rintamaki, Tytti, et al.
Published: (2024)
Slow-Fast Architecture for Video Multi-Modal Large Language Models
by: Shi, Min, et al.
Published: (2025)