| Main authors: | Huang, Xiaoshuang; Huang, Haifeng; Shen, Lingdong; Yang, Yehui; Shang, Fangxin; Liu, Junwei; Liu, Jia |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Online access: | https://arxiv.org/abs/2406.18146 |
Similar Items
Towards a Multimodal Large Language Model with Pixel-Level Insight for Biomedicine
by: Huang, Xiaoshuang, et al.
Published: (2024)
SegICL: A Multimodal In-context Learning Framework for Enhanced Segmentation in Medical Imaging
by: Shen, Lingdong, et al.
Published: (2024)
SynFundus-1M: A High-quality Million-scale Synthetic fundus images Dataset with Fifteen Types of Annotation
by: Shang, Fangxin, et al.
Published: (2023)
Zero-Shot 3D Visual Grounding from Vision-Language Models
by: Li, Rong, et al.
Published: (2025)
SeeGround: See and Ground for Zero-Shot Open-Vocabulary 3D Visual Grounding
by: Li, Rong, et al.
Published: (2024)
FCMBench: The First Large-scale Financial Credit Multimodal Benchmark for Real-world Applications
by: Yang, Yehui, et al.
Published: (2026)
KiToke: Kernel-based Interval-aware Token Compression for Video Large Language Models
by: Huang, Haifeng, et al.
Published: (2026)
FCMBench-Video: Benchmarking Document Video Intelligence
by: Cui, Runze, et al.
Published: (2026)
Attribute-Grounded Selective Reasoning for Artwork Emotion Understanding with Multimodal Large Language Models
by: Zhang, Cheng, et al.
Published: (2026)
Grounded 3D-LLM with Referent Tokens
by: Chen, Yilun, et al.
Published: (2024)
A Survey on Video Temporal Grounding with Multimodal Large Language Model
by: Wu, Jianlong, et al.
Published: (2025)
Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts
by: Rang, Miao, et al.
Published: (2025)
ChatTracker: Enhancing Visual Tracking Performance via Chatting with Multimodal Large Language Model
by: Sun, Yiming, et al.
Published: (2024)
3EED: Ground Everything Everywhere in 3D
by: Li, Rong, et al.
Published: (2025)
TSCM: A Teacher-Student Model for Vision Place Recognition Using Cross-Metric Knowledge Distillation
by: Shen, Yehui, et al.
Published: (2024)
MeteorPred: A Meteorological Multimodal Large Model and Dataset for Severe Weather Event Prediction
by: Tang, Shuo, et al.
Published: (2025)
Test-Time Computing for Referring Multimodal Large Language Models
by: Wu, Mingrui, et al.
Published: (2026)
Adversarial Robustness for Visual Grounding of Multimodal Large Language Models
by: Gao, Kuofeng, et al.
Published: (2024)
Visual Anchors Are Strong Information Aggregators For Multimodal Large Language Model
by: Liu, Haogeng, et al.
Published: (2024)
Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models
by: Li, You, et al.
Published: (2025)
Robin3D: Improving 3D Large Language Model via Robust Instruction Tuning
by: Kang, Weitai, et al.
Published: (2024)
MedSeg-R: Reasoning Segmentation in Medical Images with Multimodal Large Language Models
by: Huang, Yu, et al.
Published: (2025)
Efficient Temporal Extrapolation of Multimodal Large Language Models with Temporal Grounding Bridge
by: Wang, Yuxuan, et al.
Published: (2024)
Plug-and-Play Grounding of Reasoning in Multimodal Large Language Models
by: Chen, Jiaxing, et al.
Published: (2024)
Ferret-v2: An Improved Baseline for Referring and Grounding with Large Language Models
by: Zhang, Haotian, et al.
Published: (2024)
REVEAL: Reference-Grounded Reasoning for Multimodal Manipulation Detection
by: Zhou, Jun, et al.
Published: (2026)
Grounded Chain-of-Thought for Multimodal Large Language Models
by: Wu, Qiong, et al.
Published: (2025)
SpaceVLLM: Endowing Multimodal Large Language Model with Spatio-Temporal Video Grounding Capability
by: Wang, Jiankang, et al.
Published: (2025)
MMRQA: Signal-Enhanced Multimodal Large Language Models for MRI Quality Assessment
by: Jia, Fankai, et al.
Published: (2025)
GroundVTS: Visual Token Sampling in Multimodal Large Language Models for Video Temporal Grounding
by: Fan, Rong, et al.
Published: (2026)
Grounding Everything in Tokens for Multimodal Large Language Models
by: Ren, Xiangxuan, et al.
Published: (2025)
EndoChat: Grounded Multimodal Large Language Model for Endoscopic Surgery
by: Wang, Guankun, et al.
Published: (2025)
Situat3DChange: Situated 3D Change Understanding Dataset for Multimodal Large Language Model
by: Liu, Ruiping, et al.
Published: (2025)
Chat-Scene: Bridging 3D Scene and Large Language Models with Object Identifiers
by: Huang, Haifeng, et al.
Published: (2023)
F-LMM: Grounding Frozen Large Multimodal Models
by: Wu, Size, et al.
Published: (2024)
Human-MME: A Holistic Evaluation Benchmark for Human-Centric Multimodal Large Language Models
by: Liu, Yuansen, et al.
Published: (2025)
Grounding-IQA: Grounding Multimodal Language Model for Image Quality Assessment
by: Chen, Zheng, et al.
Published: (2024)
Measuring Epistemic Humility in Multimodal Large Language Models
by: Tong, Bingkui, et al.
Published: (2025)
Leveraging Large Language Models for Multimodal Search
by: Barbany, Oriol, et al.
Published: (2024)
Attention-driven GUI Grounding: Leveraging Pretrained Multimodal Large Language Models without Fine-Tuning
by: Xu, Hai-Ming, et al.
Published: (2024)