| Main Authors: | Vasu, Pavan Kumar Anasosalu, Faghri, Fartash, Li, Chun-Liang, Koc, Cem, True, Nate, Antony, Albert, Santhanam, Gokul, Gabriel, James, Grasch, Peter, Tuzel, Oncel, Pouransari, Hadi |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2412.13303 |
Similar Items
CLIP with Quality Captions: A Strong Pretraining for Vision Tasks
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2024)
MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2023)
MobileCLIP2: Improving Multi-Modal Reinforced Training
by: Faghri, Fartash, et al.
Published: (2025)
VSAS-Bench: Real-Time Evaluation of Visual Streaming Assistant Models
by: Vasu, Pavan Kumar Anasosalu, et al.
Published: (2026)
SAM-CLIP: Merging Vision Foundation Models towards Semantic and Spatial Understanding
by: Wang, Haoxiang, et al.
Published: (2023)
FocalLens: Instruction Tuning Enables Zero-Shot Conditional Image Representations
by: Hsieh, Cheng-Yu, et al.
Published: (2025)
Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum
by: Pouransari, Hadi, et al.
Published: (2024)
Knowledge Transfer from Vision Foundation Models for Efficient Training of Small Task-specific Models
by: Vemulapalli, Raviteja, et al.
Published: (2023)
MUSCLE: A Model Update Strategy for Compatible LLM Evolution
by: Echterhoff, Jessica, et al.
Published: (2024)
Proxy-FDA: Proxy-based Feature Distribution Alignment for Fine-tuning Vision Foundation Models without Forgetting
by: Huang, Chen, et al.
Published: (2025)
AMUSE: Audio-Visual Benchmark and Alignment Framework for Agentic Multi-Speaker Understanding
by: Chowdhury, Sanjoy, et al.
Published: (2025)
TiC-CLIP: Continual Training of CLIP Models
by: Garg, Saurabh, et al.
Published: (2023)
Learning from Self Critique and Refinement for Faithful LLM Summarization
by: Hu, Ting-Yao, et al.
Published: (2025)
TiC-LM: A Web-Scale Benchmark for Time-Continual LLM Pretraining
by: Li, Jeffrey, et al.
Published: (2025)
Pretraining with hierarchical memories: separating long-tail and common knowledge
by: Pouransari, Hadi, et al.
Published: (2025)
FastVLM: Self-Speculative Decoding for Fast Vision-Language Model Inference
by: Bajpai, Divya Jyoti, et al.
Published: (2025)
CatLIP: CLIP-level Visual Recognition Accuracy with 2.7x Faster Pre-training on Web-scale Image-Text Data
by: Mehta, Sachin, et al.
Published: (2024)
Graph-Based Captioning: Enhancing Visual Descriptions by Interconnecting Region Captions
by: Hsieh, Yu-Guan, et al.
Published: (2024)
Learning to Reason for Hallucination Span Detection
by: Su, Hsuan, et al.
Published: (2025)
Beyond a Single Extractor: Re-thinking HTML-to-Text Extraction for LLM Pretraining
by: Li, Jeffrey, et al.
Published: (2026)
RayRoPE: Projective Ray Positional Encoding for Multi-view Attention
by: Wu, Yu, et al.
Published: (2026)
Mutual Reinforcement of LLM Dialogue Synthesis and Summarization Capabilities for Few-Shot Dialogue Summarization
by: Lu, Yen-Ju, et al.
Published: (2025)
LiTo: Surface Light Field Tokenization
by: Chang, Jen-Hao Rick, et al.
Published: (2026)
GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
by: Mirzadeh, Iman, et al.
Published: (2024)
SD-VLM: Spatial Measuring and Understanding with Depth-Encoded Vision-Language Models
by: Chen, Pingyi, et al.
Published: (2025)
SpecVLM: Fast Speculative Decoding in Vision-Language Models
by: Huang, Haiduo, et al.
Published: (2025)
Computational Bottlenecks of Training Small-scale Large Language Models
by: Ashkboos, Saleh, et al.
Published: (2024)
Co‐Agent Assisted Peroxide Vulcanization of Halogen‐Free Flame Retardant EPDM Compounds for Cable Sheathing
by: Gürcan Gül, et al.
Published: (2025)
Local-to-Global Logical Explanations for Deep Vision Models
by: Vasu, Bhavan, et al.
Published: (2026)
Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity
by: Joudaki, Amir, et al.
Published: (2025)
Scaling Smart: Accelerating Large Language Model Pre-training with Small Model Initialization
by: Samragh, Mohammad, et al.
Published: (2024)
Description of a new species of bat, Vespertilio longicrus, from Puget Sound
by: True, Frederick W.
Published: (1887)
Presentación
by: Tirza True Latimer
Published: (2013)
TrajTok: Learning Trajectory Tokens enables better Video Understanding
by: Zheng, Chenhao, et al.
Published: (2026)
Velox: Learning Representations of 4D Geometry and Appearance
by: Malik, Anagh, et al.
Published: (2026)
El uso de las redes sociales y la cultura popular para una mejor comprensión intercultural
by: Sait Tuzel
Published: (2017)
MobileVLM : A Fast, Strong and Open Vision Language Assistant for Mobile Devices
by: Chu, Xiangxiang, et al.
Published: (2023)
EO-VLM: VLM-Guided Energy Overload Attacks on Vision Models
by: Seo, Minjae, et al.
Published: (2025)
A Simple and Fast $(3+\varepsilon)$-approximation for Constrained Correlation Clustering
by: Veldt, Nate
Published: (2025)
Adapting Vision-Language Models for E-commerce Understanding at Scale
by: Nulli, Matteo, et al.
Published: (2026)