| Main Authors: | Vani, Ankit; Nguyen, Bac; Lavoie, Samuel; Krishna, Ranjay; Courville, Aaron |
|---|---|
| Format: | Preprint |
| Published: | 2024 |
| Subjects: | |
| Online Access: | https://arxiv.org/abs/2404.15721 |
Similar Items
Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
by: Lavoie, Samuel, et al.
Published: (2025)
Modeling Caption Diversity in Contrastive Vision-Language Pretraining
by: Lavoie, Samuel, et al.
Published: (2024)
BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding
by: Nguyen, Xuan-Bac, et al.
Published: (2025)
Selective Visual Representations Improve Convergence and Generalization for Embodied AI
by: Eftekhar, Ainaz, et al.
Published: (2023)
Weierstrass Positional Encoding for Vision Transformers
by: Xin, Zhihang, et al.
Published: (2026)
Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
by: Fan, Xiang, et al.
Published: (2024)
PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning
by: Durairaju, Raahul Krishna, et al.
Published: (2025)
Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
by: Huang, Weikai, et al.
Published: (2025)
VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
by: He, Qijia, et al.
Published: (2026)
You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
by: Yang, Yinuo, et al.
Published: (2026)
The Linear Attention Resurrection in Vision Transformer
by: Zheng, Chuanyang
Published: (2025)
Spiking Vision Transformer with Saccadic Attention
by: Wang, Shuai, et al.
Published: (2025)
RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
by: Yuan, Wentao, et al.
Published: (2024)
A 2D Semantic-Aware Position Encoding for Vision Transformers
by: Chen, Xi, et al.
Published: (2025)
Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models
by: Che, Liwei, et al.
Published: (2026)
Attention Retention for Continual Learning with Vision Transformers
by: Lu, Yue, et al.
Published: (2026)
SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
by: Nguyen, Bac, et al.
Published: (2024)
ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
by: Zhang, Jieyu, et al.
Published: (2024)
When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
by: Ikezogwo, Wisdom, et al.
Published: (2026)
Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention
by: Leem, Saebom, et al.
Published: (2024)
Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders
by: Koch, Paul, et al.
Published: (2025)
GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
by: Kamath, Amita, et al.
Published: (2025)
Mechanisms of Non-Monotonic Scaling in Vision Transformers
by: Kumar, Anantha Padmanaban Krishna
Published: (2025)
Dissecting Query-Key Interaction in Vision Transformers
by: Pan, Xu, et al.
Published: (2024)
PolaFormer: Polarity-aware Linear Attention for Vision Transformers
by: Meng, Weikang, et al.
Published: (2025)
Sensitive Image Classification by Vision Transformers
by: He, Hanxian, et al.
Published: (2024)
Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
by: Gao, Ziqi, et al.
Published: (2024)
Explain Before You Answer: A Survey on Compositional Visual Reasoning
by: Ke, Fucai, et al.
Published: (2025)
Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
by: Seyfioglu, Mehmet Saygin, et al.
Published: (2023)
LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction
by: Yan, Yixin, et al.
Published: (2025)
Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning
by: Devoto, Alessio, et al.
Published: (2024)
ConstStyle: Robust Domain Generalization with Unified Style Transformation
by: Tran, Nam Duong, et al.
Published: (2025)
GenRL: Multimodal-foundation world models for generalization in embodied agents
by: Mazzaglia, Pietro, et al.
Published: (2024)
Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
by: Nguyen, Bac, et al.
Published: (2026)
Towards Robust Vision Transformer via Masked Adaptive Ensemble
by: Lin, Fudong, et al.
Published: (2024)
Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer
by: Liao, Yi, et al.
Published: (2025)
ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages
by: Qian, Zhoujie
Published: (2025)
Symbolic Rule Extraction from Attention-Guided Sparse Representations in Vision Transformers
by: Padalkar, Parth, et al.
Published: (2025)
MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
by: Hua, Wei, et al.
Published: (2025)
Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers
by: Knights, Ethan
Published: (2026)