Library Catalog

Bibliographic Details
Main Authors:	Vani, Ankit; Nguyen, Bac; Lavoie, Samuel; Krishna, Ranjay; Courville, Aaron
Format:	Preprint
Published:	2024
Subjects:	Computer Vision and Pattern Recognition; Artificial Intelligence
Online Access:	https://arxiv.org/abs/2404.15721

Similar Items

Compositional Discrete Latent Code for High Fidelity, Productive Diffusion Models
by: Lavoie, Samuel, et al.
Published: (2025)

Modeling Caption Diversity in Contrastive Vision-Language Pretraining
by: Lavoie, Samuel, et al.
Published: (2024)

BRAIN: Bias-Mitigation Continual Learning Approach to Vision-Brain Understanding
by: Nguyen, Xuan-Bac, et al.
Published: (2025)

Selective Visual Representations Improve Convergence and Generalization for Embodied AI
by: Eftekhar, Ainaz, et al.
Published: (2023)

Weierstrass Positional Encoding for Vision Transformers
by: Xin, Zhihang, et al.
Published: (2026)

Videoshop: Localized Semantic Video Editing with Noise-Extrapolated Diffusion Inversion
by: Fan, Xiang, et al.
Published: (2024)

PyramidStyler: Transformer-Based Neural Style Transfer with Pyramidal Positional Encoding and Reinforcement Learning
by: Durairaju, Raahul Krishna, et al.
Published: (2025)

Synthetic Object Compositions for Scalable and Accurate Learning in Detection, Segmentation, and Grounding
by: Huang, Weikai, et al.
Published: (2025)

VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
by: He, Qijia, et al.
Published: (2026)

You Only Judge Once: Multi-response Reward Modeling in a Single Forward Pass
by: Yang, Yinuo, et al.
Published: (2026)

The Linear Attention Resurrection in Vision Transformer
by: Zheng, Chuanyang
Published: (2025)

Spiking Vision Transformer with Saccadic Attention
by: Wang, Shuai, et al.
Published: (2025)

RoboPoint: A Vision-Language Model for Spatial Affordance Prediction for Robotics
by: Yuan, Wentao, et al.
Published: (2024)

A 2D Semantic-Aware Position Encoding for Vision Transformers
by: Chen, Xi, et al.
Published: (2025)

Counting Circuits: Mechanistic Interpretability of Visual Reasoning in Large Vision-Language Models
by: Che, Liwei, et al.
Published: (2026)

Attention Retention for Continual Learning with Vision Transformers
by: Lu, Yue, et al.
Published: (2026)

SAFT: Towards Out-of-Distribution Generalization in Fine-Tuning
by: Nguyen, Bac, et al.
Published: (2024)

ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
by: Zhang, Jieyu, et al.
Published: (2024)

When Rubrics Fail: Error Enumeration as Reward in Reference-Free RL Post-Training for Virtual Try-On
by: Ikezogwo, Wisdom, et al.
Published: (2026)

Attention Guided CAM: Visual Explanations of Vision Transformer Guided by Self-Attention
by: Leem, Saebom, et al.
Published: (2024)

Vanishing Depth: A Depth Adapter with Positional Depth Encoding for Generalized Image Encoders
by: Koch, Paul, et al.
Published: (2025)

GenEval 2: Addressing Benchmark Drift in Text-to-Image Evaluation
by: Kamath, Amita, et al.
Published: (2025)

Mechanisms of Non-Monotonic Scaling in Vision Transformers
by: Kumar, Anantha Padmanaban Krishna
Published: (2025)

Dissecting Query-Key Interaction in Vision Transformers
by: Pan, Xu, et al.
Published: (2024)

PolaFormer: Polarity-aware Linear Attention for Vision Transformers
by: Meng, Weikang, et al.
Published: (2025)

Sensitive Image Classification by Vision Transformers
by: He, Hanxian, et al.
Published: (2024)

Generate Any Scene: Scene Graph Driven Data Synthesis for Visual Generation Training
by: Gao, Ziqi, et al.
Published: (2024)

Explain Before You Answer: A Survey on Compositional Visual Reasoning
by: Ke, Fucai, et al.
Published: (2025)

Quilt-LLaVA: Visual Instruction Tuning by Extracting Localized Narratives from Open-Source Histopathology Videos
by: Seyfioglu, Mehmet Saygin, et al.
Published: (2023)

LTMSformer: A Local Trend-Aware Attention and Motion State Encoding Transformer for Multi-Agent Trajectory Prediction
by: Yan, Yixin, et al.
Published: (2025)

Adaptive Layer Selection for Efficient Vision Transformer Fine-Tuning
by: Devoto, Alessio, et al.
Published: (2024)

ConstStyle: Robust Domain Generalization with Unified Style Transformation
by: Tran, Nam Duong, et al.
Published: (2025)

GenRL: Multimodal-foundation world models for generalization in embodied agents
by: Mazzaglia, Pietro, et al.
Published: (2024)

Improved Object-Centric Diffusion Learning with Registers and Contrastive Alignment
by: Nguyen, Bac, et al.
Published: (2026)

Towards Robust Vision Transformer via Masked Adaptive Ensemble
by: Lin, Fudong, et al.
Published: (2024)

Dynamic Accumulated Attention Map for Interpreting Evolution of Decision-Making in Vision Transformer
by: Liao, Yi, et al.
Published: (2025)

ECViT: Efficient Convolutional Vision Transformer with Local-Attention and Multi-scale Stages
by: Qian, Zhoujie
Published: (2025)

Symbolic Rule Extraction from Attention-Guided Sparse Representations in Vision Transformers
by: Padalkar, Parth, et al.
Published: (2025)

MSVIT: Improving Spiking Vision Transformer Using Multi-scale Attention Fusion
by: Hua, Wei, et al.
Published: (2025)

Cognitive Alignment At No Cost: Inducing Human Attention Biases For Interpretable Vision Transformers
by: Knights, Ethan
Published: (2026)