| Main Author: | Fixelle, Joshua |
|---|---|
| Format: | Preprint |
| Published: | 2025 |
| Online Access: | https://arxiv.org/abs/2504.08710 |
Similar Items
More than the Sum: Panorama-Language Models for Adverse Omni-Scenes
by: Fan, Weijia, et al.
Published: (2026)
More than a Moment: Towards Coherent Sequences of Audio Descriptions
by: Khandelwal, Eshika, et al.
Published: (2025)
Vision Transformers Need More Than Registers
by: Shi, Cheng, et al.
Published: (2026)
Opinion: Learning Intuitive Physics May Require More than Visual Data
by: Su, Ellen, et al.
Published: (2025)
More than the Sum of Its Parts: Ensembling Backbone Networks for Few-Shot Segmentation
by: Catalano, Nico, et al.
Published: (2024)
Nearly Solved? Robust Deepfake Detection Requires More than Visual Forensics
by: Levy, Guy, et al.
Published: (2024)
More than Segmentation: Benchmarking SAM 3 for Segmentation, 3D Perception, and Reconstruction in Robotic Surgery
by: Dong, Wenzhen, et al.
Published: (2025)
There is More to Attention: Statistical Filtering Enhances Explanations in Vision Transformers
by: Ayyar, Meghna P, et al.
Published: (2025)
More than One Step at a Time: Designing Procedural Feedback for Non-visual Makeup Routines
by: Li, Franklin Mingzhe, et al.
Published: (2025)
Nodes Are Early, Edges Are Late: Probing Diagram Representations in Large Vision-Language Models
by: Yoshida, Haruto, et al.
Published: (2026)
More than Memes: A Multimodal Topic Modeling Approach to Conspiracy Theories on Telegram
by: Steffen, Elisabeth
Published: (2024)
AdaNCA: Neural Cellular Automata As Adaptors For More Robust Vision Transformer
by: Xu, Yitao, et al.
Published: (2024)
SVD-ViT: Does SVD Make Vision Transformers Attend More to the Foreground?
by: Murata, Haruhiko, et al.
Published: (2026)
More Images, More Problems? A Controlled Analysis of VLM Failure Modes
by: Das, Anurag, et al.
Published: (2026)
Not (yet) the whole story: Evaluating Visual Storytelling Requires More than Measuring Coherence, Grounding, and Repetition
by: Surikuchi, Aditya K, et al.
Published: (2024)
Representation Alignment for Just Image Transformers is not Easier than You Think
by: Shin, Jaeyo, et al.
Published: (2026)
From Edges to Depth: Probing the Spatial Hierarchy in Vision Transformers
by: Sanghavi, Jainum
Published: (2026)
Towards Efficient Vision-Language Tuning: More Information Density, More Generalizability
by: Hao, Tianxiang, et al.
Published: (2023)
More Clear, More Flexible, More Precise: A Comprehensive Oriented Object Detection benchmark for UAV
by: Ye, Kai, et al.
Published: (2025)
Adapted Center and Scale Prediction: More Stable and More Accurate
by: Wang, Wenhao, et al.
Published: (2020)
Do More Details Always Introduce More Hallucinations in LVLM-based Image Captioning?
by: Feng, Mingqian, et al.
Published: (2024)
Alignment and Adversarial Robustness: Are More Human-Like Models More Secure?
by: Hoak, Blaine, et al.
Published: (2025)
Leaner Transformers: More Heads, Less Depth
by: Saratchandran, Hemanth, et al.
Published: (2025)
Less is More: Skim Transformer for Light Field Image Super-resolution
by: Hu, Zeke Zexi, et al.
Published: (2024)
The More You See in 2D, the More You Perceive in 3D
by: Han, Xinyang, et al.
Published: (2024)
Larger than memory image processing
by: Sporring, Jon, et al.
Published: (2026)
Vision-Language Models Generate More Homogeneous Stories for Phenotypically Black Individuals
by: Lee, Messi H. J., et al.
Published: (2024)
ForensicZip: More Tokens are Better but Not Necessary in Forensic Vision-Language Models
by: Lai, Yingxin, et al.
Published: (2026)
Seeing More, Saying More: Lightweight Language Experts are Dynamic Video Token Compressors
by: Wang, Xiangchen, et al.
Published: (2025)
More Pictures Say More: Visual Intersection Network for Open Set Object Detection
by: Dong, Bingcheng, et al.
Published: (2024)
Fine-Tuning Image-Conditional Diffusion Models is Easier than You Think
by: Garcia, Gonzalo Martin, et al.
Published: (2024)
Pruning One More Token is Enough: Leveraging Latency-Workload Non-Linearities for Vision Transformers on the Edge
by: Eliopoulos, Nick John, et al.
Published: (2024)
Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More
by: Wang, Feng, et al.
Published: (2025)
Floating No More: Object-Ground Reconstruction from a Single Image
by: Man, Yunze, et al.
Published: (2024)
Reduce the Artifacts Bias for More Generalizable AI-Generated Image Detection
by: Li, Yiheng, et al.
Published: (2026)
An Image is Worth More Than 16x16 Patches: Exploring Transformers on Individual Pixels
by: Nguyen, Duy-Kien, et al.
Published: (2024)
Less-to-More Generalization: Unlocking More Controllability by In-Context Generation
by: Wu, Shaojin, et al.
Published: (2025)
A Little More Like This: Text-to-Image Retrieval with Vision-Language Models Using Relevance Feedback
by: Khaertdinov, Bulat, et al.
Published: (2025)
Towards More Accurate Personalized Image Generation: Addressing Overfitting and Evaluation Bias
by: Li, Mingxiao, et al.
Published: (2025)
How to Learn More? Exploring Kolmogorov-Arnold Networks for Hyperspectral Image Classification
by: Jamali, Ali, et al.
Published: (2024)