Staff View: Efficient Spatio-Temporal Vegetation Pixel Classification with Vision Transformers :: Library Catalog

Bibliographic Details
Main Authors:	Gomes, Alan; Gonçalves, Anderson; Santos, Samuel Felipe dos; Alves, Nathan Felipe; de Moura, Magna Soelma Beserra; Alberton, Bruna de Costa; Morellato, Leonor Patricia C.; Torres, Ricardo da Silva; Almeida, Jurandy
Format:	Preprint
Published:	2026
Subjects:	Computer Vision and Pattern Recognition
Online Access:	https://arxiv.org/abs/2605.00296

_version_	1866917452490735616
author	Gomes, Alan; Gonçalves, Anderson; Santos, Samuel Felipe dos; Alves, Nathan Felipe; de Moura, Magna Soelma Beserra; Alberton, Bruna de Costa; Morellato, Leonor Patricia C.; Torres, Ricardo da Silva; Almeida, Jurandy
author_facet	Gomes, Alan; Gonçalves, Anderson; Santos, Samuel Felipe dos; Alves, Nathan Felipe; de Moura, Magna Soelma Beserra; Alberton, Bruna de Costa; Morellato, Leonor Patricia C.; Torres, Ricardo da Silva; Almeida, Jurandy
contents	Plant phenology, the study of recurrent life cycle events, is essential for understanding ecosystem dynamics and their responses to climate change impacts. While Unmanned Aerial Vehicles (UAVs) and near-surface cameras enable high-resolution monitoring, identifying plant species across time remains computationally challenging. State-of-the-art approaches, specifically Multi-Temporal Convolutional Networks (CNNs), rely on rigid multi-branch architectures that scale poorly with longer time series and require large spatial context windows. In this paper, we present an extensive study on optimizing Vision Transformers (ViTs) for efficient spatio-temporal vegetation pixel classification. We conducted a comprehensive ablation study analyzing seven key design dimensions: (i) data normalization; (ii) spectral arrangement; (iii) boundary handling; (iv) spatial context window shape and size; (v) tokenization strategies; (vi) positional encoding; and (vii) feature aggregation strategies. Our method was evaluated on two datasets from the Brazilian Cerrado biome, Serra do Cipó (aerial imagery) and Itirapina (near-surface imagery). Experimental results demonstrate that our ViT approach offers a substantial improvement in computational efficiency while maintaining competitive classification performance. Notably, our ViT reduces Floating Point Operations (FLOPs) by an order of magnitude and maintains constant parameter complexity regardless of the time series length, whereas the CNN baseline scales linearly. Our findings confirm that ViTs are a robust, scalable solution for resource-constrained phenological monitoring systems.
format	Preprint
id	arxiv_https___arxiv_org_abs_2605_00296
institution	arXiv
publishDate	2026
record_format	arxiv
title	Efficient Spatio-Temporal Vegetation Pixel Classification with Vision Transformers
topic	Computer Vision and Pattern Recognition
url	https://arxiv.org/abs/2605.00296

Similar Items