Bibliographic Details
Main Authors: Singh, Abhineet, Rozeboom, Justin, Ray, Nilanjan
Format: Preprint
Published: 2026
Subjects: Computer Vision and Pattern Recognition
Online Access: https://arxiv.org/abs/2602.21627
author Singh, Abhineet
Rozeboom, Justin
Ray, Nilanjan
contents This paper presents a new unified approach to semantic segmentation in both images and videos by using language modeling to output the masks as sequences of discrete tokens. We use run length encoding (RLE) to discretize the segmentation masks, and adapt the Pix2Seq framework to learn autoregressive models to output these tokens. We propose novel tokenization strategies to compress the lengths of the token sequences to make it practicable to extend this approach to videos. We also show how instance information can be incorporated into the tokenization process to perform panoptic segmentation. We evaluate our models on two domain-specific datasets to demonstrate their competitiveness with the state of the art in certain scenarios, in spite of being severely bottlenecked by our limited computational resources. We supplement these analyses by proposing several promising approaches to foster future competitiveness in general-purpose applications, and facilitate this by making our code and models publicly available.
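For readers unfamiliar with the idea, the following is a minimal sketch of how a binary segmentation mask can be turned into a short sequence of integers with run length encoding. It is a generic illustration only: the (start, length) run format, the row-major flattening, and the function names rle_encode, rle_decode, and to_tokens are assumptions made for this example, and the record does not specify the tokenization schemes actually proposed in the paper.

```python
from typing import List, Tuple

Mask = List[List[int]]          # rows of 0/1 pixels
Run = Tuple[int, int]           # (start index, run length); an assumed format


def rle_encode(mask: Mask) -> List[Run]:
    """Encode foreground runs of a binary mask over its row-major flattening."""
    flat = [pixel for row in mask for pixel in row]
    runs: List[Run] = []
    start = None                # start index of the run currently being read
    for i, pixel in enumerate(flat):
        if pixel and start is None:
            start = i           # a new foreground run begins here
        elif not pixel and start is not None:
            runs.append((start, i - start))
            start = None
    if start is not None:       # close a run that reaches the last pixel
        runs.append((start, len(flat) - start))
    return runs


def rle_decode(runs: List[Run], height: int, width: int) -> Mask:
    """Rebuild the binary mask from its (start, length) runs."""
    flat = [0] * (height * width)
    for start, length in runs:
        for i in range(start, start + length):
            flat[i] = 1
    return [flat[r * width:(r + 1) * width] for r in range(height)]


def to_tokens(runs: List[Run]) -> List[int]:
    """Flatten runs into one integer sequence, as a sequence model would see it."""
    return [value for run in runs for value in run]


mask = [
    [0, 1, 1, 0],
    [0, 0, 1, 1],
    [1, 0, 0, 0],
]
runs = rle_encode(mask)
tokens = to_tokens(runs)
restored = rle_decode(runs, height=3, width=4)
```

In this sketch a 12-pixel mask becomes four integers, and the second run spans a row boundary because the mask is flattened before encoding. The sequence length grows with the number of runs rather than the number of pixels, which is the property that makes such encodings attractive for autoregressive models.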
format Preprint
id arxiv_https___arxiv_org_abs_2602_21627
institution arXiv
publishDate 2026
record_format arxiv
title Tokenizing Semantic Segmentation with Run Length Encoding
topic Computer Vision and Pattern Recognition
url https://arxiv.org/abs/2602.21627