Staff View: Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer :: Library Catalog

Bibliographic Details
Main Authors:	Zhang, Yifan; Bi, Wei; Zhang, Kechi; Jin, Dongming; Fu, Jie; Jin, Zhi
Format:	Preprint
Published:	2026
Subjects:	Machine Learning; Computation and Language
Online Access:	https://arxiv.org/abs/2601.05770

_version_	1866913171503054848
author	Zhang, Yifan; Bi, Wei; Zhang, Kechi; Jin, Dongming; Fu, Jie; Jin, Zhi
author_facet	Zhang, Yifan; Bi, Wei; Zhang, Kechi; Jin, Dongming; Fu, Jie; Jin, Zhi
contents	Algorithm extraction aims to synthesize executable programs directly from models trained on algorithmic tasks, enabling de novo recovery of executable mechanisms from weights without relying on human-written target programs. However, applying this paradigm to Transformers is complicated by representation entanglement (e.g., superposition), where features encoded in overlapping directions substantially hinder the recovery of symbolic expressions. We propose the Discrete Transformer, an architecture explicitly designed to bridge the gap between continuous representations and discrete symbolic logic. By injecting discreteness through temperature-annealed sampling, our framework leverages hypothesis testing and symbolic regression to extract human-readable programs. Empirically, the Discrete Transformer achieves performance comparable to the RNN-based MIPS baseline on shared discrete tasks, while broadening extraction to tasks with continuous-valued intermediate computations. Finally, we show that architectural inductive biases provide fine-grained control over synthesized programs, establishing the Discrete Transformer as a controllable testbed for algorithm extraction and Transformer interpretability.
format	Preprint
id	arxiv_https___arxiv_org_abs_2601_05770
institution	arXiv
publishDate	2026
record_format	arxiv
title	Weights to Code: Extracting Interpretable Algorithms from the Discrete Transformer
topic	Machine Learning; Computation and Language
url	https://arxiv.org/abs/2601.05770
