Staff View: High-Performance Tensor Contraction without Transposition :: Library Catalog

Bibliographic Details
Main Author:	Matthews, Devin A.
Format:	Preprint
Published:	2016
Subjects:	Mathematical Software; Distributed, Parallel, and Cluster Computing; Performance; 15A69; G.4
Online Access:	https://arxiv.org/abs/1607.00291

_version_	1866913754850000896
author	Matthews, Devin A.
author_facet	Matthews, Devin A.
contents	Tensor computations--in particular tensor contraction (TC)--are important kernels in many scientific computing applications. Due to the fundamental similarity of TC to matrix multiplication (MM) and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the flexible BLIS framework, which allows for transposition (reshaping) of the tensor to be fused with internal partitioning and packing operations, requiring no explicit transposition operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of MM, and in some cases considerably higher than that of traditional TC. Our implementation supports multithreading using an approach identical to that used for MM in BLIS, with similar performance characteristics. The complexity of managing tensor-to-matrix transformations is also handled automatically in our approach, greatly simplifying its use in scientific applications.
format	Preprint
id	arxiv_https___arxiv_org_abs_1607_00291
institution	arXiv
publishDate	2016
record_format	arxiv
title	High-Performance Tensor Contraction without Transposition
topic	Mathematical Software; Distributed, Parallel, and Cluster Computing; Performance; 15A69; G.4
url	https://arxiv.org/abs/1607.00291
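
The abstract in the contents field contrasts two ways of computing a tensor contraction: the traditional route, which explicitly transposes the tensor into matrix layout and then calls a matrix multiplication, and a route that reads the tensor in place. The sketch below illustrates only that contrast, for the example contraction C[a,b,c] = sum over k of A[a,k,c] * B[k,b]. It is not TBLIS or BLIS code: the function names, the flat row-major list layout, and the naive loops are assumptions made for illustration, and the BLIS-style partitioning and packing the abstract describes are not modeled.

```python
from itertools import product


def contract_direct(A, B, na, nb, nc, nk):
    """C[a,b,c] = sum_k A[a,k,c] * B[k,b], indexing A in place (flat, row-major)."""
    C = [0.0] * (na * nb * nc)
    for a, b, c in product(range(na), range(nb), range(nc)):
        acc = 0.0
        for k in range(nk):
            # A is addressed through its original strides; no copy is made.
            acc += A[(a * nk + k) * nc + c] * B[k * nb + b]
        C[(a * nb + b) * nc + c] = acc
    return C


def contract_via_matmul(A, B, na, nb, nc, nk):
    """Same contraction by the traditional route: transpose, multiply, transpose."""
    # Step 1: explicit transposition A[a,k,c] -> At[(a,c), k] (extra workspace).
    At = [0.0] * (na * nc * nk)
    for a, k, c in product(range(na), range(nk), range(nc)):
        At[(a * nc + c) * nk + k] = A[(a * nk + k) * nc + c]

    # Step 2: ordinary matrix multiplication Ct[(a,c), b] = At[(a,c), k] * B[k, b].
    m = na * nc
    Ct = [0.0] * (m * nb)
    for i, j in product(range(m), range(nb)):
        Ct[i * nb + j] = sum(At[i * nk + k] * B[k * nb + j] for k in range(nk))

    # Step 3: transpose the result Ct[(a,c), b] back to C[a,b,c] (more workspace).
    C = [0.0] * (na * nb * nc)
    for a, c, b in product(range(na), range(nc), range(nb)):
        C[(a * nb + b) * nc + c] = Ct[(a * nc + c) * nb + b]
    return C


if __name__ == "__main__":
    na, nb, nc, nk = 2, 3, 2, 4
    A = [float(i) for i in range(na * nk * nc)]
    B = [float(i % 5) for i in range(nk * nb)]
    same = contract_direct(A, B, na, nb, nc, nk) == contract_via_matmul(A, B, na, nb, nc, nk)
    print("results agree:", same)
```

Both functions return the same values; the difference is that the second allocates two temporary buffers (At and Ct) and makes two extra passes over the data, which is the storage and performance overhead the abstract attributes to the BLAS-based approach.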

Similar Items