[Read Paper] Systolic Arrays for VLSI

Published: by Creative Commons Licence (Last updated: )

Systolic Arrays for VLSI

This paper is also published as Algorithms for VLSI Processor Arrays at introduction to VLSI systems. It proposes new multiprocessor structures and parallel algorithms for processing some basic matrix computations which are capable of pipelining matrix computations with optimal speed-up.

The pipelining aspect of the algorithms is most effective for band matrices with long bands.

By performance, this paper holds the view that time spent in I/O, control and data movement as well as arithmetic must all be considered.

The Basic Components and Array Structures

The Inner Product Step Processor

The single operation considered is inner product step $R_C ← R_C + R_A × R_B$. But we can implementation other kinds of "multiplication" or "add".

This processor should makes the input values for $R_A$, $R_B$, and $R_C$, respectively; computes the inner product $R_C ← R_C + R_A × R_B$; and makes the inputs values for $R_A$, $R_B$, together with the new value of $R_C$ available as outputs on the output lines.

Here are two types of the inner product step processor.

Geometries for the  inner product step processor

Processor Arrays

The basic network organization we shall adopt for processors is mesh-connected and all connections from a processor are to neighboring processors.

Processor Arrays

Processors lying on the boundary of the processor array may have external connections to the host memory.

Matrix-Vector Multiplication

Matrix-Vector Multiplication

Linearly connected network for the Matrix-Vector Multiplication

$w = p + q - 1$.

The utilization ratio is less than $50\%$ with $O(2n + w)$ time units, compared to $O(wn)$ time for the sequential algorithm on a uniprocessor.

Matrix Multiplication on a Hexagonal Array

Matrix Multiplication

Matrix Multiplication on a Hexagonal Array

The utilization ratio is less than $1\over3$ with $O(3n + min(w_1, w_2))$ time units.

The LU-Decomposition of a Matrix on a Hexagonal Array

The LU-Decomposition of a Matrix

The LU-Decomposition of a Matrix on a Hexagonal Array

It can compute in $O(3n + min(p, q))$ time. If $A$ is an $n × n$ dense matrix, then the computation time is $4n$.

Triangular Linear Systems

Triangular Linear Systems

Triangular Linear Systems in linearly connected network