[Read Paper] EIE: Efficient Inference Engine on Compressed Deep Neural Network



Efficient inference engine that works on the compressed deep neural network model

Exploit the Sparsity of Weights and Activations with the Compressed Sparse Column (CSC) Format

For each column $W_j$ of the matrix $W$ we store a vector $v$ that contains the non-zero weights, and a second, equal-length vector $z$ that encodes the number of zeros before the corresponding entry in $v$. Each entry of $v$ and $z$ is represented by a four-bit value. If more than 15 zeros appear before a non-zero entry, we add a padding zero in vector $v$.
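A minimal Python sketch of this relative-index encoding (the function name and the plain-list representation are mine, not from the paper):

```python
def encode_column_csc4(column):
    """Encode one weight column as (v, z) with 4-bit relative zero counts.

    v holds the retained weights and z the number of zeros skipped before
    each entry; a run longer than 15 zeros is broken up by storing a
    padding zero in v with z = 15."""
    v, z = [], []
    zeros = 0
    for w in column:
        if w == 0:
            zeros += 1
            if zeros == 16:          # cannot encode more than 15 in 4 bits
                v.append(0)          # padding zero entry
                z.append(15)
                zeros = 0
        else:
            v.append(w)
            z.append(zeros)
            zeros = 0
    return v, z

# Example: 20 zeros followed by a 3 become v = [0, 3], z = [15, 4].
# The padding zero sits at row 15; the running sum 15 + 1 + 4 = 20
# recovers the original row index of the 3.
```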

Memory layout for the relative-indexed, interleaved CSC format

Parallelizing Compressed DNN

We perform the sparse matrix $\times$ sparse vector operation by scanning vector $a$ to find its next non-zero value $a_j$ and broadcasting $a_j$, along with its index $j$, to all PEs. Each PE then multiplies $a_j$ by the non-zero elements in its portion of column $W_j$, accumulating the partial sums in accumulators for each element of the output activation vector $b$. In the CSC representation, these non-zero weights are stored contiguously, so each PE simply walks through its $v$ array from location $p_j$ to $p_{j+1}-1$ to load the weights. To address the output accumulators, the row number $i$ corresponding to each weight $W_{ij}$ is generated by keeping a running sum of the entries of the $z$ array (adding one for each entry consumed, since each stored weight also occupies a row).

The interleaved CSC representation allows each PE to quickly find the non-zeros in each column to be multiplied by $a_j$. This organization also keeps all of the computation except for the broadcast of the input activations local to a PE.
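A behavioural sketch of this scheme in Python/NumPy, assuming a small illustrative PE count; `build_interleaved_csc` and `sparse_matvec` are hypothetical names, and the 4-bit padding from the encoding sketch above is omitted for brevity:

```python
import numpy as np

NUM_PES = 4  # illustrative; the real accelerator uses many more PEs

def build_interleaved_csc(W, num_pes=NUM_PES):
    """Distribute W row-interleaved over PEs (row i goes to PE i % num_pes)
    and store each PE's slice of every column as (v, z) with column
    pointers p, where p[j+1] marks the end of column j."""
    pes = []
    for k in range(num_pes):
        rows = W[k::num_pes, :]              # this PE's rows, in local order
        v, z, p = [], [], [0]
        for j in range(rows.shape[1]):
            zeros = 0
            for w in rows[:, j]:
                if w == 0:
                    zeros += 1
                else:
                    v.append(w)
                    z.append(zeros)
                    zeros = 0
            p.append(len(v))
        pes.append((v, z, p, rows.shape[0]))
    return pes

def sparse_matvec(W, a, num_pes=NUM_PES):
    """Compute b = W @ a the EIE way: scan a for non-zeros, broadcast
    (a_j, j) to every PE, and let each PE walk v[p[j]:p[j+1]] while keeping
    a running row index derived from the z entries."""
    pes = build_interleaved_csc(W, num_pes)
    b_parts = [np.zeros(nrows) for (_, _, _, nrows) in pes]
    for j, a_j in enumerate(a):
        if a_j == 0:
            continue                          # skip zero activations entirely
        for k, (v, z, p, _) in enumerate(pes):
            row = -1                          # running sum of (z + 1), minus one
            for idx in range(p[j], p[j + 1]):
                row += z[idx] + 1
                b_parts[k][row] += v[idx] * a_j
    b = np.zeros(W.shape[0])
    for k, part in enumerate(b_parts):        # re-interleave per-PE outputs
        b[k::num_pes] = part
    return b
```

For any dense `W` and `a`, `sparse_matvec(W, a)` should match `W @ a`, while only non-zero activations and non-zero weights contribute work.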

Hardware Implementation

  • Activation Queue and Load Balancing: each PE buffers the broadcast activations and their indices in a FIFO queue. The broadcast is disabled if any PE has a full queue, and at any point in time each PE processes the activation at the head of its own queue.
  • Pointer Read Unit: the start and end pointers $p_j$ and $p_{j+1}$ are stored in two SRAM banks selected by the LSB of the address, so they will always be in different banks and can be read in a single cycle.
  • Sparse Matrix Read Unit: uses pointers $p_j$ and $p_{j+1}$ to read the non-zero elements (if any) of this PE's slice of column $W_j$ from the sparse-matrix SRAM.
  • Arithmetic Unit: performs the multiply-accumulate operation $b_i \leftarrow b_i + v \times a_j$ for each non-zero weight it receives.
  • Activation Read/Write: contains two activation register files that accommodate the source and destination activation values respectively during a single round of FC layer computation
  • Distributed Leading Non-Zero Detection: selects the first non-zero activation among a group of PEs. Each local result is sent to a Leading Non-zero Detection node (LNZD node) and passed up the hierarchy to the CCU (see the sketch after this list).
  • Central Control Unit: the root LNZD node. It communicates with the master (e.g., a CPU) and monitors the state of every PE by setting the control registers. The CCU has two modes: I/O and Computing.
    • In the I/O mode, all of the PEs are idle while the activations and weights in every PE can be accessed by a DMA connected with the Central Unit. This is a one-time cost.
    • In the Computing mode, the CCU repeatedly collects a non-zero value from the LNZD quadtree and broadcasts this value to all PEs.
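The hierarchical leading non-zero selection can be sketched behaviourally as follows (the per-PE candidate format and the function name are my own assumptions; this is not a cycle-accurate model of the quadtree):

```python
def lnzd_quadtree(candidates):
    """Hierarchical leading non-zero detection: each LNZD node keeps the
    candidate with the smallest activation index among its (up to) four
    children, and the root result is what the CCU broadcasts to all PEs.
    `candidates` holds one (index, value) pair per PE, or None if that PE
    currently has no non-zero activation to offer."""
    while len(candidates) > 1:
        parents = []
        for g in range(0, len(candidates), 4):       # group of up to 4 children
            group = [c for c in candidates[g:g + 4] if c is not None]
            parents.append(min(group, key=lambda c: c[0]) if group else None)
        candidates = parents
    return candidates[0]

# Example: with PE candidates [(7, 0.5), None, (2, 1.0), (9, 0.25), (4, 0.1)],
# the root selects (2, 1.0) as the next activation to broadcast.
```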

Introducing 4 pipeline stages to update one activation brought the critical path delay down to 1.15 ns (a scalar sketch follows the list):

  • codebook lookup and address accumulation (in parallel)
  • output activation read and input activation multiply (in parallel)
  • shift and add
  • output activation write.
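Roughly, one update per non-zero weight looks like the following scalar model (hypothetical names; plain Python ints stand in for 16-bit fixed-point values, and the stage-3 alignment amount is an assumption, not taken from the paper):

```python
def update_one_activation(codebook, weight_idx, rel_index, prev_row,
                          out_acts, a_j, shift=0):
    """One output-activation update split into the four pipeline stages.
    codebook maps a 4-bit weight index to a 16-bit fixed-point weight;
    out_acts is the destination activation register file; prev_row is -1
    for the first entry of a column."""
    # Stage 1: codebook lookup and address accumulation (in parallel).
    weight = codebook[weight_idx]          # 4-bit index -> 16-bit weight
    row = prev_row + rel_index + 1         # running sum of relative indices

    # Stage 2: output activation read and input activation multiply (in parallel).
    acc = out_acts[row]
    product = weight * a_j

    # Stage 3: shift and add (align the fixed-point product, then accumulate).
    acc += product >> shift

    # Stage 4: output activation write.
    out_acts[row] = acc
    return row                             # carried into the next update
```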

Design Space Exploration

Queue Depth. The activation FIFO queue deals with load imbalance between the PEs. A deeper FIFO queue can better decouple producer and consumer.

SRAM Width. We choose an SRAM with a 64-bit interface to store the sparse matrix (Spmat) since it minimizes the total energy: wider SRAM interfaces reduce the total number of SRAM accesses but increase the energy cost per SRAM read.

Arithmetic Precision. We use 16-bit fixed-point arithmetic. A 16-bit fixed-point multiplication consumes 5 times less energy than a 32-bit fixed-point multiplication and 6.2 times less energy than a 32-bit floating-point one.