[Read Paper] EIE: Efficient Inference Engine on Compressed Deep Neural Network
Exploit the Sparsity of Activations with Compressed Sparse Column (CSC) Format
For each column $W_j$ of the matrix $W$ we store a vector $v$ that contains the non-zero weights, and a second, equal-length vector $z$ that encodes the number of zeros before the corresponding entry in $v$. Each entry of $v$ and $z$ is represented by a four-bit value. If more than 15 zeros appear before a non-zero entry, we add a padding zero in vector $v$ (with a $z$ entry of 15) so that the zero count still fits in four bits.
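As a concrete illustration, here is a minimal Python sketch of this per-column encoding (the function name and the plain-integer weights are illustrative; EIE actually stores 4-bit codebook indices in $v$):

```python
def encode_column_csc(column):
    """Encode one column W_j as (v, z) with 4-bit relative indexing.

    v holds the non-zero weights, z the number of zeros before the
    corresponding entry of v. Because each z entry is only four bits,
    a run of more than 15 zeros is broken up by emitting a padding
    zero into v with a z entry of 15.
    """
    v, z = [], []
    zeros = 0
    for w in column:
        if w == 0:
            zeros += 1
            if zeros == 16:       # 16 zeros cannot be encoded in 4 bits
                v.append(0)       # padding zero weight
                z.append(15)
                zeros = 0
        else:
            v.append(w)
            z.append(zeros)
            zeros = 0
    return v, z

# 20 zeros before the last weight force one padding zero into v.
print(encode_column_csc([0, 0, 0, 4] + [0] * 20 + [7]))
# -> ([4, 0, 7], [3, 15, 4])
```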
Parallelizing Compressed DNN
We perform the sparse matrix × sparse vector operation by scanning vector $a$ to find its next non-zero value $a_j$ and broadcasting $a_j$ along with its index $j$ to all PEs. Each PE then multiplies $a_j$ by the non-zero elements in its portion of column $W_j$, accumulating the partial sums in accumulators, one for each element of the output activation vector $b$. In the CSC representation, these non-zero weights are stored contiguously, so each PE simply walks through its $v$ array from location $p_j$ to $p_{j+1}-1$ to load the weights. To address the output accumulators, the row number $i$ corresponding to each weight $W_{ij}$ is generated by keeping a running sum of the entries of the $z$ array.
The interleaved CSC representation allows each PE to quickly find the non-zeros in each column to be multiplied by $a_j$. This organization also keeps all of the computation except for the broadcast of the input activations local to a PE.
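A minimal sketch of one PE's inner loop under the representation above ($p$ is the vector of column-start pointers into this PE's own $v$/$z$ arrays, $b$ its local output accumulators, and row numbers are in the PE's local numbering, ignoring the interleaving across PEs; all names are illustrative):

```python
def pe_accumulate(a_j, j, v, z, p, b):
    """One PE's work for a broadcast non-zero activation a_j of column j.

    Walks this PE's slice of column W_j, stored contiguously in
    v[p[j] : p[j+1]], recovers each row index from a running sum over z,
    and accumulates a_j * weight into the local output accumulators b.
    """
    row = 0                        # running row index within this PE's rows
    for k in range(p[j], p[j + 1]):
        row += z[k]                # skip the zeros preceding this weight
        b[row] += v[k] * a_j       # multiply-accumulate into output b
        row += 1                   # step past the weight itself
    return b
```

Note that the padding zeros inserted by the encoder multiply to zero, so they cost a cycle but leave $b$ unchanged.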
Hardware Implementation
- Activation Queue and Load Balancing: each broadcast non-zero input activation, together with its index, is pushed into a FIFO queue in every PE. The broadcast is disabled if any PE has a full queue, and at any point in time each PE processes the activation at the head of its own queue (a small sketch of this interaction follows this list).
- Pointer Read Unit: the start and end pointers $p_j$ and $p_{j+1}$ of column $j$ are stored in two SRAM banks selected by the LSB of the address, so $p_j$ and $p_{j+1}$ will always be in different banks and can be read in a single cycle.
- Sparse Matrix Read Unit: uses pointers $p_j$ and $p_{j+1}$ to read the non-zero elements (if any) of this PE's slice of column $W_j$ from the sparse-matrix SRAM.
- Arithmetic Unit: performs the multiply-accumulate operation
- Activation Read/Write: contains two activation register files that accommodate the source and destination activation values respectively during a single round of FC layer computation
- Distributed Leading Non-Zero Detection: each group of four PEs performs a local leading non-zero detection to select the first non-zero result among its output activations. The result is sent to a Leading Non-Zero Detection node (LNZD node); these nodes form the quadtree that supplies the next non-zero activation to the CCU.
- Central Control Unit: the CCU is the root of the LNZD quadtree. It communicates with the master (for example, a CPU) and monitors the state of every PE by setting the control registers. There are two modes in the CCU: I/O and Computing.
- In the I/O mode, all of the PEs are idle while the activations and weights in every PE can be accessed by a DMA connected with the Central Unit. This is a one-time cost.
- In the Computing mode, the CCU repeatedly collects a non-zero value from the LNZD quadtree and broadcasts this value to all PEs.
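To make the broadcast/queue interaction concrete, here is a toy cycle-level sketch of the Computing mode (the queue depth, PE count, and random per-activation work durations are illustrative assumptions, not the paper's design):

```python
from collections import deque
import random

QUEUE_DEPTH = 4   # illustrative; the paper explores this in the design space
NUM_PES = 4       # illustrative

# Non-zero input activations as (index, value) pairs, as found by the LNZD tree.
activations = [(j, a) for j, a in enumerate([0, 3, 0, 0, 5, 1, 0, 2, 0, 4]) if a != 0]
queues = [deque() for _ in range(NUM_PES)]
busy = [0] * NUM_PES    # cycles of work left on the current activation per PE

cycle = 0
while activations or any(queues) or any(busy):
    # The broadcast is disabled if any PE has a full queue.
    if activations and all(len(q) < QUEUE_DEPTH for q in queues):
        item = activations.pop(0)
        for q in queues:
            q.append(item)
    # Each PE processes the activation at the head of its own queue.
    for pe in range(NUM_PES):
        if busy[pe] == 0 and queues[pe]:
            queues[pe].popleft()
            busy[pe] = random.randint(1, 5)   # columns have varying non-zero counts
        if busy[pe] > 0:
            busy[pe] -= 1
    cycle += 1

print("finished after", cycle, "cycles")
```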
The critical path delay was brought down to 1.15 ns by introducing four pipeline stages to update one activation (sketched in code after this list):
- codebook lookup and address accumulation (in parallel)
- output activation read and input activation multiply (in parallel)
- shift and add
- output activation write.
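A sketch of one activation update organized along these four stages (the codebook, register-file, and fixed-point alignment details below are illustrative placeholders, not the paper's RTL):

```python
SHIFT = 8   # alignment shift for the fixed-point product (illustrative value)

def update_activation(weight_idx, rel_row, a_j, codebook, dst_regs, base_row):
    """One activation update, split along the four pipeline stages above."""
    # Stage 1: codebook lookup and address accumulation (in parallel).
    w = codebook[weight_idx]         # 4-bit index -> 16-bit fixed-point weight
    row = base_row + rel_row         # running sum of relative indices gives the row

    # Stage 2: output activation read and input activation multiply (in parallel).
    acc = dst_regs[row]
    prod = w * a_j

    # Stage 3: shift and add.
    acc += prod >> SHIFT             # realign the product's binary point

    # Stage 4: output activation write.
    dst_regs[row] = acc
    return row + 1                   # base row for the next non-zero in this column
```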
Design Space Exploration
Queue Depth. The activation FIFO queue deals with load imbalance between the PEs. A deeper FIFO queue can better decouple producer and consumer.
SRAM Width. We choose an SRAM with a 64-bit interface to store the sparse matrix (Spmat) since it minimized the total energy. Wider SRAM interfaces reduce the number of total SRAM accesses but increase the energy cost per SRAM read.
Arithmetic Precision. We use 16-bit fixed-point arithmetic. A 16-bit fixed-point multiplication consumes 5 times less energy than a 32-bit fixed-point multiplication and 6.2 times less energy than a 32-bit floating-point multiplication.
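For illustration, a tiny fixed-point sketch; the Q8.8 split below is an assumption for the example (the notes only state 16-bit fixed point, not the position of the binary point):

```python
FRAC_BITS = 8   # Q8.8: 8 integer bits, 8 fractional bits (assumed split)

def to_fix(x):
    """Quantize a real value to a signed 16-bit Q8.8 integer (with saturation)."""
    q = int(round(x * (1 << FRAC_BITS)))
    return max(-(1 << 15), min((1 << 15) - 1, q))

def fix_mul(a, b):
    """Multiply two Q8.8 values; shift the 32-bit product back to Q8.8."""
    return (a * b) >> FRAC_BITS

w, a = to_fix(0.75), to_fix(-1.5)
print(fix_mul(w, a) / (1 << FRAC_BITS))   # -1.125
```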