[Read Paper] EIE: Efficient Inference Engine on Compressed Deep Neural Network
Exploit the Sparsity of Activations with Compressed Sparse Column (CSC) Format
For each column W_j of the matrix, the non-zero weights are stored in a vector v, together with an equal-length vector z that records the number of zeros before each corresponding entry of v; a pointer vector p delimits the columns, so column j occupies entries p_j through p_{j+1}-1.
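A minimal Python sketch of this encoding (function names and the raw-value storage are ours; EIE actually stores 4-bit codebook indices rather than full values, and caps each zero run at 15 by inserting padding zeros):

```python
def encode_csc(W, index_bits=4):
    """Sketch of EIE's CSC variant: v holds stored entries, z holds the
    number of zeros before each entry (capped at 2**index_bits - 1;
    longer runs insert a padding zero), and p delimits the columns."""
    max_run = (1 << index_bits) - 1
    v, z, p = [], [], [0]
    rows, cols = len(W), len(W[0])
    for j in range(cols):
        run = 0                        # zeros since the last stored entry
        for i in range(rows):
            if W[i][j] == 0:
                run += 1
                if run > max_run:      # run overflow: emit a padding zero
                    v.append(0)
                    z.append(max_run)
                    run = 0
            else:
                v.append(W[i][j])
                z.append(run)
                run = 0
        p.append(len(v))               # column j occupies v[p[j]:p[j+1]]
    return v, z, p

def decode_column(v, z, p, j, rows):
    """Reconstruct dense column j from the compressed arrays."""
    col, i = [0] * rows, -1
    for k in range(p[j], p[j + 1]):
        i += z[k] + 1                  # skip z[k] zeros to this entry's row
        col[i] = v[k]
    return col
```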
Parallelizing Compressed DNN
Perform the sparse matrix–sparse vector operation by scanning the input vector a to find each non-zero value a_j, then broadcasting a_j together with its index j to all PEs.
The interleaved CSC representation allows each PE to quickly find the non-zeros in each column to be multiplied by a_j.
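The scheme can be pictured with a small sketch (our own Python, using dense per-PE slices for brevity): rows of W are interleaved across PEs, a central scan finds each non-zero a_j, and every PE accumulates its slice of column j in parallel.

```python
import numpy as np

def eie_spmv(W, a, num_pes=4):
    """Toy model of EIE's parallel sparse M x V: PE k owns rows i with
    i % num_pes == k, and every broadcast non-zero (j, a_j) triggers a
    scan of column j in all PEs at once."""
    slices = [W[k::num_pes, :] for k in range(num_pes)]   # per-PE rows
    partial = [np.zeros(s.shape[0]) for s in slices]
    for j in np.flatnonzero(a):          # central unit scans a
        for k in range(num_pes):         # conceptually in parallel
            col = slices[k][:, j]
            nz = np.flatnonzero(col)     # CSC makes this lookup cheap
            partial[k][nz] += col[nz] * a[j]
    out = np.zeros(W.shape[0])
    for k in range(num_pes):             # re-interleave the outputs
        out[k::num_pes] = partial[k]
    return out

W = np.array([[0, 2, 0], [1, 0, 0], [0, 0, 3], [0, 4, 0]], dtype=float)
a = np.array([0.0, 1.5, -1.0])
assert np.allclose(eie_spmv(W, a, num_pes=2), W @ a)
```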
Hardware Implementation
- Activation Queue and Load Balancing: the broadcast is disabled whenever any PE has a full queue; at any point in time, each PE processes the activation at the head of its own queue (a toy model of this protocol is sketched after this list).
- Pointer Read Unit: the index j at the head of the activation queue is used to look up the start and end pointers p_j and p_{j+1} for the non-zeros of column j. The pointers are stored in two single-ported SRAM banks, and p_j and p_{j+1} will always be in different banks, so both can be read in one cycle.
- Sparse Matrix Read Unit: uses the pointers p_j and p_{j+1} to read the non-zero elements of this PE's slice of column j from the sparse-matrix SRAM.
- Arithmetic Unit: performs the multiply-accumulate operation b_x = b_x + v × a_j.
- Activation Read/Write: contains two activation register files that accommodate the source and destination activation values respectively during a single round of FC layer computation
- Distributed Leading Non-Zero Detection: each group of PEs locally selects its first non-zero input activation; the result is sent to a Leading Non-zero Detection Node (LNZD Node), and the LNZD nodes form a quadtree whose root feeds the Central Control Unit.
- Central Control Unit: it communicates with the master (e.g., a host CPU) and monitors the state of every PE by setting the control registers. The Central Unit has two modes: I/O and Computing.
- In the I/O mode, all of the PEs are idle while the activations and weights in every PE can be accessed by a DMA connected with the Central Unit. This is a one-time cost.
- In the Computing mode, the CCU repeatedly collects a non-zero value from the LNZD quadtree and broadcasts this value to all PEs.
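A toy cycle model of the queue-and-broadcast protocol above (our own construction, not the paper's; pointer read, sparse-matrix read, and the MAC are collapsed into roughly one cycle per non-zero):

```python
from collections import deque

def simulate_pes(pe_slices, act_stream, fifo_depth=4):
    """pe_slices[k][j] -> list of (value, local_row) pairs in PE k's
    slice of column j; act_stream -> non-zero (j, a_j) pairs in scan
    order. Returns total cycles and each PE's partial-sum dict."""
    n = len(pe_slices)
    queues = [deque() for _ in range(n)]
    busy = [0] * n                      # cycles left on current column
    acc = [{} for _ in range(n)]        # local_row -> partial sum
    pending, cycles = list(act_stream), 0
    while pending or any(queues) or any(busy):
        cycles += 1
        # Broadcast is disabled whenever any PE's queue is full.
        if pending and all(len(q) < fifo_depth for q in queues):
            j, aj = pending.pop(0)
            for q in queues:
                q.append((j, aj))
        for k in range(n):
            if busy[k]:
                busy[k] -= 1            # still working on a column
            elif queues[k]:
                j, aj = queues[k].popleft()
                entries = pe_slices[k].get(j, [])
                for v, x in entries:    # MAC: b_x += v * a_j
                    acc[k][x] = acc[k].get(x, 0.0) + v * aj
                busy[k] = max(len(entries) - 1, 0)
    return cycles, acc
```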
The design brought the critical-path delay down to 1.15 ns by introducing 4 pipeline stages to update one activation (sketched after this list):
- codebook lookup and address accumulation (in parallel)
- output activation read and input activation multiply (in parallel)
- shift and add
- output activation write.
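A sequential Python rendering of these four stages (hypothetical names; in hardware the stages overlap across entries, and the paired operations in stages 1 and 2 execute concurrently):

```python
def update_one_activation(codebook, w_idx, zero_run, cursor, a_j, b):
    """Walk one (index, zero-run) weight entry through the four stages."""
    # Stage 1: codebook lookup || address accumulation.
    w = codebook[w_idx]                 # 4-bit index -> 16-bit weight
    x = cursor + zero_run + 1           # absolute row of this entry
    # Stage 2: output activation read || input activation multiply.
    b_x = b[x]
    prod = w * a_j
    # Stage 3: shift and add (in hardware the fixed-point product is
    # aligned by shifting before the addition).
    b_x += prod
    # Stage 4: output activation write.
    b[x] = b_x
    return x                            # cursor for the next entry
```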
Design Space Exploration
Queue Depth. The activation FIFO queue deals with load imbalance between the PEs. A deeper FIFO queue can better decouple producer and consumer.
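The earlier simulate_pes toy model can illustrate this: with per-column work drawn at random, a deeper queue lets a lightly loaded PE buffer upcoming activations while a busy PE catches up, so the cycle count shrinks until the busiest PE's total work dominates.

```python
import random
random.seed(0)
slices = [{j: [(1.0, r) for r in range(random.randint(0, 12))]
           for j in range(64)} for _ in range(4)]
stream = [(j, 1.0) for j in range(64)]
for depth in (1, 2, 4, 8):
    cycles, _ = simulate_pes(slices, stream, fifo_depth=depth)
    print(f"fifo_depth={depth}: {cycles} cycles")
```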
SRAM Width. We choose an SRAM with a 64-bit interface to store the sparse matrix (Spmat), since it minimized the total energy. Wider SRAM interfaces reduce the total number of SRAM accesses but increase the energy cost per SRAM read.
Arithmetic Precision. We use 16-bit fixed-point arithmetic. A 16-bit fixed-point multiplication consumes 5 times less energy than a 32-bit fixed-point one and 6.2 times less energy than a 32-bit floating-point one.
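For concreteness, a Q8.8 sketch of the fixed-point multiply-accumulate (the 8/8 integer/fraction split is our assumption; the notes only specify 16-bit fixed point):

```python
def to_q8_8(x):
    """Quantize to a signed 16-bit Q8.8 fixed-point integer."""
    return max(-32768, min(32767, int(round(x * 256))))

def mac_q8_8(acc, w_q, a_q):
    """The 32-bit product carries 16 fractional bits, so it is shifted
    right by 8 before the add ('shift and add' in the pipeline above)."""
    return acc + ((w_q * a_q) >> 8)

acc = mac_q8_8(0, to_q8_8(0.75), to_q8_8(-1.5))
print(acc / 256)   # -1.125
```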