[Read Paper] Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
This article goes into more detail on Eyeriss v2 than the other Eyeriss v2 paper.
Here is the list of papers in the Eyeriss series:
- Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices
- Eyeriss v2: A Flexible and High-Performance Accelerator for Emerging Deep Neural Networks
- Eyeriss: A Spatial Architecture for Energy-Efficient Dataflow for Convolutional Neural Networks
- Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks
Hierarchical Architecture of Eyeriss
Top Level
The PEs and GLBs are grouped into clusters to support a flexible on-chip network (NoC) that connects the GLBs to the PEs at low cost.
Hierarchical Mesh Network (HM-NoC)
The HM-NoC can be configured into four different modes depending on the data reuse opportunity and bandwidth requirements.
- In the high bandwidth mode, each GLB bank or off-chip data I/O can deliver data independently to the PEs in the cluster, which achieves unicast.
- In the high reuse mode, data from the same source can be routed to all PEs in different clusters, which achieves broadcast.
- For situations where the data reuse cannot fully utilize the entire PE array with broadcast, different multicast modes, specifically grouped-multicast and interleaved-multicast, can be adopted according to the desired multicast patterns.
The HM-NoC uses different modes for different types of layers:
- Conventional CONV layers: In normal CONV layers, there is plenty of data reuse for both iacts and weights. To keep all 4 PEs busy at the lowest bandwidth requirement, we need 2 iacts and 2 weights from the data source (ignoring the reuse from SPad). In this case, the HM-NoC for either iacts or weights has to be configured into the grouped-multicast mode, while the other one is configured into the interleaved-multicast mode.
- Depth-wise (DP) CONV layers: For DP CONV layers, there can be nearly no reuse for iacts due to the lack of output channels. Therefore, we can only exploit the reuse of weights by broadcasting the weights to all PEs while fetching unique iacts for each PE.
- Fully-connected (FC) layers: Contrary to the DP CONV layers, FC layers usually see little reuse for weights, especially when the batch size is limited. In this case, the modes of the iact and weight NoCs are swapped relative to the DP CONV case: the weights are now unicast to the PEs while the iacts are broadcast to all PEs.
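The mode choices above can be sketched as a small routing table. This is only an illustrative model, not the HM-NoC router logic. The function `route`, the `LAYER_CONFIG` dictionary, and the exact PE-to-source patterns for the two multicast modes (adjacent PEs share a source in grouped-multicast, alternating PEs share a source in interleaved-multicast) are assumptions made for this sketch. Putting iacts on grouped-multicast for CONV layers is also an arbitrary choice, since the text allows either assignment.

```python
def route(mode, num_pes=4, num_groups=2):
    """Return, for each PE index, the index of the data source it receives from.

    Hypothetical model: the multicast patterns below are assumptions.
    """
    if mode == "unicast":
        # Every PE gets its own source (high bandwidth, no reuse).
        return list(range(num_pes))
    if mode == "broadcast":
        # One source feeds every PE (high reuse, low bandwidth).
        return [0] * num_pes
    if mode == "grouped_multicast":
        # Assumed pattern: adjacent PEs share a source, e.g. [0, 0, 1, 1].
        group_size = num_pes // num_groups
        return [pe // group_size for pe in range(num_pes)]
    if mode == "interleaved_multicast":
        # Assumed pattern: alternating PEs share a source, e.g. [0, 1, 0, 1].
        return [pe % num_groups for pe in range(num_pes)]
    raise ValueError(f"unknown mode: {mode}")


# Layer type -> NoC mode for each data type, following the list above.
LAYER_CONFIG = {
    "conv": {"iact": "grouped_multicast", "weight": "interleaved_multicast"},
    "depthwise_conv": {"iact": "unicast", "weight": "broadcast"},
    "fc": {"iact": "broadcast", "weight": "unicast"},
}


def pe_inputs(layer, num_pes=4):
    """Return the (iact source, weight source) pair seen by each PE."""
    cfg = LAYER_CONFIG[layer]
    iact_src = route(cfg["iact"], num_pes)
    weight_src = route(cfg["weight"], num_pes)
    return list(zip(iact_src, weight_src))


if __name__ == "__main__":
    for layer in LAYER_CONFIG:
        print(layer, pe_inputs(layer))
```

With 4 PEs, `pe_inputs("conv")` gives every PE a distinct (iact, weight) pair while drawing on only 2 iact sources and 2 weight sources, which is the point of pairing the two multicast modes.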
Eyeriss v2 PE Architecture
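To handle the read dependencies of sparse processing (an address must be read before the data it points to, and a non-zero iact must be fetched before its weights) while still maintaining throughput, the PE is implemented using seven pipeline stages and five SPads: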
- The first two pipeline stages are responsible for fetching non-zero iacts from the SPads.
- After a non-zero iact is fetched, the next three pipeline stages read the corresponding weights.
- The final two stages in the pipeline perform the MAC computation on the fetched non-zero iact and weight and then send the updated psum either back to the psum SPad or out of the PE.
- The iact address SPad stores the address vector of the CSC compressed iacts, which is used to address the reads from the iact data SPad that holds the non-zero data vector as well as the count vector.
- There is a weight address SPad to address the reads from the weight data SPad for the correct column of weights.
- There is a psum SPad, which holds the psums that the final two pipeline stages update.
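The flow through the PE can be sketched functionally as a sparse multiply-accumulate that skips zeros. This is a rough model under stated assumptions, not the actual hardware. The function `pe_process` and its parameter names are hypothetical. The iacts are passed as ready-made (index, value) pairs instead of being decoded from the iact address and data SPads. Weights use explicit row indices rather than a count vector. Nothing here is cycle-accurate. The stage numbers in the comments simply follow the description above.

```python
def pe_process(nonzero_iacts, w_data, w_rows, w_addr, num_psums):
    """Functional sketch of one PE pass over compressed data.

    nonzero_iacts: list of (index, value) pairs for the non-zero iacts.
    w_data, w_rows: non-zero weights and their row indices, column by column.
    w_addr: start position of each weight column in w_data, plus an end marker.
    """
    psum = [0] * num_psums  # stands in for the psum SPad
    # Stages 1-2: fetch the next non-zero iact (zeros are never visited).
    for index, iact in nonzero_iacts:
        # Stages 3-5: use the weight address SPad to find the weight column
        # that corresponds to this iact, then read its non-zero weights.
        start, end = w_addr[index], w_addr[index + 1]
        for k in range(start, end):
            # Stages 6-7: MAC and write the updated psum back.
            psum[w_rows[k]] += iact * w_data[k]
    return psum


if __name__ == "__main__":
    # Dense weights [[1, 0, 2], [0, 3, 0]] stored column by column.
    w_data, w_rows, w_addr = [1, 3, 2], [0, 1, 0], [0, 1, 2, 3]
    # Dense iacts [5, 0, 7]: only the two non-zeros are processed.
    print(pe_process([(0, 5), (2, 7)], w_data, w_rows, w_addr, num_psums=2))
```

In this model an empty weight column costs nothing, because `start == end` and the inner loop does not run.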
This architecture stores its data in the CSC compressed format. I will use a simpler representation here, listing each non-zero value with its row and column:
data | row | col |
---|---|---|
a | 1 | 0 |
b | 2 | 0 |
c | 0 | 1 |
d | 1 | 1 |
e | 3 | 1 |
f | 2 | 2 |
g | 3 | 4 |
h | 1 | 5 |
i | 3 | 5 |
j | 0 | 7 |
k | 1 | 7 |
l | 2 | 7 |
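The (data, row, col) table can be turned into the three CSC vectors mentioned earlier (data, count, address). The sketch below assumes that the count vector holds the number of zeros between a non-zero and the previous non-zero in the same column (or the top of the column), and that the address vector holds the starting position of each column in the data vector plus a final end marker. The name `coo_to_csc` and the value `NUM_COLS = 8` (taken from the largest column index, 7) are choices made for this example.

```python
# (data, row, col) entries copied from the table above.
COO = [
    ("a", 1, 0), ("b", 2, 0),
    ("c", 0, 1), ("d", 1, 1), ("e", 3, 1),
    ("f", 2, 2),
    ("g", 3, 4),
    ("h", 1, 5), ("i", 3, 5),
    ("j", 0, 7), ("k", 1, 7), ("l", 2, 7),
]
NUM_COLS = 8  # assumption: columns 0..7


def coo_to_csc(entries, num_cols):
    """Convert (data, row, col) entries into data, count, and address vectors."""
    # CSC walks the matrix column by column, top to bottom.
    entries = sorted(entries, key=lambda e: (e[2], e[1]))
    data, count, address = [], [], [0]
    col, prev_row = 0, -1
    for value, row, c in entries:
        # Close every column before c; empty columns repeat the same address.
        while col < c:
            address.append(len(data))
            col += 1
            prev_row = -1
        data.append(value)
        # Zeros skipped since the previous non-zero in this column.
        count.append(row - prev_row - 1)
        prev_row = row
    # Close the remaining columns, ending with the total number of non-zeros.
    while col < num_cols:
        address.append(len(data))
        col += 1
    return data, count, address


if __name__ == "__main__":
    data, count, address = coo_to_csc(COO, NUM_COLS)
    print("data:   ", data)
    print("count:  ", count)     # [1, 0, 0, 0, 1, 2, 3, 1, 1, 0, 0, 0]
    print("address:", address)   # [0, 2, 5, 6, 6, 7, 9, 9, 12]
```

Columns 3 and 6 are empty, which shows up as repeated values (6, 6 and 9, 9) in the address vector.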