[Weekly Review] 2020/01/27-02/02


2020/01/27-2020/02/02

This week I read a lot of material on state-of-the-art chip architectures, mainly from HPML 2019 and Hot Chips 2019. Although progress on my FYP was a little slow this week, I figured out the dataflow and part of the control logic of Eyeriss V2, which was rather tricky and time-consuming but will benefit next week's implementation. I also finished translating Chisel Bootcamp chapter 3 from English to Chinese.

Next week I'll get back to the implementation of Eyeriss. If I have time, I also need to read the papers left over from previous weeks.
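Since that implementation will be in Chisel, here is a minimal, hypothetical sketch of the kind of MAC-style processing element involved: it latches a filter weight locally and accumulates input-activation × weight products into a partial sum. The port names and bit widths are my own placeholders, not the actual Eyeriss V2 PE design.

```scala
// Minimal sketch of a MAC-style processing element (Chisel 3).
// Signal names and widths are placeholders, not the real Eyeriss V2 PE.
import chisel3._

class SimplePE(dataWidth: Int = 8, accWidth: Int = 32) extends Module {
  val io = IO(new Bundle {
    val loadWeight = Input(Bool())               // latch a new filter weight
    val weightIn   = Input(SInt(dataWidth.W))
    val actIn      = Input(SInt(dataWidth.W))    // input activation for this cycle
    val actValid   = Input(Bool())
    val clearPsum  = Input(Bool())               // start a new partial sum
    val psumOut    = Output(SInt(accWidth.W))
  })

  val weight = RegInit(0.S(dataWidth.W))
  val psum   = RegInit(0.S(accWidth.W))

  when(io.loadWeight) { weight := io.weightIn }

  when(io.clearPsum) {
    psum := 0.S
  }.elsewhen(io.actValid) {
    // multiply-accumulate: psum += act * weight (truncating add keeps accWidth)
    psum := psum +% (io.actIn * weight)
  }

  io.psumOut := psum
}
```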

Hot Chips 2019 [1]

AMD [2]

Accelerated core IP

Chiplet architecture

High-speed coherent interconnects

System and software co-optimization

Continued technology scaling

Co-designing architecture and infrastructure [1]

Why?

  • Necessary due to End of Moore’s Law

  • Architect infrastructure for usability and scale

  • Leverage all areas of expertise

DLA Co-design:

ML research:

  • Computational requirements for cutting-edge models.

  • Input/output data feed rates

  • Types of operations to accelerate

  • Latency and bw requirements

  • Trains without loss scaling, unlike float 16

  • Latest computational requirements

  • Size and scope of models

ASIC:

  • Small mantissa reduces multiplier power, area

Software:

  • Proper flexibility and programmability

  • Controllability

  • Same dynamic range as float32, same Inf/NaN behavior (see the bfloat16 sketch after this list)

  • Programmability for parallelism infrastructure
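The bullets above about training without loss scaling (unlike float16), a small mantissa, and the same dynamic range and Inf/NaN behavior as float32 describe a bfloat16-style format: it keeps the 8-bit exponent of float32 but only a 7-bit mantissa, so multipliers shrink while the representable range stays the same. A bfloat16 value is essentially the upper 16 bits of the corresponding float32; the snippet below is a minimal illustration of that (truncation only, no rounding).

```scala
// Minimal illustration: bfloat16 as the upper 16 bits of an IEEE-754 float32.
// Truncation only (no round-to-nearest-even); purely for illustration.
object Bfloat16Demo extends App {
  def toBf16Bits(x: Float): Int = (java.lang.Float.floatToRawIntBits(x) >>> 16) & 0xFFFF

  def fromBf16Bits(bits: Int): Float = java.lang.Float.intBitsToFloat(bits << 16)

  val xs = Seq(3.14159f, 1e-30f, 1e30f, 65504f /* float16 max */, 1e5f)
  for (x <- xs) {
    val bf = fromBf16Bits(toBf16Bits(x))
    println(s"float32 = $x   bfloat16 ~= $bf")
  }
  // 1e-30 and 1e30 survive the round-trip (same exponent range as float32),
  // whereas float16 would flush them to 0 and Inf respectively.
}
```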

Compilers:

  • JIT code-gen & parallel IRs

Systems:

  • Power delivery, board space.

  • Board layout

  • Thermal limits

  • Liquid cooling

Data Center:

  • Cooling, buildability

  • Wiring and serviceability

  • Performance metrics

  • Space and network provisioning

  • Network requirements

  • Power delivery

System Co-design

Storage systems:

  • high bandwidth disks, network

Accelerators:

  • infeed & asynchronous abstractions

Software:

  • high throughput, software pipelining, horizontal scaling (see the infeed sketch below)
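These three bullets describe the same co-design goal from different sides: the storage system, the infeed abstraction, and the software pipeline all exist to keep the accelerator busy by overlapping I/O and preprocessing with device compute. Below is a minimal, generic Scala sketch of that pattern with a bounded queue and a background producer thread; loadBatch and runOnAccelerator are hypothetical placeholders, not any real API.

```scala
// Minimal sketch of asynchronous infeed / software pipelining:
// a producer thread loads and preprocesses batches while the consumer
// (standing in for the accelerator) works on the previous one.
import java.util.concurrent.ArrayBlockingQueue

object InfeedSketch extends App {
  val numBatches = 16
  val queue = new ArrayBlockingQueue[Array[Float]](4) // bounded queue gives backpressure

  def loadBatch(i: Int): Array[Float] = {
    Thread.sleep(5)                 // stand-in for disk/network I/O + preprocessing
    Array.fill(1024)(i.toFloat)
  }

  def runOnAccelerator(batch: Array[Float]): Float = {
    Thread.sleep(10)                // stand-in for device compute
    batch.sum
  }

  val producer = new Thread(() => {
    for (i <- 0 until numBatches) queue.put(loadBatch(i))
  })
  producer.start()

  for (_ <- 0 until numBatches) {
    val batch = queue.take()        // overlaps with the producer loading the next batch
    println(runOnAccelerator(batch))
  }
  producer.join()
}
```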

NVIDIA [3]

Network-on-Package and Network-on-Chip

Use GRS for inter-chip communication

Tiled architecture with distributed memory

Scalable DL inference accelerator

Scaling DL inference across NoP/NoC

HUAWEI [4]

DaVinci Core

1D Scalar Unit + 2D Vector Unit + 3D Matrix Unit + 3DSRAM

Building Blocks and their Computation Intensity
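The point of listing the building blocks together with their computation intensity is that arithmetic intensity (operations per byte of data moved) grows with the dimensionality of the unit, which is exactly what helps against the memory wall mentioned next. A rough back-of-the-envelope version (my own numbers, not the talk's):

$$
\begin{aligned}
\text{scalar op: } & O(1)\ \text{ops on } O(1)\ \text{operands} \;\Rightarrow\; O(1)\ \text{ops/byte} \\
\text{vector op (length } n\text{): } & O(n)\ \text{ops on } O(n)\ \text{operands} \;\Rightarrow\; O(1)\ \text{ops/byte} \\
\text{matrix multiply } (n \times n)\text{: } & 2n^{3}\ \text{ops on } 3n^{2}\ \text{operands} \;\Rightarrow\; O(n)\ \text{ops/byte}
\end{aligned}
$$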

Memory Wall & I/O Wall

3D IC to alleviate memory wall, IO wall and logic wall

Technology challenges: why do we need 3D IC?

Mobile AP: LoL + MoL

Ayar Labs [5]

HPE Optical Module

Chip-to-chip communication requires photonics to overcome the I/O bottleneck

The emerging chiplet ecosystem offers an opportunity for in-package optics

In-package optics fundamentally breaks the traditional bandwidth-distance trade-off and supports new high-performance computer architectures

TSMC [6]

EoML: end of Moore's Law

COMPUTE-MEMORY INTEGRATION

2D System

2.5D System

3D TSV System

N3XT System

New Memory System

NVIDIA [7]

Turing SM

UPMEM [8]

MLPerf [9]

Training Metric

Inference Metric

Habana [10], [11], [12]

Designed to scale

Current HPC Challenges

Large message collective communication and reductions [10]

Limited network bandwidth [11]

Parallel data access and computation [11]

Low latency (in inference) [12]

Most DL frameworks are optimized for single-node training [10]

Support different DNN models and layers [12]

Fully utilize HPC clusters [10]

Memory Wall & I/O Wall [4]

New features for improving these challenges

Utilize photonics to overcome the network BW bottleneck [5]

Designed to scale [13], [14]

Co-design the support at runtime level and exploit it at the DL framework level [10]

Distributed (Parallel) Training [10]

Platform-aware compression and quantization (see the toy quantization sketch after this list)

Dynamic adaptive resource management [15]

Software environment with a subset of hardware platform
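As a toy illustration of the compression/quantization item above, the sketch below uses generic symmetric per-tensor int8 quantization (not any particular platform's scheme): each float tensor maps to int8 with a single scale factor, cutting bytes moved per value by 4x relative to float32.

```scala
// Toy symmetric per-tensor int8 quantization: scale = max|x| / 127, q = round(x / scale).
// A generic illustration of "compression and quantization", assuming non-zero input.
object QuantSketch extends App {
  def quantize(xs: Array[Float]): (Array[Byte], Float) = {
    val scale = xs.map(x => math.abs(x)).max / 127.0f
    val q = xs.map(x => math.round(x / scale).min(127).max(-127).toByte)
    (q, scale)
  }

  def dequantize(q: Array[Byte], scale: Float): Array[Float] = q.map(_ * scale)

  val weights = Array(-0.8f, 0.05f, 0.3f, 1.2f, -1.5f)
  val (q, scale) = quantize(weights)
  println(s"scale = $scale, int8 = ${q.mkString(", ")}")
  println(s"reconstructed = ${dequantize(q, scale).mkString(", ")}")
}
```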

References of this week

  1. “HC31 (2019),” Hot Chips: A Symposium on High Performance Chips, 18-Aug-2019. [Online]. Available: https://www.hotchips.org/archives/2010s/hc31/. [Accessed: 31-Jan-2020].

  2. L. Su, “Delivering the Future of High-Performance Computing,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–43, doi: 10.1109/HOTCHIPS.2019.8875685. 

  3. R. Venkatesan et al., “A 0.11 pJ/Op, 0.32-128 TOPS, Scalable Multi-Chip-Module-Based Deep Neural Network Accelerator Designed with a High-Productivity VLSI Methodology,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–24, doi: 10.1109/HOTCHIPS.2019.8875657.

  4. H. Liao, J. Tu, J. Xia, and X. Zhou, “DaVinci: A Scalable Architecture for Neural Network Computing,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–44, doi: 10.1109/HOTCHIPS.2019.8875654.

  5. M. Wade, “TeraPHY: A Chiplet Technology for Low-Power, High-Bandwidth In-Package Optical I/O,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. i–xlviii, doi: 10.1109/HOTCHIPS.2019.8875658.

  6. H.-S. P. Wong, R. Willard, and I. K. Bell, “IC Technology – What Will the Next Node Offer Us?,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–52, doi: 10.1109/HOTCHIPS.2019.8875692. 

  7. J. Burgess, “RTX ON – The NVIDIA TURING GPU,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–27, doi: 10.1109/HOTCHIPS.2019.8875651. 

  8. F. Devaux, “The true Processing In Memory accelerator,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–24, doi: 10.1109/HOTCHIPS.2019.8875680. 

  9. P. Mattson, “ML Benchmark Design Challenges,” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–36, doi: 10.1109/HOTCHIPS.2019.8875660. 

  10. Habana Labs Ltd., “Habana Gaudi Training Platform whitepaper.” Jun-2019.

  11. E. Medina, “[Habana Labs presentation],” in 2019 IEEE Hot Chips 31 Symposium (HCS), 2019, pp. 1–29, doi: 10.1109/HOTCHIPS.2019.8875670.

  12. “Training,” Habana. [Online]. Available: https://habana.ai/training/. [Accessed: 01-Feb-2020].

  13. “HPML 2019: HIGH PERFORMANCE MACHINE LEARNING.” [Online]. Available: https://hpml2019.github.io/#keynote. [Accessed: 28-Jan-2020]. 

  14. J. Keuper and F.-J. Pfreundt, “Distributed Training of Deep Neural Networks: Theoretical and Practical Limits of Parallel Scalability,” in 2016 2nd Workshop on Machine Learning in HPC Environments (MLHPC), 2016, pp. 19–26, doi: 10.1109/MLHPC.2016.006.

  15. V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient Processing of Deep Neural Networks: A Tutorial and Survey,” Proc. IEEE, vol. 105, no. 12, pp. 2295–2329, Dec. 2017, doi: 10.1109/JPROC.2017.2761740.