[Read Paper] Deep Learning Hardware: Past, Present, and Future
This paper covers the following aspects:
- Trends in deep learning research that will influence the hardware architectures and software platforms of the future
- Five DL use cases with different hardware requirements
- Present and Future Deep-Learning Architectures
- Requirements for Future DL Hardware and Software
Past
Loss of popularity in the mid-1990s
What caused it to lose popularity within the research community in the mid-1990s?
- The limited performance of computers at the time;
- The small number of applications for which collecting large labeled datasets was cost effective;
- The effort involved in developing flexible neural net simulators;
- The reluctance of many research institutions at the time to distribute open source software.
Sudden Resurgence around 2013
What sparked its sudden resurgence around 2013? There are four main factors:
- Improved methods;
- Larger datasets with many samples and many categories;
- Low-cost TFLOPS-class general-purpose GPUs (GPGPUs);
- Open-source libraries with interpreted language frontends (Torch, Theano, cuda-convNet, Caffe).
- The first three of these enabled record-breaking results in image recognition and speech recognition;
- The last one allowed these results to be easily replicated, because these libraries incorporated all the engineering “tricks” necessary to get DL models to work.
The success of GPGPU for DL
What has accounted for the success of GPGPU for DL?
- Wide availability
- Generality
- Programmability
- Well-supported software stacks
The failure of dedicated hardware architectures
Why none of the dedicated hardware architectures for neural networks were successful:
- Lacked flexibility
- Were designed for particular types of neural networks that had no proven practical use
The Need for DL Hardware: five use cases
The trend is to rely increasingly on unsupervised, self-supervised, weakly supervised or multi-task learning, for which larger networks perform even better.
There are five use cases with different hardware requirements.
DL research and development
- HPC type multi-node machines
- The communication network must be high bandwidth and low latency to allow for the parallelization of training large models on large datasets
- Using FP32 is necessary because one must be sure that when an experiment fails, it is not because of a lack of numerical accuracy
- The best architectures are those that can be saturated with the smallest batch of samples
- Price and power consumption are relatively secondary to performance and flexibility
Off-line training of DL models for production
- Retraining with new data
- It is possible to perform routine training on specialized hardware with reduced-precision arithmetic
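A minimal sketch of what reduced-precision routine training can look like in software, using PyTorch's automatic mixed precision; the model, data, and hyperparameters are illustrative and assume a CUDA GPU (not taken from the paper):

```python
import torch
from torch import nn

# Toy model and data; illustrates reduced-precision (FP16) training with
# PyTorch automatic mixed precision, one possible software-level realization
# of "routine training with reduced-precision arithmetic".
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.cuda.amp.GradScaler()   # loss scaling to avoid FP16 underflow

for _ in range(100):
    x = torch.randn(64, 512, device="cuda")
    y = torch.randint(0, 10, (64,), device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # run the forward pass in FP16 where safe
        loss = nn.functional.cross_entropy(model(x), y)
    scaler.scale(loss).backward()      # scaled backward pass
    scaler.step(optimizer)
    scaler.update()
```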
Inference on servers in data centers
- Many workloads are relatively “simple” neural networks with sparse inputs, but a lot of computation goes into larger ConvNets.
- Power consumption and cost are important; flexibility and raw performance are secondary; communication latency is unimportant.
- The ideal architecture is a specialized DL-inference accelerator sitting in a standard data-center server node
- Applications
- Newsfeed, advertisement ranking, text classification
- Image, video, and speech understanding, as well as for language translation
Inference on mobile devices and embedded systems
- DL-inference accelerators with very-low power consumption
- Real-time tasks require that the DL system be run on the device without the latency of a round-trip to a server
- Applications
- feature tracking and 3D reconstruction for AR
- object segmentation/recognition
- OCR in natural scenes
- real-time language translation
- speech-based virtual assistants
On-line learning on servers and mobile devices
Typical Basic DL Modules
- Multiple convolutions in 1D, 2D, and 3D;
- Linear operators (matrices);
- Linear operators applied to sparse inputs (word embedding lookup tables for NLP);
- Divisive normalization;
- Element-wise nonlinear functions (e.g., ReLU);
- Pooling/subsampling;
- Element-wise operators between tensors (e.g., additions for residual connections);
- Bilinear operators (multiplicative interactions for attention);
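A minimal PyTorch sketch instantiating the modules listed above; all shapes, sizes, and names are arbitrary examples, not taken from the paper:

```python
import torch
from torch import nn

conv2d = nn.Conv2d(3, 64, kernel_size=3, padding=1)   # convolution (2D case)
linear = nn.Linear(64, 10)                             # linear operator (matrix)
embed = nn.EmbeddingBag(100_000, 64)                   # linear operator on sparse inputs (lookup table)
norm = nn.LocalResponseNorm(5)                         # a divisive-normalization variant
relu = nn.ReLU()                                       # element-wise nonlinear function
pool = nn.MaxPool2d(2)                                 # pooling/subsampling
bilinear = nn.Bilinear(64, 64, 1)                      # bilinear operator (multiplicative interaction)

x = torch.randn(8, 3, 32, 32)
h = pool(relu(norm(conv2d(x))))        # conv -> normalization -> nonlinearity -> pooling
h = h.mean(dim=(2, 3))                 # collapse spatial dims: shape (8, 64)
h = h + torch.tanh(h)                  # element-wise operator between tensors (residual-style addition)
scores = linear(h)                     # dense linear operator
attn = bilinear(h, h)                  # multiplicative interaction, attention-style score
w = embed(torch.tensor([[3, 17, 42]])) # sparse-input lookup: mean of three embeddings -> shape (1, 64)
```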
Future
Architectural Elements of Future DL Systems
Dynamic Networks, Differentiable Programming
- The network architecture is dynamic and changes for every new data point; this requires back-propagation through the dynamically built graph (autograd). See the sketch after this list.
- Applications
- For natural-language processing,
- For data that does not come in the form of a fixed-sized tensor,
- For systems that need to activate parts of a large network on demand in a data-dependent way (such as the Multi-Scale DenseNet architecture)
- For “reasoning” networks whose output is another network specifically designed to answer a particular question
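A minimal sketch of a dynamic (define-by-run) network in PyTorch; the data-dependent depth rule is purely illustrative:

```python
import torch
from torch import nn

class DynamicNet(nn.Module):
    """The compute graph changes for every input, yet autograd can still
    back-propagate through it, because the graph is recorded as it executes."""
    def __init__(self, dim=16):
        super().__init__()
        self.block = nn.Linear(dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, x):
        # Data-dependent control flow: apply the block a variable number of
        # times depending on the input itself (no fixed, static graph).
        steps = int(x.norm().item()) % 4 + 1
        for _ in range(steps):
            x = torch.relu(self.block(x))
        return self.head(x)

net = DynamicNet()
loss = net(torch.randn(16)).pow(2).sum()
loss.backward()   # autograd back-propagates through whatever graph was built
```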
Neural Networks on Graphs
- Many problems are difficult to represent with fixed-size tensors or variable-length sequences of tensors, but are better represented by graphs whose arcs and nodes are annotated by tensors.
- Convolution operations can easily be defined on irregular graphs: they become diagonal operators in the eigenspace of the graph Laplacian, which generalizes the Fourier transform (see the sketch after this list)
- Such networks are used for a wide variety of applications that are likely to violate the assumptions of current DL hardware
- Applications
- 3D meshes, social networks, gene-regulation networks, and chemical molecules
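A minimal sketch of a spectral graph convolution as a diagonal operator in the graph-Laplacian eigenbasis; the small random graph and the filter are illustrative:

```python
import torch

n = 6
A = (torch.rand(n, n) > 0.5).float()
A = torch.triu(A, diagonal=1); A = A + A.t()   # symmetric adjacency, no self-loops
D = torch.diag(A.sum(dim=1))
L = D - A                                      # (unnormalized) graph Laplacian

evals, U = torch.linalg.eigh(L)                # eigenbasis = graph Fourier basis
x = torch.randn(n)                             # one scalar signal per node
g = torch.exp(-2.0 * evals)                    # spectral filter (diagonal), e.g. a low-pass heat kernel

x_hat = U.t() @ x                              # graph Fourier transform
y = U @ (g * x_hat)                            # filter in the spectrum, then transform back
```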
Graph Embedding Networks
- DL is used for large-scale embedding of knowledge bases
- Applications
- Recommender systems
- Use hyperbolic metric spaces to represent hierarchical categories
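A minimal sketch of the Poincaré-ball distance commonly used for hyperbolic embeddings of hierarchies (in the style of Poincaré embeddings); the points here are illustrative:

```python
import torch

def poincare_distance(u, v, eps=1e-6):
    # d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))
    uu = (u * u).sum(-1).clamp(max=1 - eps)
    vv = (v * v).sum(-1).clamp(max=1 - eps)
    duv = ((u - v) ** 2).sum(-1)
    return torch.acosh(1 + 2 * duv / ((1 - uu) * (1 - vv)))

u = torch.tensor([0.1, 0.0])   # a point near the origin (e.g., a broad category)
v = torch.tensor([0.8, 0.3])   # a point near the boundary (e.g., a specific leaf)
print(poincare_distance(u, v))
```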
Memory-Augmented Networks
- To endow DL systems with the ability to reason, they need a short-term memory, to be used as an episodic memory, or a scratchpad/working memory.
- For example, if a system is to answer questions about a series of events, it must be able to store the story in memory and retrieve the relevant bits to answer a particular question
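A minimal sketch of a soft key-value memory read, the core retrieval operation of memory-augmented networks; dimensions and data are illustrative:

```python
import torch

d, slots = 32, 10
keys = torch.randn(slots, d)            # memory keys (e.g., encoded sentences of a story)
values = torch.randn(slots, d)          # memory contents
query = torch.randn(d)                  # encoded question

scores = keys @ query / d ** 0.5        # similarity of the question to every memory slot
weights = torch.softmax(scores, dim=0)  # soft, differentiable addressing over the memory
retrieved = weights @ values            # the "relevant bits", usable by downstream layers
```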
Complex Inference and Search
- The output variable is actually an input to a scoring network whose scalar output (akin to an energy) indicates the incompatibility between the input and an output proposal.
- An inference procedure must search for the output value that minimizes the energy. This type of model is called an energy-based model.
- If the energy minimizing inference procedure is gradient-based, inference hardware will need to support back-propagation.
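A minimal sketch of gradient-based inference in an energy-based model: the output proposal is itself the optimization variable, and inference descends the scalar energy. The energy network here is an arbitrary illustration:

```python
import torch
from torch import nn

energy_net = nn.Sequential(nn.Linear(8 + 4, 32), nn.Tanh(), nn.Linear(32, 1))

x = torch.randn(8)                               # observed input
y = torch.zeros(4, requires_grad=True)           # output proposal, to be inferred
opt = torch.optim.SGD([y], lr=0.1)

for _ in range(50):                              # inference = energy minimization over y
    opt.zero_grad()
    energy = energy_net(torch.cat([x, y])).squeeze()   # scalar incompatibility E(x, y)
    energy.backward()                            # note: inference itself needs back-propagation
    opt.step()
```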
Sparse Activations
- The modules’ activations will become increasingly sparse, with only a subset of variables of a subset of modules being activated at any one time.
- This will need hardware support: most neurons are quiet most of the time, which is good for power dissipation
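A tiny sketch illustrating how activation sparsity could be measured after a ReLU layer; the layer and data are arbitrary, and trained, sparsity-regularized networks are far sparser than this random example:

```python
import torch
from torch import nn

layer = nn.Sequential(nn.Linear(512, 512), nn.ReLU())
x = torch.randn(256, 512)
h = layer(x)

# Units that are exactly zero could be skipped entirely by event-driven hardware.
active_fraction = (h > 0).float().mean().item()
print(f"fraction of active units: {active_fraction:.2f}")
```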
Self-Supervised Learning (Generative Adversarial Networks)
Requirements for Future DL Software
- What is needed is a software framework for differentiable programming that is interactive, flexible, dynamic, and efficient.
- It is often impractical to develop high-volume applications or embedded applications that rely on Python at runtime.
- For static compute graphs, there is no issue: one can export the graph to a standard format such as ONNX (Open Neural Net Exchange) and use one of the numerous ONNX-compliant backends (see the export sketch after this list).
- For dynamic networks, there are two main options: One is to provide a compiler; another option is to design a suitable compilable language from scratch.
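A minimal sketch of exporting a static PyTorch graph to ONNX so that a Python-free, ONNX-compliant backend can run inference; the model and file name are illustrative:

```python
import torch
from torch import nn

model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                      nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
model.eval()

dummy_input = torch.randn(1, 3, 224, 224)            # example input that fixes the traced shapes
torch.onnx.export(model, dummy_input, "model.onnx")  # graph is traced, then serialized to ONNX
```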
Requirements for Future DL Hardware for Training
- One problem is that sparsity, architecture dynamicity, and modules that manipulate non-tensor data (graphs) break the assumption that one can perform computation on batches of identically-sized samples.
- Need new hardware architectures that can function efficiently with a batch size of one.
- Handling sparse structured data is another requirement.
- When most units are off most of the time, it may become advantageous to make our hardware event driven, so that only the units that are activated consume resources
Requirements for Future DL Hardware for Inference
- Applications such as real-time/low-latency object tracking, 3D reconstruction, instance labelling, facial reconstruction, and predictive compression and display require extremely low-power ASICs for DL inference.
- The solution to power constraints may well be the exploitation of sparse activations, perhaps using event-based computation