In the digital and internet era, massive amounts of data are continuously generated from a variety of sources, including video, photos, audio, text, the internet of things, etc. Intuitively, feeding more data into an effective analysis should yield more accurate patterns; despite redundancy in the data, a clearer picture can be delineated for better decision-making. However, traditional methods, even in machine learning, do not benefit from this expanding amount of data: their performance nearly saturates once the data collection is large enough (Figure 1A). This dilemma arises from the limited capacity of such methods and, historically, from an insufficient supply of computing power.
Figure 1. Diagram of deep-learning realizations
(A) Comparison of how the amount of data affects the performance of deep learning versus conventional methods.
(B) Procedures in deep learning that can be accelerated: inference and backward propagation (error/gradient calculation and parameter update). The side bar denotes the energy cost of the inference (90%) and training (10%) phases.
(C) Development trend of DNN accelerators, from pure microchips to physical systems. MZI, Mach-Zehnder interferometer.
A breakthrough came in 2012 with deep learning: a deep neural network (DNN) can be effectively trained on graphics-processing units (GPUs) with phenomenal performance [1]. A DNN mimics the biological neural networks of the brain with stacked layers of transformations of sufficient complexity to approximate arbitrary functions. The major mathematical operations in DNNs, such as matrix multiplication, convolution, and other repetitive custom computations, can be arranged in a parallel configuration that matches well with the computational mechanism of GPUs. Processing efficiency is therefore boosted and outperforms previous realizations on serial computation units (e.g., central-processing units [CPUs]). Deep learning has inspired a broad range of applications in computer vision (CV), natural language processing (NLP), biomedicine, games, and many others.
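As a minimal illustration of why this arithmetic maps so well onto parallel hardware, the sketch below (assuming PyTorch, with arbitrarily chosen, purely illustrative layer sizes) times the same batched matrix multiplication, the workhorse operation of a DNN layer, on a CPU and, when one is available, on a GPU.

```python
import time
import torch

def time_matmul(device: str, reps: int = 50) -> float:
    """Time a batched matrix multiplication (the core DNN operation) on one device."""
    x = torch.randn(256, 1024, device=device)   # a batch of 256 input vectors
    w = torch.randn(1024, 1024, device=device)  # weights of one dense layer
    _ = x @ w                                   # warm-up so lazy initialization is not timed
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(reps):
        _ = x @ w                               # the operation GPUs parallelize well
    if device == "cuda":
        torch.cuda.synchronize()                # wait for queued GPU kernels to finish
    return (time.perf_counter() - start) / reps

print(f"CPU: {time_matmul('cpu') * 1e3:.3f} ms per matmul")
if torch.cuda.is_available():
    print(f"GPU: {time_matmul('cuda') * 1e3:.3f} ms per matmul")
```

The exact speedup depends on the hardware at hand; the point is only that every output element of the product can be computed independently, which is what parallel units exploit.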
On the other hand, the computational requirements of these fast-expanding DNN applications significantly outpace the development of silicon chips (i.e., Moore's law), and the current computational facilities for deep learning come with huge energy costs. DNN accelerators are therefore urgently demanded for fast processing and energy efficiency.
Learning accelerators can be either general or special purpose, with implementations based on microchips/conventional electronics (digital) or on physical systems (analog) that tackle the notorious "memory wall" and "power wall." Orders-of-magnitude improvements have been achieved in both processing speed and energy efficiency. These achievements significantly shorten the implementation of large DNNs, even when fed with millions or trillions of data samples. Today, an accelerated DNN for image classification or language translation can be trained within a couple of hours, whereas without these accelerators the same task might take several weeks. Such efficiency has made these artificial intelligence (AI)-specific technologies more accessible and has significantly accelerated the transfer of deep-learning-based products toward commercialization.
Two essential phases, namely training and inference (Figure 1B), are the targets for speeding up the learning process, and the two pose different requirements. The training phase involves feeding massive amounts of data into the DNN for the designed task to update and finalize the parameters of the DNN via backward-propagation optimization. After training, the DNN with fixed (optimized) parameters enters the inference phase to predict the desired result from the input data. In many applications, training is a one-time effort, while the trained DNN keeps functioning in inference for significantly longer durations. Hence, inference consumes up to 90% of the energy cost in deep learning [2], which is the chief concern for most accelerators.
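The split between the two phases can be seen in a minimal sketch (PyTorch is assumed; the tiny network and synthetic data are illustrative only): training runs the forward pass, error/gradient calculation, and parameter update, whereas inference is only the forward pass with parameters frozen.

```python
import torch
import torch.nn as nn

# A tiny illustrative DNN; real networks stack many more layers.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

# --- Training phase: forward pass, error/gradient calculation, parameter update ---
x, y = torch.randn(64, 16), torch.randn(64, 1)   # synthetic data for illustration
for _ in range(100):
    optimizer.zero_grad()
    loss = loss_fn(model(x), y)   # forward pass and error calculation
    loss.backward()               # backward propagation of gradients
    optimizer.step()              # parameter update

# --- Inference phase: forward pass only, parameters fixed ---
model.eval()
with torch.no_grad():             # no gradients are computed or stored
    prediction = model(torch.randn(1, 16))
```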
In inference, most of the processing time is occupied by matrix multiplication, and hardware accelerators are devised around this mathematical structure. The most widely used and most accessible chips are general-purpose GPUs, which can run at least an order of magnitude faster when Nvidia's customized Tensor Cores boost matrix-multiplication efficiency. To further improve energy efficiency, DNN-specific computing units have been developed, such as the tensor-processing units (TPUs) announced by Google, which are designed for Google's own deep-learning framework (TensorFlow), and neural-processing units (NPUs), which hard-wire the architecture of the DNN onto the chip. Because DNN computations in these highly customized configurations are not interrupted by other tasks, energy efficiency can be improved severalfold for the same computational tasks.
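To make concrete why matrix multiplication dominates inference, the sketch below (PyTorch, with arbitrary, purely illustrative layer sizes) shows that the forward pass of one fully connected layer, the building block repeated throughout a DNN, is nothing more than a single matrix multiply plus a bias add; this is the primitive that Tensor Cores and TPUs are built to accelerate.

```python
import torch
import torch.nn as nn

# One fully connected layer with illustrative sizes (1,024 inputs, 4,096 outputs).
layer = nn.Linear(1024, 4096)
x = torch.randn(32, 1024)            # a batch of 32 input vectors

# The layer's forward pass is a matrix multiplication plus a bias: y = x W^T + b.
y_layer = layer(x)
y_matmul = x @ layer.weight.T + layer.bias
assert torch.allclose(y_layer, y_matmul, atol=1e-4)

# Roughly 2 * batch * in_features * out_features multiply-accumulate operations,
# all mutually independent -- exactly the workload dedicated matrix units target.
flops = 2 * x.shape[0] * layer.in_features * layer.out_features
print(f"{flops / 1e6:.1f} MFLOPs in a single layer's matmul")
```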
Physical systems provide analog solutions that mimic the above-mentioned neural processing and computation, in which data transformation is dictated by physical mechanisms and requires negligible power consumption. This forms a sharp contrast to microchip systems, where digital and/or Boolean operations of high precision and low noise ensure computational accuracy. Deep learning, however, can deal with data of low precision and high noise, which is exactly what physical systems usually produce; such noise can be conquered, to some extent, by injecting noise during the training phase. More rigorously, system stability can be physically strengthened by adding stabilizers into the packaging of the system, though sophisticated engineering is necessary. Optics (or photonics), for example, is of particular interest since light-matter interaction inherently carries out Fourier transforms, matrix multiplication, convolution, and many other operations such as nonlinearity [3].
An optical system is therefore inherently empowered to fulfill CV tasks, such as image classification and transformation, at the speed of light. To realize multiple convolutional kernels, an optical component enabling amplitude and/or phase modulation across multiple channels needs to be embedded into the system. This criterion is met by optical components such as ground-glass diffusers, diffractive layers, multimode fibers, and specifically designed metasurfaces. The DNN kernel weights represented by these components are fixed in a certain physical configuration (e.g., customized three-dimensional [3D]-printed layers [4]) based on training simulated in a computer; free-space optics therefore focuses on the inference phase of deep learning. With proper manipulation, an optical system becomes versatile in deep learning, with applications not limited to CV but extending to NLP tasks such as vowel classification. In addition to mimicking the mathematical operations of DNNs with a physical system, it is also feasible to train the hardware's physical transformation to match specific tasks/applications, such as ultrafast optical second-harmonic generation, multimode mechanical oscillations, and the analog dynamics of a nonlinear triode. Thanks to continuous development, the inference performance for the same task is approaching the state-of-the-art results provided by microchip systems.
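The computational relation that lens-based free-space optics exploits can be sketched numerically: a lens performs a Fourier transform, so a modulation mask in the Fourier plane multiplies the spectrum and yields a convolution of the input, effectively at the speed of light. The snippet below only illustrates this convolution theorem digitally (PyTorch; the image and kernel are random placeholders and do not reproduce any specific optical setup from the cited work).

```python
import torch

# Random placeholders for an input image and a convolution kernel
# (double precision keeps the numerical cross-check tight).
image = torch.randn(64, 64, dtype=torch.float64)
kernel = torch.randn(64, 64, dtype=torch.float64)

# Convolution theorem: convolution in real space equals pointwise multiplication
# in Fourier space -- the relation a two-lens (4f) optical setup realizes, with
# the kernel encoded as a modulation mask in the Fourier plane.
spectrum = torch.fft.fft2(image) * torch.fft.fft2(kernel)
fourier_conv = torch.fft.ifft2(spectrum).real

# Direct (circular) convolution computed explicitly, as a slow digital cross-check.
direct = torch.zeros_like(image)
for dx in range(kernel.shape[0]):
    for dy in range(kernel.shape[1]):
        direct += kernel[dx, dy] * torch.roll(image, shifts=(dx, dy), dims=(0, 1))

assert torch.allclose(fourier_conv, direct)
```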
Another approach to extending the applications of physical systems lies in the training phase, which accounts for the rest of the energy cost in deep learning (∼10%). Physical systems are not commonly designed to speed up the training phase because their configurations are usually passively determined and the embedded DNN parameters are not trainable (or changeable) without re-configuration. This poor flexibility makes it difficult to fine-tune physical networks for better performance or to transfer the system's function to other applications. In neuromorphic computing, tunable units (e.g., phase-change components in photonic circuits) are designed to store the synaptic weights, accompanied by control circuits [5]. Trainable parameters can also be encoded into or combined with the input data; the physical system here merely serves as an operator (e.g., adders, multipliers, or nonlinear transformations) actuated by physical mechanisms. More demonstrations of such physical systems are expected to inspire and incubate AI-specific edge modules or smart sensors, especially for scenarios where low latency is necessary. Internet-of-things (IoT) applications, such as self-driving cars and smart-home systems, could benefit the most, as they inherently process physical data from the environment and demand fast processing.
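As a schematic of this last strategy (purely illustrative, with a fixed random matrix standing in for the untrainable physical transformation), the sketch below keeps the "physical" operator frozen and places all trainable parameters in a digital encoding layer applied to the input and in a digital readout, so gradient-based training never needs to modify the physical system itself.

```python
import torch
import torch.nn as nn

class FrozenPhysicalOperator(nn.Module):
    """Stand-in for an untrainable physical transformation (e.g., a fixed scattering medium)."""
    def __init__(self, dim: int):
        super().__init__()
        # Fixed random linear transform followed by a nonlinearity; stored as a buffer,
        # so it is never updated by the optimizer.
        self.register_buffer("transform", torch.randn(dim, dim) / dim**0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.tanh(x @ self.transform)

dim = 64
physical = FrozenPhysicalOperator(dim)
encoder = nn.Linear(dim, dim)    # trainable parameters combined with the input
readout = nn.Linear(dim, 10)     # trainable digital readout

optimizer = torch.optim.Adam(list(encoder.parameters()) + list(readout.parameters()), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# Synthetic data for illustration only.
x, labels = torch.randn(128, dim), torch.randint(0, 10, (128,))
for _ in range(200):
    optimizer.zero_grad()
    logits = readout(physical(encoder(x)))   # the physical system acts as a fixed operator
    loss = loss_fn(logits, labels)
    loss.backward()                          # gradients reach only the encoder and readout
    optimizer.step()
```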
Achieving trainable physical systems is not the end goal (Figure 1C). Microchips remain inevitable accessories for analog-to-digital/digital-to-analog conversion (ADC/DAC), data processing, data storage, and workflow control. The heat dissipation and temporal latency of these electronics could loop back as the bottleneck of physical systems in deep learning. Seamless interaction between the physical mechanism and its accessories is expected, which necessitates specific design and sophisticated engineering for their integration. Furthermore, co-packaging and miniaturization are also critical challenges, since not all physical systems can be engineered and manufactured at a small size. Such bulky systems limit the number of neurons in a single network and, hence, the network's complexity. Currently, the largest number of neurons reported in a trainable, chip-based physical system occurs in photonic circuits (on the scale of 10²), while the numbers in large DNNs, such as ResNet-101 and U-Net, are on the scale of 10⁵. Such restrictions are expected to be addressed so that AI-specific physical systems can become equivalent to the DNNs running on computers, which would enable considerably more complex tasks, such as image segmentation, image style transfer, and language translation.
In summary, various accelerators have been developed to improve DNN computation in terms of both processing speed and energy efficiency. An overall trend can be seen in computation platforms evolving from conventional electronics toward physical systems, with progressive reductions in energy consumption. While current realizations of physical systems have limitations, vigorous development is anticipated to continuously improve energy efficiency while delivering the desired performance.