Deep physical neural networks trained with backpropagation
Physics-aware training
To train the PNNs presented in Figs. 2–4, we used PAT, which allowed us to perform backpropagation by treating the physical apparatuses as automatic differentiation (autodiff) functions within PyTorch54 (v1.6). We also used PyTorch Lightning61 (v0.9) and Weights and Biases62 (v0.10) during development. PAT is explained in detail in Supplementary Section 1, where it is compared with standard backpropagation and with training physical devices in silico. Here we provide only an overview of PAT in the context of a generic multilayer PNN (Supplementary Figs. 2, 3).
PAT can be formalized by the use of custom constituent autodiff functions for the physically executed submodules in an overall network architecture (Supplementary Fig. 1). In PAT, each physical system’s forward functionality is provided by the system’s own controllable physical transformation, which can be thought of as a parameterized function \({f}_{{\rm{p}}}\) that relates the input x, parameters θ and output y of the transformation via \({\bf{y}}={f}_{{\rm{p}}}({\bf{x}},{\boldsymbol{\theta }})\). As a physical system cannot be auto-differentiated, we use a differentiable digital model \({f}_{{\rm{m}}}\) to approximate each backward pass through a given physical module. This structure is essentially a generalization of quantization-aware training48, in which low-precision neural network hardware is approximated by quantizing weights and activation values on the forward pass, while the weights and activations are stored, and the backward pass is performed, at full precision.
To see how this works, we consider here the specific case of a multilayer feedforward PNN with standard stochastic gradient descent. In this case, the PAT algorithm with the above-defined custom autodiff functions results in the following training loop:
Perform forward pass:
$${{\bf{x}}}^{[l+1]}={{\boldsymbol{y}}}^{[l]}={f}_{{\rm{p}}}({{\bf{x}}}^{[l]},{{\boldsymbol{\theta }}}^{[l]})$$
(1)
Compute (exact) error vector:
$${g}_{{{\bf{y}}}^{[N]}}=\frac{\partial L}{\partial {{\bf{y}}}^{[N]}}=\frac{\partial {\mathscr{l}}}{\partial {{\bf{y}}}^{\left[N\right]}}({{\bf{y}}}^{\left[N\right]},{{\bf{y}}}_{{\rm{target}}})$$
(2)
Perform backward pass:
$${g}_{{{\bf{y}}}^{[l-1]}}={\left[\frac{{\rm{\partial }}{f}_{{\rm{m}}}}{{\rm{\partial }}{\bf{x}}}({{\bf{x}}}^{[l]},{{\boldsymbol{\theta }}}^{[l]})\right]}^{{\rm{T}}}{g}_{{{\bf{y}}}^{[l]}}$$
(3a)
$${g}_{{{\boldsymbol{\theta }}}^{\left[l\right]}}={\left[\frac{\partial {f}_{{\rm{m}}}}{\partial {\boldsymbol{\theta }}}({{\bf{x}}}^{\left[l\right]},{{\boldsymbol{\theta }}}^{\left[l\right]})\right]}^{{\rm{T}}}{g}_{{{\bf{y}}}^{\left[l\right]}}$$
(3b)
Update parameters:
$${{\boldsymbol{\theta }}}^{\left[l\right]}\to {{\boldsymbol{\theta }}}^{\left[l\right]}-\eta \frac{1}{{N}_{{\rm{data}}}}\sum _{k}{g}_{{{\boldsymbol{\theta }}}^{\left[l\right]}}^{(k)}$$
(4)
where \({g}_{{{\boldsymbol{\theta }}}^{\left[l\right]}}\) and \({g}_{{{\bf{y}}}^{\left[l\right]}}\) are estimators of the physical systems’ exact gradients, \(\frac{\partial L}{\partial {{\boldsymbol{\theta }}}^{[l]}}\) and \(\frac{\partial L}{\partial {{\bf{y}}}^{[l]}}\), respectively, for the \([l]\)th layer, obtained by auto-differentiation of the model, \(L\) is the loss, \({\mathscr{l}}\) is the loss function (for example, cross-entropy or mean-squared error), \({{\bf{y}}}_{{\rm{target}}}\) is the desired (target) output, \({N}_{{\rm{data}}}\) is the size of the batch and \(\eta \) is the learning rate. \({{\bf{x}}}^{[l+1]}\) is the input vector to the \([l+1]\)th layer, which for the hidden layers of the feedforward architecture is equal to the output vector of the previous layer, \({{\bf{x}}}^{[l+1]}={{\bf{y}}}^{[l]}={f}_{{\rm{p}}}\left({{\bf{x}}}^{\left[l\right]},{{\boldsymbol{\theta }}}^{\left[l\right]}\right)\), where \({{\boldsymbol{\theta }}}^{[l]}\) is the controllable (trainable) parameter vector for the \([l]\)th layer. For the first layer, the input vector \({{\bf{x}}}^{\left[1\right]}\) is the data to be operated on. In PAT, the error vector is exact (\({g}_{{{\bf{y}}}^{\left[N\right]}}=\frac{\partial L}{\partial {{\bf{y}}}^{[N]}}\)) because the forward pass is performed by the physical system. This error vector is then backpropagated via equations (3a) and (3b), which involve Jacobian matrices of the differentiable digital model evaluated at the correct inputs at each layer (that is, the actual physical inputs), \({\left[\frac{{\rm{\partial }}{f}_{{\rm{m}}}}{{\rm{\partial }}{\bf{x}}}({{\bf{x}}}^{[l]},{{\boldsymbol{\theta }}}^{[l]})\right]}^{{\rm{T}}}\), where T denotes the transpose. Thus, in addition to utilizing the output of the PNN (\({{\bf{y}}}^{[N]}\)) via physical computations in the forward pass, intermediate outputs (\({{\bf{y}}}^{[l]}\)) are also utilized to facilitate the computation of accurate gradients in PAT.
Because PAT is implemented simply by defining a custom autodiff function, generalizing it to more complex architectures, such as multichannel or hybrid physical–digital models, and to different loss functions is straightforward. See Supplementary Section 1 for details.
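As an illustration, the following is a minimal sketch of how such a custom autodiff function can be written in PyTorch, in the spirit of equations (1), (3a) and (3b). Here `physical_system` (a callable that runs the apparatus and returns its measured output as a tensor) and `f_m` (a trained differentiable digital model) are illustrative placeholders, not the authors' released code:

```python
import torch

class PATFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, theta, physical_system, f_m):
        ctx.save_for_backward(x, theta)
        ctx.f_m = f_m
        # Forward pass is executed by the physical system, equation (1).
        return physical_system(x, theta)

    @staticmethod
    def backward(ctx, grad_y):
        x, theta = ctx.saved_tensors
        # Backward pass uses vector-Jacobian products of the digital model,
        # evaluated at the true physical inputs, equations (3a) and (3b).
        with torch.enable_grad():
            x_ = x.detach().requires_grad_(True)
            theta_ = theta.detach().requires_grad_(True)
            y_m = ctx.f_m(x_, theta_)
            grad_x, grad_theta = torch.autograd.grad(
                y_m, (x_, theta_), grad_outputs=grad_y)
        # No gradients are returned for the non-tensor arguments.
        return grad_x, grad_theta, None, None

# Hypothetical usage inside a layer's forward pass:
# y = PATFunction.apply(x, theta, physical_system, f_m)
```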
An intuitive motivation for why PAT works is that the training’s optimization of parameters is always grounded in the true optimization landscape by the physical forward pass. With PAT, even if gradients are estimated only approximately, the true loss function is always precisely known. As long as the gradients estimated by the backward pass are reasonably accurate, optimization will proceed correctly. Although the required training time is expected to increase as the error in gradient estimation increases, in principle it is sufficient for the estimated gradient to be pointing closer to the direction of the true gradient than its opposite (that is, that the dot product of the estimated and true gradients is positive). Moreover, by using the physical system in the forward pass, the true output from each intermediate layer is also known, so gradients of intermediate physical layers are always computed with respect to correct inputs. In any form of in silico training, compounding errors build up through the imperfect simulation of each physical layer, leading to a rapidly diverging simulation–reality gap as training proceeds (see Supplementary Section 1 for details). As a secondary benefit, PAT ensures that learned models are inherently resilient to noise and other imperfections beyond a digital model, as the change of loss along noisy directions in parameter space will tend to average to zero. This makes training robust to, for example, device–device variations, and facilitates the learning of noise-resilient (and, more speculatively, noise-enhanced) models8.
Differentiable digital models
To perform PAT, a differentiable digital model of the physical system’s input–output transformation is required. Any model, \({f}_{{\rm{m}}}\), of the physical system’s true forward function, \({f}_{{\rm{p}}}\), can be used to perform PAT, so long as it can be auto-differentiated. Viable approaches include traditional physics models, black-box machine-learning models13,63,64 and physics-informed machine-learning65 models.
In this work, we used the black-box strategy for our differentiable digital models, namely DNNs trained on input–output vector pairs from the physical systems as \({f}_{{\rm{m}}}\) (except for the mechanical system). Two advantages of this approach are that it is fully general (it can be applied even to systems for which one has no underlying knowledge-based model) and that the accuracy can be extremely high, at least for physical inputs, \(({\bf{x}},{\boldsymbol{\theta }})\), within the distribution of the training data (for out-of-distribution generalization, we expect physics-based approaches to offer advantages). In addition, the fact that each physical system has a precise corresponding DNN means that the resulting PNN can be analysed as a network of DNNs, which may be useful for explaining the PNN’s learned physical algorithm.
For our DNN differentiable digital models, we used a neural architecture search66 to optimize hyperparameters, including the learning rate, number of layers and number of hidden units in each layer. Typical optimal architectures involved 3–5 layers with 200–1,000 hidden units in each, trained using the Adam optimizer, a mean-squared-error loss function and learning rates of around \({10}^{-4}\). For more details, see Supplementary Section 2D.1.
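For concreteness, a minimal sketch of how such a digital model might be constructed and fitted is shown below, using the representative hyperparameters quoted above (layer count, hidden width, Adam, mean-squared error, learning rate of about 1e-4); the function names, dataset tensors, batch size and epoch count are illustrative assumptions rather than the authors' actual code:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

def make_digital_model(in_dim, out_dim, hidden=500, depth=4):
    # Fully connected DNN used as the differentiable digital model f_m.
    layers, d = [], in_dim
    for _ in range(depth - 1):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, out_dim))
    return nn.Sequential(*layers)

def fit_digital_model(model, inputs, outputs, epochs=100, lr=1e-4, batch_size=256):
    # Fit f_m to measured (x, theta) -> y pairs with Adam and a mean-squared-error loss.
    loader = DataLoader(TensorDataset(inputs, outputs),
                        batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for xb, yb in loader:
            opt.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model
```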
For the nonlinear optical system, the test accuracy of the trained digital model (Supplementary Fig. 20) shows that the model is remarkably accurate compared with typical simulation–experiment agreement in broadband nonlinear optics, especially considering that the pulses used exhibit a complex spatiotemporal structure owing to the pulse shaper. The model is not, however, an exact description of the physical system: the typical error for each element of the output vector is about 1–2%. For the analogue electronic circuit, agreement is also good, although worse than the other systems (Supplementary Fig. 23), corresponding to around 5–10% prediction error for each component of the output vector. For the mechanical system, we found that a linear model was sufficient to obtain excellent agreement, which resulted in a typical error of about 1% for each component of the output vector (Supplementary Fig. 26).
In silico training
To train PNNs in silico, we applied a training loop similar to the one described above for PAT, except that both the forward and backward passes were performed using the digital model (Supplementary Figs. 1, 3), with one further modification noted below.
To improve the performance of in silico training as much as possible and permit the fairest comparison with PAT, we also modelled the input-dependent noise of the physical system and used this within the forward pass of in silico training. To do this, we trained, for each physical system, an additional DNN to predict the eigenvectors of the output vector’s noise covariance matrix, as a function of the physical system’s input vector and parameter vector. These noise models thus provided an input- and parameter-dependent estimate of the distribution of noise in the output vector produced by the physical system. We were able to achieve excellent agreement between the noise models’ predicted noise distributions and experimental measurements (Supplementary Figs. 18, 19). We found that including this noise model improved the performance of experiments performed using parameters derived from in silico training. Consequently, all in silico training results presented in this paper make use of such a model, except for the mechanical system, where a simpler, uniform noise model was found to be sufficient. For additional details, see Supplementary Section 2D.2.
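As a rough illustration of how such a noise model can enter the in silico forward pass, the sketch below draws correlated, input-dependent noise from a predicted set of scaled covariance eigenvectors. The interface of `noise_net` (a network returning the eigenvectors scaled by the square roots of their eigenvalues, packaged as a single matrix) is an assumption made for illustration only; the actual noise models are described in Supplementary Section 2D.2:

```python
import torch

def noisy_forward(f_m, noise_net, x, theta):
    # Sample a simulated noisy output y ~ N(f_m(x, theta), V V^T).
    y_mean = f_m(x, theta)        # digital model's noise-free prediction
    V = noise_net(x, theta)       # (out_dim, k) scaled noise eigenvectors (assumed interface)
    z = torch.randn(V.shape[-1])  # standard normal coefficients
    return y_mean + V @ z         # add input- and parameter-dependent correlated noise
```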
Although even complex, accurate noise models do not allow in silico training to perform as well as PAT, we recommend that such models be used whenever in silico training is performed, for example for physical architecture search and design, and possibly for pre-training (Supplementary Section 5): the correspondence with experiment (and, in particular, the predicted peak accuracy achievable there) is significantly better than with simpler noise models or with physical noise ignored.
Ultrafast nonlinear optical pulse propagation experiments
For experiments with ultrafast nonlinear pulse propagation in quadratic nonlinear media (Supplementary Figs. 8–10), we shaped pulses from a mode-locked titanium:sapphire laser (Spectra Physics Tsunami, centred around 780 nm and pulse duration around 100 fs) using a custom pulse shaper. Our optical pulse shaper used a digital micromirror device (DMD, Vialux V-650L) and was inspired by the design in ref. 67. Despite the binary modulations of the individual mirrors, we were able to achieve multilevel spectral amplitude modulation by varying the duty cycle of gratings written to the DMD along the dimension orthogonal to the diffraction of the pulse frequencies. To control the DMD, we adapted code developed for ref. 68, which is available at ref. 69.
After being shaped by the pulse shaper, the femtosecond pulses were focused into a 0.5-mm-long beta-barium borate crystal. The multitude of frequencies within the broadband pulses then undergo various nonlinear optical processes, including sum-frequency generation and second-harmonic generation (SHG). The pulse shaper imparts a complex phase and spatiotemporal structure on the pulse, which depend on the input and parameters applied through the spectral modulations. These features would make it impossible to accurately model the experiment using a one-dimensional pulse propagation model. For simplicity, we refer to this complex, spatiotemporal quadratic nonlinear pulse propagation as ultrafast SHG.
Although the functionality of the SHG-PNN does not rely on a closed-form mathematical description or indeed on any form of mathematical isomorphism, some readers may find it helpful to understand the approximate form of the input–output transformation realized in this experimental apparatus. We emphasize that the following model is idealistic and meant to convey key intuitions about the physical transformation: the model does not describe the experimental transformation in a quantitative manner, owing to the numerous experimental complexities described above.
The physical transformation of the ultrafast SHG setup is seeded by the infrared light from the titanium:sapphire laser. This ultrashort pulse can be described by the Fourier transform of the electric field envelope of the pulse, \({A}_{0}(\omega )\), where ω is the frequency of the field detuned relative to the carrier frequency. For simplicity, consider a pulse consisting of a set of discrete frequencies or frequency bins, whose spectral amplitudes are described by the discrete vector \({{\bf{A}}}_{{\bf{0}}}={[{A}_{0}({\omega }_{1}),{A}_{0}({\omega }_{2}),\ldots ,{A}_{0}({\omega }_{N})]}^{{\rm{T}}}\). After passing through the pulse shaper, the spectral amplitudes of the pulse are then given by
$${\bf{A}}={[\sqrt{{x}_{1}}\,{A}_{0}({\omega }_{1}),\sqrt{{x}_{2}}\,{A}_{0}({\omega }_{2}),\ldots ,\sqrt{{\theta }_{1}}\,{A}_{0}({\omega }_{{N}_{x}+1}),\sqrt{{\theta }_{2}}\,{A}_{0}({\omega }_{{N}_{x}+2}),\ldots ]}^{{\rm{T}}},$$
(5)
where \({N}_{x}\) is the dimensionality of the data vector, \({\theta }_{i}\) are the trainable pulse-shaper amplitudes and \({x}_{i}\) are the elements of the input data vector. Thus, the output from the pulse shaper encodes both the machine-learning data and the trainable parameters. Square roots are present in equation (5) because the pulse shaper was deliberately calibrated to perform an intensity modulation.
The output from the pulse shaper (equation (5)) is then input to the ultrafast SHG process. The propagation of an ultrashort pulse through a quadratic nonlinear medium results in an input–output transformation that roughly approximates an autocorrelation, or nonlinear convolution, assuming that the dispersion during propagation is small and the input pulse is well described by a single spatial mode. In this limit, the output blue spectrum \(B\left({\omega }_{i}\right)\) is mathematically given by
$$B({\omega }_{i})=k\sum _{j}A({\omega }_{i}+{\omega }_{j})A({\omega }_{i}-{\omega }_{j}),$$
(6)
where the sum is over all frequency bins j of the pulsed field. The output of the trainable physical transformation, \({\bf{y}}={f}_{{\rm{p}}}\left({\bf{x}},{\boldsymbol{\theta }}\right)\), is given by the blue pulse’s spectral power, \({\bf{y}}={[{|B({\omega }_{1})|}^{2},{|B({\omega }_{2})|}^{2},\ldots ,{|B({\omega }_{N})|}^{2}]}^{{\rm{T}}}\), where \(N\) is the length of the output vector.
From this description, it is clear that the physical transformation realized by the ultrafast SHG process is not isomorphic to any conventional neural network layer, even in this idealized limit. Nonetheless, the physical transformation retains some key features of typical neural network layers. First, the physical transformation is nonlinear, as the SHG process involves the squaring of the input field. Second, as the terms within the summation in equation (6) involve both parameters and input data, the transformation also mixes the different elements of the input data and parameters to produce an output. This mixing of input elements is similar to, but not necessarily directly mathematically equivalent to, the mixing of input vector elements that occurs in the matrix-vector multiplications or convolutions of conventional neural networks.
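To make the structure of this idealized map concrete, the following sketch implements equations (5) and (6) directly, treating the spectral amplitudes as real and mapping frequencies to discrete bin indices (terms whose indices fall outside the spectrum are simply dropped). It is meant only to convey the form of the transformation and, as emphasized above, does not quantitatively model the experiment:

```python
import torch

def ideal_shg(x, theta, A0, k=1.0):
    # Equation (5): intensity modulation of the seed spectrum A0 by data x and parameters theta.
    A = torch.sqrt(torch.cat([x, theta])) * A0
    N = A.shape[0]
    B = torch.zeros(N)
    # Equation (6): B(w_i) = k * sum_j A(w_i + w_j) * A(w_i - w_j), indices kept in range.
    for i in range(N):
        for j in range(N):
            ip, im = i + j, i - j
            if 0 <= ip < N and 0 <= im < N:
                B[i] += A[ip] * A[im]
    B = k * B
    return B ** 2  # measured output y: spectral power of the second-harmonic field
```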
Vowel classification with ultrafast SHG
A task often used to demonstrate novel machine-learning hardware is the classification of spoken vowels according to formant frequencies10,11. The task involves predicting the spoken vowel given a 12-dimensional input data vector of formant frequencies extracted from audio recordings10. Here we use the vowel dataset from ref. 10, which is based on data originally from ref. 70; the data are available at https://homepages.wmich.edu/~hillenbr/voweldata.html. This dataset consists of 273 input–output pairs. We used 175 pairs for the training set, 49 for the validation set and 49 for the test set. For the results in Figs. 2, 3, we optimized the hyperparameters of the PNN architecture using the validation error and evaluated the test error only after all optimization was conducted. In Fig. 3c, for each PNN with a given number of layers, the experiment was conducted with two different training, validation and test splits of the vowel data; the line plots the mean over the two splits, and the error bars show the standard error of the mean.
For the vowel-classification PNN presented in Figs. 2, 3, the input vector to each SHG physical layer is encoded in a contiguous short-wavelength section of the spectral modulation vector sent to the pulse shaper, and the trainable parameters are encoded in the spectral modulations applied to the rest of the spectrum. For the physical layers after the first layer, the input vector to the physical system is the measured spectrum obtained from the previous layer. For convenience, we performed digital renormalization of these output vectors to maximize the dynamic range of the input and ensure that inputs were within the allowed range of 0 to 1 accepted by the pulse shaper. Relatedly, we found that training stability was improved by applying additional trainable digital rescaling parameters to the forward-fed vector, allowing the overall bias and amplitude scale of the physical inputs to each layer to be adjusted during training. These digital parameters appear to have a negligible role in the final trained PNN (when the physical transformations are replaced by identity operations, the network can be trained to perform no better than chance, and the final trained values of the scale and bias parameters are all very close to 1 and 0, respectively). We hypothesize that these trainable rescaling parameters help during training by allowing the network to escape noise-affected subspaces of parameter space. See Supplementary Section 2E.1 for details.
The vowel-classification SHG-PNN architecture (Supplementary Fig. 21) was designed to be as simple as possible while still demonstrating the use of a multilayer architecture with a physical transformation that is not isomorphic to a conventional DNN layer, and so that the computations involved in performing the classification were essentially all performed by the physical system itself. Many aspects of the design are not optimal with respect to performance, so design choices, such as our specific choice to partition input data and parameter vectors into the controllable parameters of the experiment, should not be interpreted as representing any systematic optimization. Similarly, the vowel-classification task was chosen as a simple example of multidimensional machine-learning classification. As this task can be solved almost perfectly by a linear model, it is in fact poorly suited to the nonlinear optical transformations of our SHG-PNN, which are fully nonlinear (Supplementary Figs. 9, 10). Overall, readers should not interpret this PNN’s design as suggestive of optimal design strategies for PNNs. For initial guidelines on optimal design strategies, we instead refer readers to Supplementary Section 5.
MNIST handwritten digit image classification with a hybrid physical–digital SHG-PNN
The design of the hybrid physical–digital MNIST PNN based on ultrafast SHG for handwritten digit classification (Fig. 4i–l) was chosen to demonstrate a proof-of-concept PNN in which substantial digital operations were co-trained with substantial physical transformations, and in which no digital output layer was used (although a digital output layer can be used with PNNs, and we expect such a layer will usually improve performance, we wanted to avoid confusing readers familiar with reservoir computing, and so avoided using digital output layers in this work).
The network (Supplementary Fig. 29) involves four trainable linear input layers that operate on MNIST digit images, whose outputs are fed into four separate channels in which the SHG physical transformation is used twice in succession (that is, it is two physical layers deep). The outputs of the final layer of each channel (the final SHG spectra) are concatenated, then summed into ten bins to perform a classification. The structure of the input layer was chosen to minimize the complexity of inputs to the pulse shaper. We found that the output second-harmonic spectra produced by the nonlinear optical process tended towards featureless triangular spectra if inputs were close to a random uniform distribution. Thus, to ensure that output spectra varied significantly with respect to changes in the input spectral modulations, we made sure that inputs to the pulse shaper would exhibit a smoother structure in the following way. For each of the 4 independent channels, 196-dimensional input images (downsampled from 784-dimensional 28 × 28 images) are first operated on by a 196 by 50 trainable linear matrix, and then (without any intervening nonlinear digital operations) by a second 50 by 196 trainable linear matrix. The second 50 by 196 matrix is identical for all channels, the intent being that this matrix identifies optimal ‘input modes’ to the SHG process. By varying the middle dimension of this two-step linear input layer, one may control the amount of structure (number of ‘spectral modes’) allowed in inputs to the pulse shaper, as the middle dimension effectively controls the rank of the total linear matrix. We found that a middle dimension below 30 resulted in the most visually varied SHG output spectra, but that 50 was sufficient for good performance on the MNIST task. In this network, we also utilized skip connections between layers in each channel. This was done so that the network would be able to ‘choose’ to use the linear digital operations to perform the linear part of the classification task (for which nearly 90% accuracy can be obtained55) and thus to rely on the SHG co-processor primarily for the harder, nonlinear part of the classification task. Between the physical layers in each channel, a trainable, element-wise rescaling was used to allow us to train the second physical-layer transformations efficiently. That is, \({x}_{i}={a}_{i}{y}_{i}+{b}_{i}\), where \({b}_{i}\) and \({a}_{i}\) are trainable parameters, and \({x}_{i}\) and \({y}_{i}\) are the input to the pulse shaper and the measured output spectrum from the previous physical layer, respectively.
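A minimal sketch of this two-step, effectively low-rank digital input layer is given below. The class name is hypothetical; the physical layers, skip connections, inter-layer rescaling and final binning are omitted; and the shared matrix plays the role of the learned ‘input modes’ described above:

```python
import torch
import torch.nn as nn

class LowRankInputLayer(nn.Module):
    # Per-channel 196x50 matrices followed by a 50x196 matrix shared by all channels,
    # so the middle dimension bounds the rank of each channel's overall linear map.
    def __init__(self, n_channels=4, in_dim=196, mid_dim=50):
        super().__init__()
        self.per_channel = nn.ModuleList(
            [nn.Linear(in_dim, mid_dim, bias=False) for _ in range(n_channels)])
        self.shared = nn.Linear(mid_dim, in_dim, bias=False)  # learns shared 'input modes'

    def forward(self, img):
        # img: (batch, 196) downsampled digit; returns one pulse-shaper input per channel.
        return [self.shared(layer(img)) for layer in self.per_channel]
```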
For further details on the nonlinear optical experimental setup and its characterization, we refer readers to Supplementary Section 2A. For further details on the vowel-classification SHG-PNN, we refer readers to Supplementary Section 2E.1, and for the hybrid physical–digital MNIST handwritten digit-classification SHG-PNN, we refer readers to Supplementary Section 2E.4.
Analogue electronic circuit experiments
The electronic circuit used for our experiments (Supplementary Fig. 11) was a resistor–inductor–capacitor (RLC) oscillator with a transistor embedded within it. It was designed to produce as nonlinear and complex a response as possible, while still containing only a few simple components (Supplementary Figs. 12, 13). The experiments were carried out with standard bulk electronic components, a hobbyist circuit breadboard and a USB data acquisition (DAQ) device (Measurement Computing USB-1208-HS-4AO), which allowed for one analogue input and one analogue output channel with a sampling rate of \(1\,{\rm{MS}}\,{{\rm{s}}}^{-1}\).
The electronic circuit provides only a one-dimensional time-series input and a one-dimensional time-series output. To partition the inputs to the system into trainable parameters and input data, and thereby control the circuit’s transformation of the input data, we found it most convenient to apply the parameters as a trainable, element-wise rescaling of the input time-series vector. That is, \({x}_{i}={a}_{i}{y}_{i}+{b}_{i}\), where \({b}_{i}\) and \({a}_{i}\) are trainable parameters, \({y}_{i}\) are the components of the input data vector and \({x}_{i}\) are the rescaled components of the voltage time series that is then sent to the analogue circuit. For the first layer, \({y}_{i}\) are the unrolled pixels of the input MNIST image. For hidden layers, \({y}_{i}\) are the components of the output voltage time-series vector from the previous layer.
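This element-wise rescaling (used in similar form for the optical, electronic and mechanical PNNs) can be expressed as a small trainable module; a minimal sketch, with a hypothetical class name and initialization, is:

```python
import torch
import torch.nn as nn

class ElementwiseRescale(nn.Module):
    # Trainable element-wise rescaling x_i = a_i * y_i + b_i of a 1D time-series vector.
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.ones(dim))   # trainable scale a_i
        self.b = nn.Parameter(torch.zeros(dim))  # trainable offset b_i

    def forward(self, y):
        return self.a * y + self.b               # rescaled vector x sent to the physical system
```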
We found that the electronic circuit’s output was noisy, primarily owing to the timing-jitter noise that resulted from operating the DAQ at its maximum sampling rate (Supplementary Fig. 23). Rather than reducing this noise by operating the device more slowly, we designed the PNN architecture presented in Fig. 4 so that it could automatically learn to function robustly and accurately, even in the presence of up to 20% noise per output vector element (see Supplementary Fig. 24 for an expanded depiction of the architecture). First, seven three-layer feedforward PNNs were trained together, with the final prediction provided by averaging the outputs of all seven. Second, skip connections similar to those used in residual neural networks were employed71. These measures make the output of the network effectively an ensemble average over many different subnetworks71, which allows it to perform accurately and train smoothly despite the very high physical noise and multilayer design.
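For clarity, the averaging readout amounts to something like the following sketch, where `sub_pnns` stands in for the seven co-trained three-layer sub-networks (an illustrative interface, not the authors' code):

```python
import torch

def ensemble_predict(sub_pnns, x):
    # Average the class scores of the seven co-trained three-layer sub-PNNs.
    outputs = [pnn(x) for pnn in sub_pnns]         # one prediction per sub-network
    return torch.stack(outputs, dim=0).mean(dim=0)  # ensemble-averaged prediction
```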
For further details on the analogue electronic experimental setup and its characterization, we refer readers to Supplementary Section 2B. For further details on the MNIST handwritten digit-classification analogue electronic PNN, we refer readers to Supplementary Section 2E.2.
Oscillating mechanical plate experiments
The mechanical plate oscillator was constructed by attaching a 3.2 cm by 3.2 cm by 1 mm titanium plate to a long, centre-mounted screw, which was fixed to the voice coil of a commercial full-range speaker (Supplementary Figs. 14, 15). The speaker was driven by an audio amplifier (Kinter K2020A+) and the oscillations of the plate were recorded using a microphone (Audio-Technica ATR2100x-USB Cardioid Dynamic Microphone). The diaphragm of the speaker was completely removed so that the sound recorded by the microphone was produced only by the oscillating metal plate.
As the physical input to, and output from, the mechanical oscillator are one-dimensional time series, as with the electronic circuit, we made use of element-wise trainable rescaling to conveniently parameterize and train the oscillating plate’s physical transformations.
The mechanical PNN architecture for the MNIST handwritten digit classification task was chosen to be the simplest multilayer PNN architecture possible with such a one-dimensional dynamical system (Supplementary Fig. 27). As the mechanical plate’s input–output responses are primarily linear convolutions (Supplementary Figs. 16, 17), it is well suited to the MNIST handwritten digit classification task, achieving nearly the same performance as a digital linear model55.
For further details on the oscillating mechanical plate experimental setup and its characterization, we refer readers to Supplementary Section 2C. For further details on the MNIST handwritten digit-classification oscillating mechanical plate PNN, we refer readers to Supplementary Section 2E.3.