Fluctuation-driven initialization for spiking neural network training

Spiking neurons communicate through discrete action potentials, or spikes, thereby enabling energy-efficient and reliable information processing in neurobiological and neuromorphic systems [1, 2]. Before using a spiking neural network (SNN) for any application, its connections need to be task-optimized. In conventional artificial neural networks (ANNs) this step is accomplished through direct end-to-end optimization using back-propagation in combination with suitable parameter initialization [3]. However, the lack of smooth derivatives of neuronal spiking dynamics precludes the direct use of gradient-based optimization in SNNs. One increasingly common approach to overcome this issue is surrogate gradient (SG) learning [4–6], which relies on continuous relaxations of the actual gradients for parameter updates. While SGs are a powerful tool for building functional SNN models, they can be adversely affected by poor initial parameter choices. In deep ANNs, suboptimal weight initialization can lead to vanishing or exploding gradients [7–9], thereby creating a major impediment to their use. Optimal weight initialization [10–12], combined with suitable architectural choices such as skip connections [11, 13], largely avoids this issue in ANNs. Similarly, the problem of vanishing gradients has been suggested to affect deep SNNs [14, 15]. However, we still lack a principled strategy for SNN initialization.

Here, we close this gap by introducing a practical weight initialization strategy for SNNs. Specifically, we draw inspiration from neurobiology, where neuronal dynamics commonly exhibit fluctuation-driven firing [16, 17]. Since neurons in the fluctuation-driven regime are more sensitive to small changes in the input [18] and thus also to changes in their synaptic weights, we hypothesized that this regime could be advantageous for subsequent SG learning. In the following, we develop a general, yet simple initialization theory for SNNs consisting of leaky integrate-and-fire (LIF) neurons, and empirically demonstrate its effectiveness for task-optimizing SNNs using SG techniques.

Neurons in biological SNNs commonly exhibit irregular and asynchronous firing dynamics [16, 17, 19]. Such dynamics can often be attributed to large sub-threshold fluctuations that can naturally arise through excitatory–inhibitory balance commonly observed in neurobiology [19, 20]. To test whether this fluctuation-driven regime could constitute a suitable initial state for subsequent learning, we proceeded in two steps. First, we derived a set of compact analytical expressions that link the initial synaptic weight distribution with the magnitude of sub-threshold fluctuations. Second, we numerically tested whether initializing SNNs in the fluctuation-driven regime would allow us to rapidly train these networks to high accuracy using SGs.

To arrive at analytical expressions, we note that there are primarily three factors that contribute to the membrane potential fluctuations (figure 1(a)). These are, first, the number and firing statistics of the input neurons, second, the synaptic weight distribution, and third, the postsynaptic and neuronal parameters that govern temporal integration of the inputs. For simplicity, we assume that the presynaptic input arrives from a homogeneous population of independent Poisson neurons and that the initial weight distribution is given by a Gaussian. Further, we limited our derivation to current-based LIF neurons, which are commonly used in SNN models.

Figure 1. Parameterization of fluctuation-driven spiking serves as an initialization strategy for SNNs. (a) Incoming presynaptic Poisson spike trains (i) are weighted by synaptic strengths $w_j$ and filtered through a PSP kernel $\epsilon(t)$ (ii) to yield membrane fluctuations $u(t)$ in a postsynaptic neuron (iii). In the fluctuation-driven regime, the membrane potential crosses the firing threshold $\theta$ stochastically, resulting in irregular output spike trains. Because the magnitude of membrane potential fluctuations, $\sigma_U$, is determined by the parameters of the presynaptic weight distribution, $\mu_W$ and $\sigma_W$, synaptic weights can be initialized from a target value for the fluctuation magnitude. (b) Expected and observed distributions of the membrane potential without considering spike-reset dynamics for different target fluctuation strengths expressed in terms of $\sigma_U$ and $\xi$. (c) As panel (b), but considering the spike reset dynamics in the numerical simulations.

To derive an expression that links the synaptic weight distribution to the fluctuation magnitude, we consider a current-based LIF neuron with membrane potential $U$, whose sub-threshold dynamics are given as the sum of weighted filtered presynaptic spike trains $S_j$:

$$U(t)=\sum_{j} w_j\,\left(S_j * \epsilon\right)(t), \tag{1}$$

where $S_j(t)=\sum_{k}\delta(t-t_j^k)$ denotes the output spike train of the presynaptic neuron $j$ with firing times $t_j^k$, and $*$ denotes the temporal convolution of the spike train $S_j(t)$ with $\epsilon$, a linear filter kernel with the shape of an evoked postsynaptic potential (PSP). Specifically, we assume a synaptic model with exponentially decaying currents and, therefore, the shape of $\epsilon$ is fully characterized by the synaptic and membrane time constants $\tau_{\mathrm{syn}}$ and $\tau_{\mathrm{mem}}$ (see methods and supplementary material S1). Since we have many statistically independent inputs, the central limit theorem guarantees that $U$ approaches a normal distribution, which is fully specified by its mean $\mu_U$ and variance $\sigma_U^2$. Further assuming that presynaptic spikes are generated by homogeneous Poisson processes with associated firing rates $\nu_j=\langle S_j\rangle$ yields the following expressions for the mean and the variance:

$$\mu_U=\bar{\epsilon}\sum_{j} w_j\,\nu_j, \tag{2}$$

$$\sigma_U^2=\hat{\epsilon}\sum_{j} w_j^2\,\nu_j, \tag{3}$$

in which $\bar{\epsilon}=\int_{-\infty}^{\infty}\epsilon(s)\,\mathrm{d}s$ and $\hat{\epsilon}=\int_{-\infty}^{\infty}\epsilon(s)^2\,\mathrm{d}s$ are the definite integrals of the filter kernel and the squared filter kernel, respectively, which can be obtained analytically for many common neuron models (supplementary material S1). For $n$ inputs with equal firing rates $\nu_j=\nu$ and independently drawn normally distributed weights $W\sim\mathcal{N}\left(\mu_W,\sigma_W^2\right)$, the above expressions further simplify to

$$\mu_U=n\,\mu_W\,\nu\,\bar{\epsilon}, \tag{4}$$

$$\sigma_U^2=n\left(\sigma_W^2+\mu_W^2\right)\nu\,\hat{\epsilon}. \tag{5}$$

Finally, rewriting equations (4) and (5) yields the desired expressions linking the synaptic weight distribution with the magnitude of the membrane potential fluctuations:

$$\mu_W=\frac{\mu_U}{n\,\nu\,\bar{\epsilon}}, \tag{6}$$

$$\sigma_W^2=\frac{\sigma_U^2}{n\,\nu\,\hat{\epsilon}}-\mu_W^2. \tag{7}$$

For a neuron to be in the fluctuation-driven regime, the bulk of the Gaussian membrane potential distribution must lie below the firing threshold (figure 1(b)). At the same time, we require a non-vanishing probability to cross the threshold to ensure some baseline level of spiking activity. To formalize these requirements, we introduced the target parameter $\xi$ as

$$\xi=\frac{\theta-\mu_U}{\sigma_U}, \tag{8}$$

which describes the distance between the mean membrane potential $\mu_U$ and the spike threshold $\theta$ in units of the standard deviation $\sigma_U$ (figure 1(a) and supplementary figure S1). To satisfy the above requirements, $\xi$ should be on the order of one. Concretely, we consider the range $1\leqslant\xi\leqslant 3$ (figure 1(b)). For zero-mean weight distributions and a threshold of $\theta=1$, this directly translates into a desired fluctuation amplitude range $\frac{1}{3}\leqslant\sigma_U\leqslant 1$, which, given the above assumptions, is achieved by

$$\mu_W=0 \quad\text{and}\quad \sigma_W^2=\frac{\sigma_U^2}{n\,\nu\,\hat{\epsilon}} \tag{9}$$

at initialization.
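The mapping from a target fluctuation magnitude to weight parameters can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. It assumes the relations $\mu_W=\mu_U/(n\nu\bar{\epsilon})$ and $\sigma_W^2=\sigma_U^2/(n\nu\hat{\epsilon})-\mu_W^2$, and it assumes a unit-area double-exponential PSP kernel $\epsilon(t)=(e^{-t/\tau_{\mathrm{mem}}}-e^{-t/\tau_{\mathrm{syn}}})/(\tau_{\mathrm{mem}}-\tau_{\mathrm{syn}})$ for $t\geqslant 0$, whose normalization may differ from the one used in the methods. All function names are hypothetical.

```python
import math


def psp_kernel_integrals(tau_mem: float, tau_syn: float) -> tuple[float, float]:
    """Return (eps_bar, eps_hat) for the ASSUMED unit-area kernel
    eps(t) = (exp(-t/tau_mem) - exp(-t/tau_syn)) / (tau_mem - tau_syn), t >= 0.

    eps_bar is the integral of eps, eps_hat the integral of eps squared.
    """
    if tau_mem == tau_syn:
        raise ValueError("tau_mem and tau_syn must differ for this kernel form")
    eps_bar = 1.0                               # kernel integrates to one
    eps_hat = 1.0 / (2.0 * (tau_mem + tau_syn))  # closed form of the squared integral
    return eps_bar, eps_hat


def weight_params(n: int, nu: float, eps_bar: float, eps_hat: float,
                  mu_u: float = 0.0, sigma_u: float = 1.0) -> tuple[float, float]:
    """Map target membrane statistics (mu_u, sigma_u) to (mu_w, sigma_w)."""
    mu_w = mu_u / (n * nu * eps_bar)
    var_w = sigma_u ** 2 / (n * nu * eps_hat) - mu_w ** 2
    if var_w < 0.0:
        raise ValueError("targets are not reachable: implied weight variance is negative")
    return mu_w, math.sqrt(var_w)


def xi(theta: float, mu_u: float, sigma_u: float) -> float:
    """Distance between mean potential and threshold in units of sigma_u."""
    return (theta - mu_u) / sigma_u


# Hypothetical example values, chosen only for illustration.
eps_bar, eps_hat = psp_kernel_integrals(tau_mem=20e-3, tau_syn=5e-3)
mu_w, sigma_w = weight_params(n=100, nu=10.0, eps_bar=eps_bar, eps_hat=eps_hat)
```

With $\theta=1$ and $\mu_U=0$, the targets $\sigma_U=1$ and $\sigma_U=1/3$ correspond to $\xi=1$ and $\xi=3$, the two ends of the suggested range.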

The above expressions are based on a no-spiking assumption. Hence, we expect systematic deviations from the derived membrane potential distribution in the presence of spiking. However, for small sub-threshold fluctuations ($\sigma_U \ll 1$), the systematic contribution of the spike reset becomes negligible (figure 1(c)). Exact membrane potential distributions that take spike reset dynamics into consideration could be obtained using the Fokker–Planck equation [21, 22]; however, such an approach does not yield compact analytic expressions and is thus less practical for our purposes.
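The no-reset prediction can be checked numerically with a short Monte Carlo simulation. The sketch below is an assumption-laden illustration rather than the simulation used for figure 1: it draws zero-mean Gaussian weights, feeds independent Poisson inputs through an assumed unit-area double-exponential kernel in discrete time, omits the spike reset entirely, and compares the observed standard deviation of the membrane potential with the value predicted from the sampled weights. The function name and all default parameter values are hypothetical.

```python
import numpy as np


def simulate_membrane_std(n=200, nu=10.0, sigma_u_target=0.5,
                          tau_mem=20e-3, tau_syn=5e-3,
                          dt=1e-3, duration=20.0, seed=0):
    """Return (observed_std, predicted_std, target_std) without spike reset."""
    rng = np.random.default_rng(seed)

    # Sampled PSP kernel (assumed form) and a discrete estimate of eps_hat.
    t = np.arange(0.0, 10.0 * tau_mem, dt)
    eps = (np.exp(-t / tau_mem) - np.exp(-t / tau_syn)) / (tau_mem - tau_syn)
    eps_hat = np.sum(eps ** 2) * dt

    # Zero-mean Gaussian weights with sigma_w chosen for the target sigma_u.
    sigma_w = sigma_u_target / np.sqrt(n * nu * eps_hat)
    w = rng.normal(0.0, sigma_w, size=n)

    # Independent homogeneous Poisson inputs: spike counts per time bin.
    steps = int(duration / dt)
    counts = rng.poisson(nu * dt, size=(steps, n))

    # Weighted input per bin, filtered by the kernel (no reset, no threshold).
    drive = counts.astype(float) @ w
    u = np.convolve(drive, eps)[:steps]
    u = u[len(eps):]  # discard the initial transient

    # Prediction conditional on the sampled weights: eps_hat * sum_j w_j^2 * nu.
    predicted = np.sqrt(eps_hat * nu * np.sum(w ** 2))
    return float(u.std()), float(predicted), float(sigma_u_target)


observed, predicted, target = simulate_membrane_std()
```

The gap between `predicted` and `target` reflects the finite number of sampled weights, while the gap between `observed` and `predicted` reflects the finite simulation time.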

Because equations (4) and (5) are based on an independence assumption that is violated by real-world data, we expected further deviations in numerical simulations with such data. To quantify the magnitude of these deviations, we compared the predictions of equations (4) and (5) with observed membrane potential fluctuations in a single LIF neuron exposed to inputs from two realistic datasets. For simplicity, we assumed a zero-mean weight distribution and used equation (9) to obtain its standard deviation for different target fluctuation magnitudes $\sigma_U$.

First, we considered a synthetic classification dataset based on random manifolds that can flexibly generate SNN benchmarks of arbitrary complexity [5] (Randman; see methods). We generated a dataset with $n_{\mathrm{Randman}} = 20$ input neurons and 10 classes in which spike times belonging to the same class are drawn from a smooth random manifold (figure 2(a)), whereas different classes correspond to different manifolds. For each input pattern, each neuron fires precisely one spike during a 100 ms interval. Each 100 ms input interval was followed by 100 ms of inactivity in the input layer to allow for a propagation delay in the hidden layer (figure 2(a)). We then recorded the membrane potential distribution and found, as expected, that it deviated from a Gaussian (figure 2(b)) due to the temporal non-stationarity and structure of the input. Next, we measured the observed membrane potential fluctuations $\hat{\sigma}_U$ for varying target values of $\sigma_U$ (figure 2(c)). We found that $\hat{\sigma}_U$ was systematically smaller than $\sigma_U$. However, the magnitude of this bias was comparable to the expected variability in the case of Poisson inputs (see supplementary material S2).

Figure 2. Real-world datasets induce small systematic biases in fluctuation strength at initialization. (a) Two one-dimensional example manifolds from the Randman dataset, embedded into a three-dimensional space (left) and example spike raster plots corresponding to a sample from each class (right). (b) Theoretically expected distribution and numerically obtained density histogram of the membrane potential of a single neuron without spike reset in response to the Randman dataset. Because of large peaks at $u(t) = 0$, the x-axes in the first and middle panels have been truncated to 45% and 80% of their maximum, respectively. (c) Numerically observed $\hat{\sigma}_U$ as a function of the target $\sigma_U$ for the Randman dataset. The expected relationship corresponds to homogeneous and independent Poisson neurons. Shaded regions indicate standard deviation across neurons. (d) Two spike rasters that correspond to two example inputs from the SHD dataset. Input spikes are obtained by filtering recordings of spoken digits with a biologically inspired cochlear model [23]. (e) As panel (b), for the SHD dataset. The x-axes in the first and middle panels have been truncated to 58% and 90% of their maximum, respectively. (f) As panel (c), for the SHD dataset.

Next, we considered the Spiking Heidelberg Digits (SHD) speech dataset (figure 2(d)), an SNN benchmark based on real-world auditory data, which consists of approximately 10 000 spoken digits in German and English that have been converted into spikes using a biologically plausible cochlear model [23]. Importantly, SHD has a larger number of input neurons ($n_{\mathrm{SHD}} = 700$), which typically fire more than one spike, with an average input firing rate of $\nu_{\mathrm{SHD}} = 15.8$ Hz. Again, we measured the membrane potential distribution and observed deviations from a Gaussian (figure 2(e)). In contrast to the Randman data, the observed fluctuations $\hat{\sigma}_U$ were systematically larger than their target $\sigma_U$ due to heavy tails in the distribution (figure 2(f)). Not surprisingly, real-world data causes systematic deviations from equations (4) and (5), but these differences were on the same order as the fluctuations expected from the finite sample size of the weights and from Poisson variability. Hence, we reasoned that our simple theory provides a reasonable approximation for initializing SNNs in the fluctuation-driven regime even when using real-world data.

2.1. Initialization of shallow SNNs

We sought to evaluate whether the fluctuation-driven regime constitutes a good initialization strategy for SNN training. To this end, we trained a fully connected SNN with one hidden layer of 128 units on the Randman dataset (see table 4; methods). We initialized the weights using the parameters $\mu_W = 0$ and $\sigma_W$ given by our theory (equation (9) with target $\sigma_U = 1$). This choice resulted in asynchronous irregular firing activity consistent with the fluctuation-driven regime (supplementary figure S2(a)–(d)). Subsequently, we trained the network in a supervised fashion using SGs with previously established parameters [5], back-propagation through time (BPTT), a maximum-over-time loss defined on ten readout units, and weak spiking activity regularization in the form of a soft upper bound on the population firing rate at the hidden layer (figure 3(a); methods). Training resulted in an SNN that accurately solved the task (test accuracy: 97.3% ± 0.2; train accuracy: 99.6% ± 0.0; figure 3(b)).
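The maximum-over-time loss mentioned above can be illustrated with a small NumPy sketch. It assumes the readout membrane potentials are available as an array of shape (batch, time, classes) and that the per-class maximum over time is passed through a softmax followed by cross-entropy. The function name and array layout are hypothetical, and the sketch only evaluates the loss; in actual training the gradient would be obtained with an automatic differentiation framework and SGs.

```python
import numpy as np


def max_over_time_loss(u_out: np.ndarray, labels: np.ndarray) -> float:
    """Cross-entropy on the per-class maximum readout potential over time.

    u_out:  readout membrane potentials, shape (batch, time, classes)
    labels: integer class labels, shape (batch,)
    """
    logits = u_out.max(axis=1)                             # maximum over the time axis
    logits = logits - logits.max(axis=1, keepdims=True)    # numerical stabilization
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(labels)), labels]     # log-probability of the true class
    return float(-picked.mean())


# Example: a silent readout with ten classes gives a chance-level loss of log(10).
loss_at_chance = max_over_time_loss(np.zeros((2, 5, 10)), np.array([0, 3]))
```

Because only the time step with the largest potential contributes for each readout unit, the loss does not depend on when during the trial the readout responds.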

Figure 3. Initialization in the fluctuation-driven regime results in optimal learning performance. (a) Top: schematic of the SNN used for training. Bottom: illustration of the learning dynamics. The supervised loss function $\mathcal{L}_{\mathrm{sup}}$ relies on the maximum membrane potential over time of readout units $U_i^{(\text{out})}[t]$, to which a Softmax and cross-entropy loss $\mathcal{L}_{\text{CE}}$ is applied. All networks were trained by minimizing $\mathcal{L}_{\mathrm{sup}}$ in the direction of negative SGs, computed with BPTT. (b) Snapshot of network activity over time after training on the Randman dataset. Bottom: spike raster of input layer activity from two different samples corresponding to two different classes. Middle: spike raster of hidden layer activity. Top: membrane potential of readout units. The readout units corresponding to the two input classes are highlighted in different shades. (c) Heatmap showing validation accuracy after training on the Randman dataset as a function of the parameters of the synaptic weight distribution at initialization. (d) Same as panel (b), but for a network trained on the SHD dataset. (e) Same as panel (c), but for the SHD dataset. (f) Validation accuracy as a function of target fluctuation magnitude $\sigma_U$ for initializations in the balanced state with $\mu_U = \mu_W = 0$. The shaded region around the lines indicates the range of values across five random seeds. The sand-colored shaded region corresponds to our suggested target fluctuation magnitude $\frac{1}{3}\leqslant \sigma_U \leqslant 1$. (g) Average hidden layer firing rate as a function of $\sigma_U$. (h) Average magnitude of SGs with respect to the output in the readout layer (top) and hidden layer (bottom) as a function of $\sigma_U$. (i) Same as panel (h), but for the average magnitude of SGs with respect to the synaptic weights.

To test whether our weight initialization strategy confers an advantage over other choices of $\mu_W$ and $\sigma_W$, we performed an extensive parameter search and measured validation accuracy after 200 training epochs. The network achieved the best validation accuracy when $\mu_W$ was zero or negative and $\sigma_W$ was close to one (figure 3(c)), well within our suggested regime of $1 \leqslant \xi \leqslant 3$. Further, we found a large parameter regime that supported learning at close-to-optimal accuracy for $-2 \leqslant \mu_W \leqslant 0$ and $\sigma_W < 10$, which extends beyond the parameter regime suggested by our theory.

To test whether these results would change on a more complex task, we trained a similar SNN on the SHD dataset [23] with weight parameters $\mu_W = 0$ and $\sigma_W^{(\mathrm{SHD})} = 0.23$ as suggested by our theory (equation (9) with target $\sigma_U = 1$). Due to differences in the number of input neurons and firing rates between the two datasets, our theory predicts $\sigma_W^{(\mathrm{SHD})} \approx 10^{-1}\,\sigma_W^{(\mathrm{Randman})}$. After training, the network accurately classified spoken digits (test accuracy: 65.5% ± 0.7; train accuracy: 100.0% ± 0.0; figure 3(d)). As before, we performed an extensive parameter search over different initializations and found that networks initialized in the fluctuation-driven regime ($1 \leqslant \xi \leqslant 3$) showed close-to-optimal performance (figures 3(e) and (f)). Unlike in the Randman case, the parameter regime with good performance was much smaller and tightly constrained around $\mu_W \approx 0$. Finally, even though our initialization strategy posits that neurons be in the fluctuation-driven regime, we observed a sizeable fraction of hidden layer neurons with regular firing activity both before (supplementary figure S2(e)) and after learning (figure 3(d)). We found that our theory predicts these cases (supplementary figures S2(d) and (h)) due to the inherent variability in the sampling of synaptic weights (supplementary material S2 and supplementary figure S3).
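The predicted factor of roughly $10^{-1}$ between the two datasets follows from the scaling $\sigma_W \propto 1/\sqrt{n\nu}$ for zero-mean weights at a fixed target $\sigma_U$ and a fixed kernel. The sketch below reproduces this arithmetic under one assumption that is not stated in the text: the Randman input rate is taken as 5 Hz, that is, one spike per neuron per 200 ms trial including the silent period. The function name is hypothetical, and the sketch does not attempt to reproduce the absolute value of 0.23, which depends on the kernel normalization in the methods.

```python
import math


def sigma_w_ratio(n_a: int, nu_a: float, n_b: int, nu_b: float) -> float:
    """Ratio sigma_W(a) / sigma_W(b) for equal target sigma_U and equal kernel.

    Uses sigma_W proportional to 1 / sqrt(n * nu) for zero-mean weights.
    """
    return math.sqrt((n_b * nu_b) / (n_a * nu_a))


# SHD values are taken from the text; the Randman rate of 5 Hz is an ASSUMPTION.
ratio_shd_to_randman = sigma_w_ratio(n_a=700, nu_a=15.8, n_b=20, nu_b=5.0)
```

Under these assumptions the ratio evaluates to about 0.095, consistent with the stated order of magnitude.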

For both datasets, we found that initialization with $\sigma_W \ll 1$ and $\mu_W \approx 0$ supported close-to-optimal learning. This result surprised us because the ensuing vanishing membrane potential fluctuations should lead to quiescent hidden layer activity. To check whether this is indeed the case, we initialized networks with different target values for $\sigma_U$ and recorded their hidden layer activity. We found that fluctuation magnitudes $\sigma_U \ll 1$ still supported close-to-optimal learning performance (figure 3(f)) despite, as expected, an absence of spikes in the hidden layer at the time of initialization (figure 3(g)).

Because vanishing spiking activity should influence gradient magnitudes during backpropagation, we recorded the magnitude of the SG with respect to the output at the readout and hidden layers at the time of the first training epoch. Due to the nature of the loss function, initialization does not affect the magnitude of the gradient in the readout layer but can change the magnitude of the gradient by two orders of magnitude in the hidden layer (figure 3(h)). Consequently, the absolute magnitude of weight changes is also amplified in the hidden layer when fluctuations are large (figure 3(i)). Since the synaptic weight update depends on presynaptic activity, initializations resulting in quiescent hidden layers (figure 3(g)) lead to an absence of weight updates in the readout layer (figure 3(i)). However, as long as SGs do not vanish in the first layer, the network can recover spike propagation and therefore gradient flow during training. That the network is able to learn without problems in this regime may seem surprising at first and is indeed a peculiarity of SGs.

In addition to classification accuracy, the sparsity of neuronal activity is a key SNN performance indicator. To limit firing rates in the hidden layers to a sensible regime, we optimized all networks with activity regularization. Specifically, we added a soft upper bound on the population firing rate (see methods). This regularization penalizes population firing rates in the hidden layers that exceed 10 Hz. To investigate the effect of weight initialization on sparsity, we systematically recorded population firing rates of the above network trained with or without activity regularization. As expected, activity regularization resulted in average population firing rates of Hz following training, independent of the target $\sigma_U$ at initialization. In contrast, networks trained without activity regularization exhibited population firing rates exceeding 60 Hz in the hidden layer and only weak dependence on the target $\sigma_U$ at initialization (supplementary figures S4(a) and (b)). Next, we wanted to ensure that activity regularization does not result in a substantial loss in classification accuracy. To that end, we compared the accuracy of networks trained with and without activity regularization for the given threshold and strength parameters. We found that regularized networks performed only slightly worse than their unregularized counterparts, albeit with vastly reduced average firing rates (supplementary figure S4(c)). Based on these findings, we used activity regularization on the population firing rates in all subsequent experiments.
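A soft upper bound of this kind can be sketched as a one-sided penalty on the population firing rate. The exact functional form is given in the methods and is not reproduced here; the squared-hinge form, the function name, and the default strength below are assumptions made for illustration. Only the 10 Hz threshold is taken from the text.

```python
import numpy as np


def upper_bound_rate_penalty(spikes: np.ndarray, dt: float,
                             threshold_hz: float = 10.0,
                             strength: float = 1.0) -> float:
    """One-sided penalty on the population firing rate of a hidden layer.

    spikes: binary spike array of shape (batch, time, neurons)
    dt:     simulation time step in seconds
    """
    duration = spikes.shape[1] * dt
    # Population rate per sample: spike count averaged over neurons, per second.
    pop_rate = spikes.sum(axis=1).mean(axis=1) / duration
    # Only rates above the threshold are penalized (ASSUMED squared-hinge form).
    excess = np.maximum(pop_rate - threshold_hz, 0.0)
    return float(strength * np.mean(excess ** 2))


# A silent layer incurs no penalty; a saturated layer incurs a large one.
silent = upper_bound_rate_penalty(np.zeros((1, 100, 4)), dt=1e-3)
saturated = upper_bound_rate_penalty(np.ones((1, 100, 4)), dt=1e-3)
```

Because the bound acts on the population average, individual neurons may still fire above the threshold as long as the layer as a whole remains sparse.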

Thus far, we studied strictly feed-forward SNNs without recurrent hidden layer connections. However, recurrent SNNs typically perform better than feed-forward networks on tasks requiring memory, such as SHD [5]. We therefore extended our initialization strategy to networks with recurrent connections (see methods) and applied it to recurrent SNNs with one hidden layer. As in the case of feed-forward networks, we found that recurrent SNNs trained well with sufficiently small target fluctuations $\sigma_U$ (supplementary figures S5(a) and (b)). In summary, shallow SNNs are surprisingly robust to initialization when the absolute magnitude of the weights is small. In practice, initialization with $\mu_U = 0$ and a target fluctuation magnitude $\sigma_U \leqslant 1$ can be used to achieve close-to-optimal learning performance.

2.2. Initialization of deep SNNs

We hypothesized that deep SNNs are more sensitive to initialization, as is the case with deep ANNs [11]. To test this hypothesis, we first extended our initialization strategy to deep and recurrent convolutional spiking neural network (CSNN) architectures (see methods). We then initialized several CSNN architectures with different numbers of recurrently connected hidden layers according to equations (37)–(39) with target $\mu_U = 0$ and different targets $\sigma_U$. Subsequently, we trained the resulting networks and measured validation accuracy on held-out data. As expected, sensitivity to the fluctuation magnitude at initialization increased with network depth (figure 4(a)). As in shallow fully connected networks, CSNNs with a single hidden layer were remarkably robust to initialization, and close-to-optimal training performance was achieved for $\sigma_U \leqslant 1$. In contrast, deep networks with three hidden layers performed well only when the fluctuation magnitude fell into the regime $0.05 \leqslant \sigma_U \leqslant 3$. This regime narrowed further in deeper networks with seven hidden layers, which only achieved high validation accuracy for initializations in the range $0.5 \leqslant \sigma_U \leqslant 2$. As in the shallow case, activity regularization ensured sparse activity with a negligible effect on classification accuracy (supplementary figures S4(d) and (e)). Finally, although a seven-layer CSNN did not improve classification performance on this task over the three-layer network, we wanted to know whether initialization with $\sigma_U = 1$ would be conducive to training even deeper networks. To address this question, we extended our network to ten hidden layers, the deepest architecture afforded by our GPU memory while keeping all other training parameters equal to those of the seven-layer networks, and found that initialization with $\sigma_U = 1$ resulted in reliable training (test accuracy: 81.1% ± 1.6; validation accuracy: 93.6% ± 2.9). Crucially, when we instead trained with Kaiming initialization [11], the standard initialization for non-spiking rectified linear unit (ReLU) networks, learning failed in CSNNs with seven or more hidden layers. In summary, we observed that fluctuation-driven initialization with $\sigma_U = 1$ supports learning in deep CSNNs.

Figure 4. Deep CSNNs are sensitive to initialization due to vanishing SGs. (a) Validation accuracy as a function of target fluctuation strength $\sigma_U$ for recurrent CSNNs of increasing depth. All networks were trained on the SHD dataset. The triangular markers in the right plot correspond to the values of $\sigma_U$ plotted in panels (d)–(f). The shaded region around the lines indicates the range of values across five random seeds. The sand-colored shaded region corresponds to our suggested target fluctuation magnitude $\frac{1}{3}\leqslant \sigma_U \leqslant 1$. The dashed line corresponds to Kaiming initialization. (b) Test error of the five best-performing models in terms of validation accuracy, for different numbers of hidden layers and for networks with and without recurrent connections in the hidden layers. *No initialization parameter sweeps were performed for networks with ten hidden layers. Instead, the data depict results obtained from five networks initialized with target $\sigma_U = 1$. (c) Training speed of CSNNs, as illustrated by the number of required epochs to reach 90% training accuracy on the SHD dataset. (d) Population firing rate at initialization (before training) as a function of hidden layers in a CSNN with seven hidden layers, for different values of $\sigma_U$. (e) As panel (d), but displaying the magnitude of SGs. (f) As panel (d), but displaying the magnitude of the synaptic weight update. When membrane potential fluctuations are so small that neurons in the previous layer do not spike, the weight update equals zero.

To check whether depth increases the generalization performance of trained networks, we compared the test error of successfully trained CSNNs with one, three, seven, and ten hidden layers. We found that deeper networks did not show better generalization performance than one-layer networks (figure 4(b)). These findings suggest that the addition of multiple hidden layers does not provide an advantage in recurrently connected CSNNs on the SHD dataset. Since recurrently connected networks can be considered deep in time, we wondered whether strictly feed-forward SNNs would benefit from increasing depth. To that end, we repeated the training of deep CSNNs with corresponding layer sizes but without recurrent hidden layer connections on the SHD dataset (see methods). Indeed, we found that deep feed-forward SNNs performed better than shallow feed-forward SNNs (figure 4(b) and supplementary figure S6).

In addition to the classification accuracy, weight initialization affects training speed. To check whether fluctuation-driven initialization is conducive to fast training, we measured the number of epochs required to reach 90% accuracy on the training dataset in CSNNs with one and three hidden layers. We found that networks initialized in the fluctuation-driven regime ($\sigma_U = 1$) trained fastest (figure 4(c) and supplementary figure S7). On average, CSNNs with one hidden layer initialized with target $\sigma_U = 1$ reached 90% training accuracy after 19.2 epochs, and CSNNs with three hidden layers required 16.8 epochs to reach 90% training accuracy. Thus, fluctuation-driven initialization is conducive to fast training.

Vanishing SGs impair learning in deep SNNs. In deep ANNs, initialization is closely related to the problem of vanishing or exploding gradients [7–9]. We wondered whether this mechanism, i.e., vanishing or exploding SGs, prevented training in deep SNNs when $\sigma_U$ falls outside the optimal regime. To test this idea, we initialized seven-layer CSNNs with different targets $\sigma_U$ and recorded the neuronal activity in hidden layers. As in shallow SNNs (figure 3(g)), initializations with small $\sigma_U$ led to quiescent hidden layers in deep CSNNs, which impaired the activity propagation to deeper layers (figure 4(d)). Specifically, in networks initialized with $\sigma_U = 0.5$, only the first four hidden layers exhibited spiking activity. This effect was amplified in networks initialized with $\sigma_U = 0.2$, in which all but the first hidden layer were quiescent. In contrast, networks initialized with $\sigma_U = 2$ exhibited a strong increase in firing rates in deeper layers, and initializations with $\sigma_U = 20$ caused firing rates to saturate in all layers of the network. Only initializations with $\sigma_U = 1$ led to stable activity propagation with a firing rate of $\approx 10$ Hz throughout the network.

We next investigated how impaired activity propagation influenced SG magnitudes. To that end, we recorded SG magnitudes at each hidden layer during training. In networks initialized with $\sigma_U = 0.5$ and $\sigma_U = 0.2$, in which spiking activity vanished in deep layers, each quiescent layer decreased SGs by approximately two orders of magnitude (figure 4(e)). As a result, the magnitude of weight updates in early layers decreased by several orders of magnitude, consistent with the numerical value of the surrogate derivative for neurons at rest (0.023 for β = 20; see methods). Moreover, weight updates vanished in deeper layers because of the lack of presynaptic activity (figure 4(f)). In contrast, initializations with $\sigma_U \geqslant 1$ led to relatively stable SG and weight update magnitudes across all layers (figures 4(e) and (f)). Notably, gradients were consistently one to two orders of magnitude smaller in networks initialized with $\sigma_U = 20$ compared to networks initialized with $\sigma_U = 1$ or $\sigma_U = 2$ (figures 4(e) and (f)).

In summary, the sensitivity to initialization in deep SNNs is caused by impaired activity propagation to deeper layers and associated vanishing SGs. Empirically, we found that only initializations with $\sigma_U \approx 1$ supported both propagation of sparse population activity and stable magnitudes of back-propagating SGs in deep networks.

Since the surrogate derivative used to compute SGs is to some extent freely tunable [5], one might argue that re-scaling it could provide a potential solution to vanishing SGs by ensuring stable gradient magnitudes during back-propagation (see methods). We tested this approach and found that a re-scaled SG can only prevent vanishing gradients in the absence of spiking at the cost of exploding gradients when the network does exhibit spiking which emerges over training (supplementary figures S8(a)–(c)). In strictly feed-forward networks, we found that the gradients were less prone to exploding, hence re-scaling the SG could potentially alleviate the problem of vanishing gradients (supplementary figures S8(d)–(f)) and therefore increase robustness to initialization. However, with increasing depth, exploding gradients would likely prevent successful training even in deep feed-forward SNNs.

Seeing that training of deep SNNs was sensitive to the magnitude of SGs [24], we speculated that the robustness to weight initialization we observed in three-layer CSNNs could be attributed to the use of an optimizer with per-parameter learning rates during training (see methods). To test this idea, we trained three-layer CSNNs initialized with different $\sigma_U$ either with this optimizer [25, 26] or with plain stochastic gradient descent (SGD), which applies a single global learning rate. We found that networks trained with SGD were indeed more sensitive to the fluctuation magnitude at initialization (supplementary figure S9). This effect was especially prominent in recurrent CSNNs.

Homeostatic plasticity increases robustness to initialization in deep SNNs. Because quiescent hidden layers are closely linked to vanishing SGs and thus prevent training in deep SNNs, a homeostatically maintained firing rate, as observed in biological neural networks [27–29], could rescue activity propagation and therefore enable training. To test this hypothesis, we implemented homeostatic plasticity as an additional regularization term in the loss function that sets a lower bound on the firing rate of each individual neuron [23] and thereby penalizes quiescent neurons (figure 5(a); see methods). We trained three-layer recurrent CSNNs on the SHD dataset, either with or without the additional homeostatic plasticity term. Indeed, homeostatic plasticity rescued training performance for networks initialized with $\sigma_U \ll 1$ (figure 5(b)).
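The homeostatic term can be sketched as the mirror image of an upper-bound activity regularizer: a per-neuron penalty that is nonzero only when a neuron fires below a minimum rate. The functional form in the methods is not reproduced here; the squared-hinge form, the function name, and the default lower bound of 1 Hz are assumptions chosen for illustration. In actual training, gradients of this term with respect to the weights would flow through the spikes via SGs.

```python
import numpy as np


def lower_bound_rate_penalty(spikes: np.ndarray, dt: float,
                             min_rate_hz: float = 1.0,
                             strength: float = 1.0) -> float:
    """Per-neuron penalty for firing below a minimum rate.

    spikes: binary spike array of shape (batch, time, neurons)
    dt:     simulation time step in seconds
    """
    duration = spikes.shape[1] * dt
    # Firing rate of each neuron, averaged over the batch (Hz).
    rates = spikes.sum(axis=1).mean(axis=0) / duration
    # Only neurons below the lower bound contribute (ASSUMED squared-hinge form).
    deficit = np.maximum(min_rate_hz - rates, 0.0)
    return float(strength * np.sum(deficit ** 2))


# Three silent neurons and one active neuron: only the silent ones are penalized.
example = np.zeros((2, 1000, 4))
example[:, ::100, 0] = 1.0  # neuron 0 fires at 10 Hz with dt = 1 ms
penalty = lower_bound_rate_penalty(example, dt=1e-3)
```

Unlike the population-level upper bound, this term acts on each neuron separately, so a single silent neuron is penalized even if the layer as a whole is active.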

Figure 5. Homeostatic plasticity increases the robustness to initialization in deep SNNs. (a) Illustration of the homeostatic activity mechanism as a firing rate regularizer. Homeostatic plasticity (green) prevents neurons from remaining silent by increasing the synaptic weights when the firing rate is low. In all our simulations, a complementary upper bound activity regularizer (gray) that acts on the population level prevents neurons from spiking incessantly. (b) Validation accuracy after training a deep convolutional SNN with three hidden layers on the SHD dataset as a function of $\sigma_U$. The colored line corresponds to networks trained with an active homeostatic plasticity mechanism. The black line corresponds to the baseline without homeostatic plasticity. The shaded region around the lines indicates the range of values across five random seeds. (c) As panel (b), for networks that were primed for 10 epochs with a homeostatic plasticity mechanism prior to supervised learning. During supervised learning, the homeostatic mechanism was inactive. (d) Test error of the five best-performing models in terms of validation accuracy for models trained with homeostatic plasticity, homeostatic priming, and the baseline model.

Next, we investigated whether homeostatic plasticity was necessary throughout the whole training period, or whether rescuing activity propagation before supervised training would be sufficient to enable learning. To this end, we developed a form of dynamic initialization for SNNs involving a homeostatic priming period before training. During the initial priming period, initialized networks were optimized solely on the homeostatic objective to nudge the spiking activity into a regime conducive to learning. After priming, we removed the homeostatic objective and started the supervised training period as usual. Like ongoing homeostatic plasticity, the homeostatic priming period was capable of rescuing learning for initializations with σU ≪ 1 (figure 5(c)). However, in rare cases, the network did not train after successful priming and the restored spiking activity vanished during training on the supervised loss function.
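
The difference between priming and ongoing homeostatic plasticity amounts to which loss terms are active in a given epoch, as the following sketch shows. The function names are hypothetical, and the default of 10 priming epochs follows the caption of figure 5(c).

```python
def objective_for_epoch(epoch, priming_epochs=10):
    """Name the objective optimized in a given (zero-based) epoch under priming."""
    return "homeostatic" if epoch < priming_epochs else "supervised"

def primed_loss(epoch, supervised_loss, homeostatic_loss, priming_epochs=10):
    """Homeostatic priming: homeostatic objective first, supervised loss after."""
    if objective_for_epoch(epoch, priming_epochs) == "homeostatic":
        return homeostatic_loss
    return supervised_loss

def ongoing_loss(supervised_loss, homeostatic_loss):
    """Ongoing homeostatic plasticity: both terms are active in every epoch."""
    return supervised_loss + homeostatic_loss

# Objective per epoch for a short hypothetical run of 12 epochs.
schedule = [objective_for_epoch(e) for e in range(12)]
```

Because the homeostatic term is dropped after priming, nothing prevents the supervised loss from silencing neurons again, which matches the rare failures described above.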

We wondered whether homeostatic plasticity affected the network’s generalization performance and thus compared the test error of networks trained with the proposed homeostatic mechanisms. Neither ongoing homeostatic plasticity nor homeostatic priming had a systematic effect on the test error (figure 5(d)). Therefore, we concluded that both biologically inspired homeostatic plasticity and homeostatic priming are effective strategies to increase the robustness toward initialization in deep SNNs without impairing their performance.

Deep SNNs with skip connections are more robust to initialization. In deep ANNs, skip connections are standard practice to facilitate optimization and improve training performance [13, 30, 31]. For instance, residual networks [31] use residual connections, a specific type of identity skip connection whereby the inputs are added directly to the output of a layer or block. We reasoned that residual connections are ill-defined in SNNs, as the spiking non-linearity would only allow adding spikes to the input spike train. Instead, we considered classic skip connections and asked whether they rescue spike propagation in deep CSNNs. We tested this idea in CSNNs with three hidden layers by implementing trainable skip connections between each hidden layer and the readout layer (supplementary figure S10(a); methods). Skip connections indeed increased robustness to initialization, with respect to both large σU > 10 and small σU ≪ 1 (supplementary figure S10(b)). However, generalization performance after training did not increase as a result of added skip connections (supplementary figure S10(d)). Notably, for initializations with small σU ≪ 1, optimized networks only propagated activity through the skip connection between the first hidden layer and the readout layer, effectively reducing the network to a single hidden layer. As skip connections did not prevent all layers from being quiescent in deep SNNs, we wondered whether homeostatic plasticity and skip connections complement each other and further increase performance for initializations with σU ≪ 1. Thus, we trained three-layer CSNNs with skip connections and ongoing homeostatic plasticity. Networks with combined skip connections and homeostatic plasticity also exhibited enhanced robustness to initialization but did not show a significantly better generalization performance (supplementary figures S10(c) and (d)). We concluded that skip connections are a viable approach to increase the robustness toward initializations with large σU > 10 in deep CSNNs, but are not able to compensate for vanishing gradients in deep layers when σU ≪ 1.
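
The skip architecture described above can be sketched for a single time step as follows. The dense weight matrices, layer sizes, and function names are hypothetical simplifications; the actual networks are convolutional and are specified in the methods.

```python
def matvec(weights, x):
    """Multiply a weight matrix (list of rows) with a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def readout_input(hidden_spikes, skip_weights):
    """Sum the contributions of every hidden layer to the readout units.
    Each hidden layer has its own trainable projection to the readout."""
    total = [0.0] * len(skip_weights[0])
    for spikes, weights in zip(hidden_spikes, skip_weights):
        total = [t + c for t, c in zip(total, matvec(weights, spikes))]
    return total

# Three hidden layers with three neurons each; layers 2 and 3 are quiescent.
hidden_spikes = [[1, 0, 1], [0, 0, 0], [0, 0, 0]]
projection = [[0.5, 0.1, -0.2], [0.0, 0.3, 0.4]]   # two readout units
skip_weights = [projection, projection, projection]

drive = readout_input(hidden_spikes, skip_weights)
```

Even with the deeper layers silent, the readout still receives input through the first skip connection. This mirrors the observation that networks initialized with small σU effectively operate with a single hidden layer.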

Fluctuation-driven initialization performs robustly across datasets. Together, our results suggest that traditional Kaiming initialization used for ANNs is sufficient for training three-layer CSNNs, but breaks down when training seven-layer or deeper CSNNs on the SHD dataset. In contrast, our proposed initialization strategy with the target fluctuation parameter set to σU = 1 yields close-to-optimal training performance in all three-, seven-, and even ten-layer networks. To directly compare fluctuation-driven and Kaiming initialization, we measured generalization performance in terms of test accuracy after training. As expected, we found only small differences in test accuracy for three-layer networks (figure 6(a); table 1).

Figure 6. Fluctuation-driven initialization enables training of deep SNNs across multiple datasets. (a) Test accuracy of three-layer CSNNs trained on the SHD dataset. Networks were initialized either with standard Kaiming initialization (Kaiming) or fluctuation-driven initialization with σU = 1. All error bars indicate standard deviation across five runs. (b) Test accuracy of seven-layer CSNNs trained on the SHD dataset. Networks with Kaiming initialization were additionally trained with ongoing homeostatic plasticity (Kaiming & Hom. plast.). (c) Same as panel (a), but for two-layer feed-forward CSNNs trained on the CIFAR-10 dataset. (d) Same as panel (b), but for four-layer feed-forward CSNNs trained on the CIFAR-10 dataset. (e) Same as panel (a), but for six-layer feed-forward CSNNs trained on the DVS-Gesture dataset. (f) Same as panel (b), but for eight-layer feed-forward CSNNs trained on the DVS-Gesture dataset.

Table 1. Test accuracy in percent after training networks with different numbers of hidden layers (nH) and different initializations (Kaiming, Kaiming with homeostatic plasticity, and fluctuation-driven initialization with σU = 1) on the SHD, CIFAR-10, and DVS-Gesture datasets. Errors correspond to the standard deviation. A hyphen marks configurations without a reported value.

Initialization    SHD          SHD          CIFAR-10     CIFAR-10     DVS-Gesture   DVS-Gesture
                  nH = 3       nH = 7       nH = 2       nH = 4       nH = 6        nH = 8
Kaiming           83.1 ± 1.2   4.5 ± 0.0    59.5 ± 0.8   10.0 ± 0.0   54.6 ± 37.1   9.1 ± 0.0
Kaiming & Hom.    -            77.0 ± 2.9   -            70.3 ± 0.9   -             82.3 ± 5.3
Fluct.-driven     82.7 ± 1.1   83.5 ± 1.3   62.4 ± 0.3   65.6 ± 1.3   86.7 ± 1.2    86.4 ± 1.7

Specifically, Kaiming-initialized three-layer networks achieved an average test accuracy of 83.1% ± 1.2 (validation accuracy: 95.9% ± 1.6), while the same networks initialized with our proposed strategy reached an average test accuracy of 82.7% ± 1.1 (validation accuracy: 94.1% ± 1.7). Seven-layer networks initialized with Kaiming initialization performed close to chance level after training (test accuracy: 4.5% ± 0.0; validation accuracy: 4.7% ± 0.5), while networks initialized with σU = 1 reached 83.5% ± 1.3 accuracy on the test set (figure 6(b); table 1; validation accuracy: 94.9% ± 1.0). As homeostatic plasticity was able to compensate for suboptimal initializations by rescuing activity propagation in three-layer CSNNs, we wondered whether these results extend to seven-layer networks. To this end, we trained Kaiming-initialized seven-layer CSNNs with ongoing homeostatic plasticity. Indeed, homeostatic plasticity rescued training, but test accuracy after training (test accuracy: 77.0% ± 3.0; validation accuracy: 95.0% ± 1.8) was lower than for networks initialized with σU = 1 that were trained without homeostatic plasticity (figure 6(b); table 1).

So far, we have limited our investigation of initialization dependence to deep CSNNs trained on the SHD dataset, which is relatively small and may thus be prone to overfitting. To test whether our findings would generalize to other tasks, we trained deep feed-forward CSNNs on two additional datasets from different input modalities. First, we considered CIFAR-10, a dataset consisting of static images. To translate static image input into spikes, we augmented the networks with an additional layer of simulated sensory neurons into which we injected the individual image pixel values as currents. Both bias currents and current gain were optimized end-to-end with all other network parameters (see methods). We then constructed deep CSNNs with increasing numbers of hidden layers (see table 6; methods). As before, networks were either initialized with traditional Kaiming initialization or with a target membrane potential fluctuation magnitude of σU = 1. We observed that networks with up to two hidden layers showed good training performance with both initializations (figure 6(c); table 1). When we increased the number of hidden layers to four, networks initialized with σU = 1 continued to show good training performance, while networks initialized with Kaiming initialization failed to train (figure 6(d); table 1). Training on CIFAR-10 with ongoing homeostatic plasticity was able to rescue learning in Kaiming-initialized SNNs with four hidden layers.
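
The current-based encoding of static pixel values described above can be sketched with a single simulated sensory neuron. The discrete-time LIF update, the time constants, and the parameter names below are assumptions made for illustration; in the study, gain and bias are trained end-to-end, whereas here they are fixed arguments.

```python
import math

def encode_pixel(pixel, gain=1.0, bias=0.0, n_steps=100, dt=1e-3,
                 tau_mem=10e-3, threshold=1.0):
    """Simulate one LIF sensory neuron driven by a constant input current
    and return its binary spike train (one entry per time step)."""
    decay = math.exp(-dt / tau_mem)      # membrane leak per time step
    current = gain * pixel + bias        # pixel value injected as a current
    u = 0.0                              # membrane potential
    spikes = []
    for _ in range(n_steps):
        u = decay * u + (1.0 - decay) * current
        if u >= threshold:
            spikes.append(1)
            u = 0.0                      # reset after a spike
        else:
            spikes.append(0)
    return spikes

dark = encode_pixel(0.0, gain=4.0)
dim = encode_pixel(0.5, gain=4.0)
bright = encode_pixel(1.0, gain=4.0)
```

Under these assumptions, brighter pixels produce higher firing rates, while pixels whose scaled current stays below threshold remain silent. This is why the gain and bias matter for propagating activity into the first hidden layer.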

Since CIFAR-10 is a still image dataset, which lacks temporal dynamics, it is less well suited for assessing SNN performance. To check whether our results generalize to other commonly used SNN datasets, we considered the DVS-Gesture dataset [32], which consists of short videos that depict humans performing different hand gestures. These videos were recorded using an event camera, yielding event-based outputs that can be used to train SNNs on the classification of the performed gestures. As before, we initialized deep CSNNs with an increasing number of hidden layers using either Kaiming initialization or a target σU = 1 and compared their test accuracy after training (see table 6; methods). We found that networks with up to six hidden layers could be successfully trained using either Kaiming or our proposed initialization (figure 6(e); table 1). However, in six-layer networks, initialization with σU = 1 yielded more reliable training performance and higher accuracy than Kaiming initialization. When we increased the number of hidden layers to eight, networks initialized with Kaiming initialization did not train successfully, while networks initialized with a target σU = 1 continued to show good learning performance (figure 6(f); table 1). As already observed on the SHD and CIFAR-10 datasets, training of Kaiming-initialized deep networks could be rescued by adding homeostatic plasticity during training.

Taken together, these findings reveal a clear pattern of initialization dependencies across datasets: up to a certain number of hidden layers, which is dataset dependent, Kaiming initialization yields good training performance in SNNs. However, when networks become too deep, vanishing SGs prevent training in networks with Kaiming initialization. In contrast, our proposed initialization strategy enables learning at high performance for deeper networks when the target fluctuation magnitude is set to σU = 1. As a complementary data-dependent strategy, homeostatic plasticity can be used to prevent vanishing gradients and rescue learning in deep networks that were initialized in a suboptimal regime.

2.3. Initializing SNNs that obey Dale’s law

Neurons in biological SNNs are separated into excitatory and inhibitory populations, a constraint commonly known as Dale’s law [33]. With added biological constraints, functional SNNs constitute an important in silico model system for computational neuroscience. To advance the development of biologically constrained SNNs, we extended our initialization theory to SNNs obeying Dale’s law (see methods), i.e., in which each hidden layer consists of recurrently connected but separate excitatory and inhibitory populations (figure 7(a)). At initialization, we require a balance between excitatory and inhibitory currents (μU = 0), as is commonly observed in biology [34, 35]. To accomplish such balance, we assume that excitatory and inhibitory synaptic weights are drawn from independent exponential distributions, whose mean values are set according to our theory to ensure the desired membrane potential dynamics (supplementary figure S11). This strategy allowed us to initialize Dalian networks with the same target σU as non-Dalian networks.
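
The balance constraint alone can be illustrated with a short sketch. It makes the simplifying assumption that all presynaptic neurons fire at the same rate, so that the mean input cancels when the summed mean excitatory and inhibitory weights are equal. It does not reproduce how the theory in the methods sets the mean values from the target σU; the function name and the value of `mean_exc` are hypothetical.

```python
import random

def init_dalian_weights(n_exc, n_inh, mean_exc, seed=0):
    """Draw sign-constrained input weights for one postsynaptic neuron.
    Excitatory weights are non-negative, inhibitory weights non-positive,
    and both magnitudes follow independent exponential distributions."""
    rng = random.Random(seed)
    # Balance under equal presynaptic rates: n_exc * mean_exc == n_inh * mean_inh
    mean_inh = mean_exc * n_exc / n_inh
    w_exc = [rng.expovariate(1.0 / mean_exc) for _ in range(n_exc)]
    w_inh = [-rng.expovariate(1.0 / mean_inh) for _ in range(n_inh)]
    return w_exc, w_inh, mean_inh

# Population sizes match the shallow Dalian network described in the text.
w_exc, w_inh, mean_inh = init_dalian_weights(n_exc=128, n_inh=32, mean_exc=0.01)
```

With four times fewer inhibitory neurons, the mean inhibitory weight must be four times larger in magnitude for the expected net input to vanish under this assumption.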

Figure 7. Initialization of Dalian SNNs in the fluctuation-driven regime. (a) Schematic of a shallow SNN obeying Dale’s law. Excitatory (red) and inhibitory (blue) populations are recurrently connected, but separate. (b) Snapshot of network activity over time after training a shallow SNN obeying Dale’s law on the SHD dataset. Bottom: spike raster of input layer activity from two samples corresponding to two different classes. Middle: spike raster of excitatory (red) and inhibitory (blue) activity in the hidden layer. Top: membrane potential of readout units. The readout units corresponding to the two input classes are highlighted in different shades. (c) Performance comparison of Dalian and non-Dalian shallow SNNs. Left: validation accuracy after training on the SHD dataset as a function of initialization target σU. The shaded region around the lines indicates the range of values across five random seeds. The sand-colored shaded region corresponds to our suggested target fluctuation magnitude 1 ⩽ ξ ⩽ 3. Right: test error of the five best-performing models in terms of validation accuracy, for Dalian and non-Dalian SNNs. Error bars mark ± one standard deviation. (d) As panel (c), for Dalian and non-Dalian three-layer CSNNs.

To test whether Dalian networks in the fluctuation-driven regime could be trained to high accuracy like their non-Dalian counterparts, we first considered fully connected recurrent Dalian SNNs with one hidden layer trained on the SHD dataset (see methods). Dalian networks initialized with σU = 1 accurately solved the SHD task after training for 200 epochs (99.8% ± 0.0 train and 82.2% ± 1.2 test accuracy; figure 7(b)). Next, to test the robustness to initialization in Dalian networks, we initialized Dalian SNNs with different targets σU and trained them on the SHD dataset. For direct comparison between Dalian SNNs and non-Dalian SNNs, we constructed SNNs with a total of nh = 160 hidden layer neurons, which were further split into nexc = 128 and ninh = 32 neurons for the Dalian case (see table 4; methods). After training, the Dalian networks exhibited similar robustness to initialization as non-Dalian networks (figure 7(c)). While we did not observe a large difference between Dalian and non-Dalian networks in validation accuracy, Dalian networks exhibited higher accuracy on the SHD test dataset. This result suggests that the separation into excitatory and inhibitory populations could provide a functionally beneficial constraint for SNNs with one recurrently connected hidden layer trained on the SHD dataset.

We wondered whether the better generalization performance of shallow Dalian SNNs would extend to deeper CSNN architectures. To address this question, we constructed Dalian CSNNs with three hidden layers (see methods). Again, networks were initialized with different targets σU and trained on the SHD dataset. We found that Dalian CSNNs with three hidden layers were more sensitive to initialization than their non-Dalian counterparts (figure 7(d)). However, when successfully trained, Dalian and non-Dalian CSNNs resulted in similar test accuracies.

In summary, our initialization strategy extends to Dalian SNNs with different network architectures and enables robust training on the SHD dataset. Unexpectedly, constraining networks with Dale’s law increased generalization accuracy by 7.1% in shallow networks. However, this effect did not generalize to deep CSNNs. Thus, initializing Dalian networks in the fluctuation-driven regime is beneficial for their training, and it will be interesting in future work to study whether and how these findings generalize to larger datasets.