Visual explanations from spiking neural networks using inter-spike intervals | Scientific Reports
SNN-crafted Grad-CAM
Grad-CAM [25] highlights the regions of an image that contribute most to the classification result. Grad-CAM computes a backward gradient from the output classifier logits to a pre-defined target layer. A channel-wise contribution score is then obtained by global average pooling of these gradients. The final heatmap is defined as the sum of all feature maps (channels) weighted by their contribution scores. Unlike conventional ANNs, SNNs take spike trains as inputs across multiple time-steps. Therefore, we can compute multiple SNN-crafted Grad-CAM heatmaps, one for each of the T time-steps. Similar to Grad-CAM, we quantify the contribution of each channel by accumulating gradients across all time-steps:
$$\begin{aligned} \alpha ^{c, k} = \frac{1}{N}\sum _i\sum _j\sum _t \frac{\partial y^c}{\partial A^k_{ij, t}}. \end{aligned}$$
(1)
Here, \(y^c\) is the output logit for class c, N is a normalization factor, \(A^k_{ij, t}\) is the spike activation value of the kth channel at time-step t, and (i, j) is the pixel location. Note that we use the ground truth label c of a given image to compute the heatmap. The channel-wise weighted sum of spike activations is then calculated as:
$$\begin{aligned} G^c_{ij,t} = \max \left( 0, \sum _{k}\alpha ^{c,k}A^k_{ij,t} \right) . \end{aligned}$$
(2)
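Eqs. 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the gradients of the class score with respect to the spike activations have already been obtained (for example from an autograd framework) and are stored as nested lists indexed `[t][k][i][j]`. The function names and toy values are hypothetical.

```python
def channel_weights(grads, norm):
    """Eq. 1: accumulate d y^c / d A^k_{ij,t} over pixels and time-steps.

    grads[t][k][i][j] is the gradient of the class score with respect to the
    spike activation; norm is the normalization factor N.
    """
    num_channels = len(grads[0])
    alpha = [0.0] * num_channels
    for grads_t in grads:
        for k, channel in enumerate(grads_t):
            alpha[k] += sum(sum(row) for row in channel)
    return [a / norm for a in alpha]


def snn_grad_cam(acts, alpha):
    """Eq. 2: ReLU of the channel-weighted sum of activations, one heatmap per time-step."""
    heatmaps = []
    for acts_t in acts:
        height, width = len(acts_t[0]), len(acts_t[0][0])
        heatmaps.append([
            [max(0.0, sum(alpha[k] * acts_t[k][i][j] for k in range(len(alpha))))
             for j in range(width)]
            for i in range(height)
        ])
    return heatmaps


# Toy input: T = 2 time-steps, K = 2 channels, 1 x 2 pixels (made-up values).
grads = [[[[1.0, 1.0]], [[-1.0, 0.0]]],
         [[[1.0, 1.0]], [[-1.0, 0.0]]]]
acts = [[[[1, 0]], [[1, 1]]],
        [[[0, 0]], [[0, 1]]]]

alpha = channel_weights(grads, norm=4)  # N = 4 is an arbitrary choice here
heatmaps = snn_grad_cam(acts, alpha)
```

In this toy case the second channel receives a negative weight, so pixels where only that channel spikes are clipped to zero by the `max`.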
For a clear comparison with conventional ANN-based Grad-CAM, we refer to \(G^c_{ij,t}\) as “SNN-crafted Grad-CAM” in the remainder of the paper. It is worth mentioning that we convert a static image to temporal spike trains using Poisson rate coding (see “Methods” for details).
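For readers unfamiliar with rate coding, the snippet below shows one common formulation: at every time-step, each pixel fires with probability equal to its normalized intensity. The exact encoding used in the paper is described in its Methods and is not reproduced here, so the function name, the `[0, 1]` intensity range, and the seeding scheme are assumptions made for illustration.

```python
import random


def poisson_encode(image, num_steps, seed=0):
    """Return spikes[t][i][j] in {0, 1}.

    A pixel with intensity p in [0, 1] fires with probability p at each
    time-step (an assumed, commonly used rate-coding scheme).
    """
    rng = random.Random(seed)
    return [[[1 if rng.random() < p else 0 for p in row] for row in image]
            for _ in range(num_steps)]


image = [[0.0, 1.0, 0.5]]  # toy 1 x 3 "image"
spikes = poisson_encode(image, num_steps=2000)
rate_mid = sum(s[0][2] for s in spikes) / len(spikes)  # empirical rate of the 0.5 pixel
```

A black pixel never fires, a white pixel fires at every step, and the empirical firing rate of intermediate pixels approaches their intensity as the number of time-steps grows.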
SNN-crafted Grad-CAM suffers from what we term a “heatmap smoothing effect” caused by the approximated backward gradient function. To visualize the heatmap at shallow (initial) layers, the gradients from the output need to pass through multiple layers using the approximated backward function (see Supplementary Note 1). The accumulated approximation error yields a non-discriminative heatmap, as shown in Fig. 2a. Note that the beginning and ending time-steps have little spike activity [20], resulting in heatmaps with zero values. To validate the “heatmap smoothing effect” quantitatively, we compute the pixel-wise variance of the heatmap: a heatmap containing non-discriminative information (i.e., similar pixel values) should have lower variance. In Fig. 2b, SNN-crafted Grad-CAM shows lower variance than our proposed SAM (discussed in the next section). Because SNN visualization produces multiple heatmaps (i.e., one heatmap per time-step), we use the maximum variance value across all time-steps for Fig. 2b. Further, we note that the heatmaps of both SAM and SNN-crafted Grad-CAM in Fig. 2a vary across time-steps, underlining the fact that the SNN looks at different regions of the same input over time to make a prediction. Overall, a visualization tool for SNNs requires a new perspective that can circumvent the error accumulation problem of approximate gradients or backpropagation. In all our experiments, we use a VGG11 [1] SNN architecture based on LIF neurons to perform image classification on the complex Tiny-ImageNet dataset, i.e., a subset of the ImageNet dataset [38] (see Supplementary Table 1 for detailed information on the network architecture and dataset).
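The variance measure described above can be illustrated as follows. This is a small sketch under stated assumptions: heatmaps are plain nested lists, the population variance is used, and no normalization is applied beforehand (the text does not specify any). The function names and toy heatmaps are hypothetical.

```python
def pixel_variance(heatmap):
    """Population variance of all pixel values in one heatmap[i][j]."""
    values = [v for row in heatmap for v in row]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def max_variance_over_time(heatmaps):
    """Summarize a sequence heatmaps[t][i][j] by its largest per-time-step variance."""
    return max(pixel_variance(h) for h in heatmaps)


smooth = [[0.5, 0.5], [0.5, 0.5]]  # non-discriminative: all pixels alike
peaked = [[0.0, 0.0], [0.0, 1.0]]  # discriminative: one salient pixel

var_smooth = pixel_variance(smooth)
var_peaked = pixel_variance(peaked)
var_sequence = max_variance_over_time([smooth, peaked, smooth])
```

The smoothed heatmap has zero variance while the peaked one does not, which is the property the comparison in Fig. 2b relies on.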
Spike activation map (SAM)
SAM is a new paradigm for bio-plausible visualization of SNNs. We do not need to use any class label or perform backpropagation to calculate gradients. SAM only uses the spike activity in forward propagation to calculate heatmaps. Thus, this visualization is not just for a specific class but highlights the regions that the network focuses on for any given image. Surprisingly, we observe that SAM shows meaningful visualization even without any ground truth labels (Fig. 2a). Mathematically, our objective can be formulated as finding a mapping function \(f(\cdot )\):
$$\begin{aligned} M_{t} \leftarrow f(S_{0}, S_{1}, \ldots , S_{t-1}), \end{aligned}$$
(3)
where \(M_{t}\) is the SAM and \(S_{t}\) is the spike activity at time-step t. We leverage the biological observation that spikes with short inter-spike intervals (ISIs) contribute strongly to the neural decision process [32,33,34]. This is because short-ISI spikes are more likely to stimulate post-synaptic neurons, conveying more information [33,39,40]. To apply this to our visualization method, we first define the temporal spike contribution score (TSCS). For a given neuron, TSCS evaluates the contribution of a previous spike at time \(t'\) with respect to the current time t. It is natural that the contribution of a previous spike to the current neuronal state decreases as time progresses. Therefore, the TSCS value can be formulated as:
$$\begin{aligned} T(t, t') = \exp (- \gamma |t-t'|), \end{aligned}$$
(4)
where \(\gamma \) is a hyperparameter that controls the steepness of the exponential kernel function.
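Eq. 4 translates directly into code. The sketch below is only an illustration; the function name and the example times are hypothetical, and time is measured in discrete time-steps.

```python
import math


def tscs(t, t_prev, gamma):
    """Eq. 4: temporal spike contribution score of a spike at t_prev, seen from time t."""
    return math.exp(-gamma * abs(t - t_prev))


recent = tscs(10, 8, gamma=0.5)  # spike 2 steps ago
old = tscs(10, 2, gamma=0.5)     # spike 8 steps ago
flat = tscs(10, 2, gamma=0.0)    # gamma = 0 removes the temporal decay
```

A recent spike scores higher than an old one, and setting `gamma` to zero makes every previous spike count equally.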
Figure 3
Overall process of SAM. (a) Illustration of spike activation map (SAM). We illustrate an intermediate feature tensor with C channels, height H, and width W. For each channel, we compute a neuron-wise contribution score. After that, we sum all neuronal contribution scores (NCS) along the channel axis to obtain the SAM heatmap. (b) The NCS for each neuron is based on the previous spike trajectory. For every spike, we define a temporal spike contribution score (TSCS) with an exponential kernel. We take into account the TSCS from previous spikes in order to compute the NCS. Thus, the NCS shows a high value when more spikes exist in a short time window.
Figure 4
Visualization of SAM. (a) Original images. (b) Grad-CAM results on reference ANN. (c) SAM results on SNN trained with surrogate gradient. (d) SAM results on SNN trained with ANN–SNN conversion. We visualize the internal spike representation of VGG11 using SAM at layer 4, layer 6, and layer 8. We show the visualization for 10 uniformly sampled time-steps. It is worth mentioning that Grad-CAM exploits ground truth labels, whereas our SAM can be obtained without any label information. Here, the networks are trained on the Tiny-ImageNet dataset. We provide more visualization results in Supplementary Figs. 2–7.
Figure 5
Localization error of SAM. We measure the localization error (i.e., the difference between SAM and Grad-CAM) in various SNN training configurations (see “Methods”). For all experiments, we train a VGG11 network on the Tiny-ImageNet dataset. (a) Localization error at layer 4 (top row), layer 6 (middle row), and layer 8 (bottom row) with respect to hyperparameter \(\gamma \). The results show that a zero \(\gamma \) value does not consider the temporally evolving characteristics of SNNs and results in the highest localization error. (b) Illustration of the normalized number of spikes over time. The spike activity with surrogate training shows a Gaussian-like trend. On the other hand, the conversion approach yields nearly constant values after time-step 50. The characteristics of the activity are related to the visualization results in Fig. 4. (c) Localization error comparison across different layers. The localization error increases with deeper layers since the visualization tool focuses on more selective information in deep layers. For all layers, the conversion method shows a higher localization error than the surrogate learning method.
Figure 6
Visualizing robustness and sensory suppression behavior of SNN with SAM. We use VGG11 networks with the Tiny-ImageNet dataset. (a) Visualization of robustness with SAM. We show the Grad-CAM and SAM results for clean and adversarial images. The heatmap from the SNN with SAM shows less variation than its ANN counterpart (see Supplementary Fig. 8 for additional visualization results). (b) Classification accuracy with respect to varying attack strengths of the Fast Gradient Sign Method (FGSM) attack. We compute the normalized L1 distance between heatmaps for clean inputs X and adversarial inputs \(X_{Adv}\) at \(\epsilon = \frac{4}{255}\). For the SNN, we report the maximum L1 distance across multiple time-steps. (c) Visualization of SAM for multi-object images. We concatenate two images vertically and visualize the region that the network focuses on. Note that since we use Global Average Pooling after the convolutional feature extractor, the network can make predictions regardless of the input image resolution. The network attends to one of the two objects at the end of the time-steps. We also provide the probabilities of the two classes predicted by the output classifier of the VGG11 model across time.
To consider multiple previous spikes, we define a set \(P_{ij}^k\) that consists of the previous firing times of the neuron at location (i, j) in the kth channel. For every time-step t, we compute a neuronal contribution score (NCS) \(N^k_{ij,t}\) by summing the TSCS values of all previous spikes in \(P_{ij}^k\):
$$\begin{aligned} N^k_{ij,t} = \sum _{t' \in P_{ij}^k} T(t, t'). \end{aligned}$$
(5)
Thus, a neuron has a high NCS if it spikes frequently over a short time interval, and vice versa. Finally, we calculate the SAM heatmap \(M_{ij,t}\) at time-step t and location (i, j) by multiplying the spike activity \(S^k_{ij,t}\) with the NCS value \(N^k_{ij,t}\) and summing across all channels k:
$$\begin{aligned} M_{ij,t} = \sum _{k} N_{ij,t}^k S^k_{ij,t}. \end{aligned}$$
(6)
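Putting Eqs. 4 to 6 together, a SAM heatmap can be computed from forward spike activity alone. The following is a minimal sketch rather than the authors' code: it assumes binary spikes stored as nested lists indexed `[t][k][i][j]`, and it takes \(P_{ij}^k\) to contain only firing times strictly before t. The function name and the toy spike train are hypothetical.

```python
import math


def spike_activation_map(spikes, gamma):
    """Return sam[t][i][j] from binary spikes[t][k][i][j] (Eqs. 4-6)."""
    num_steps = len(spikes)
    num_channels = len(spikes[0])
    height, width = len(spikes[0][0]), len(spikes[0][0][0])
    sam = [[[0.0] * width for _ in range(height)] for _ in range(num_steps)]
    for k in range(num_channels):
        for i in range(height):
            for j in range(width):
                previous = []  # P^k_ij: earlier firing times (assumed strictly before t)
                for t in range(num_steps):
                    if spikes[t][k][i][j]:
                        # Eq. 5: NCS is the sum of TSCS over previous spikes.
                        ncs = sum(math.exp(-gamma * (t - tp)) for tp in previous)
                        # Eq. 6: S = 1 here, so this neuron adds its NCS to the map.
                        sam[t][i][j] += ncs
                        previous.append(t)
    return sam


# Two channels, a single pixel, four time-steps (made-up spike train).
# Channel 0 fires at t = 0, 1, 3; channel 1 fires at t = 2, 3.
spikes = [[[[1]], [[0]]],
          [[[1]], [[0]]],
          [[[0]], [[1]]],
          [[[1]], [[1]]]]
sam = spike_activation_map(spikes, gamma=math.log(2))  # kernel halves every step
```

Neurons that are silent at time-step t contribute nothing because \(S^k_{ij,t} = 0\), which is why the sketch only evaluates the NCS when a spike occurs.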
We illustrate the overall flow of SAM in Fig. 3a. For every neuron, we compute the NCS and add the values across the channel axis in order to obtain the SAM. In Fig. 3b, we depict two examples (case A and case B) for calculating the NCS. In case A, the previous spikes occur at time-steps \(t_{p1}\) and \(t_{p2}\) that are considerably earlier than the current spike time t. As a result, the contribution of the previous spikes is small due to the exponential kernel. On the other hand, in case B, \(t_{p1}\) and \(t_{p2}\) are close to the current spike time t. In this case, the neuron has a high NCS value.
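The contrast between case A and case B can be made concrete with numbers. The spike times and the value of \(\gamma \) below are invented for illustration and are not taken from Fig. 3b.

```python
import math


def ncs(t, previous_times, gamma):
    """Eq. 5: sum of exponential-kernel contributions of previous spikes."""
    return sum(math.exp(-gamma * abs(t - tp)) for tp in previous_times)


gamma = 0.5
t = 20
case_a = ncs(t, [2, 5], gamma)    # previous spikes long before t
case_b = ncs(t, [17, 19], gamma)  # previous spikes shortly before t
```

With these values, case A yields an NCS below 0.001 while case B yields roughly 0.83, so the same number of previous spikes produces very different scores depending on the inter-spike interval.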
In Fig. 4, we visualize the qualitative results of SAM on SNNs trained with surrogate learning (Fig. 4c) as well as ANN–SNN conversion (Fig. 4d). We also show the Grad-CAM visualization obtained from a corresponding ANN for reference (Fig. 4b). Note that SAM does not require any class label, while Grad-CAM uses ground truth labels to create heatmaps. Interestingly, heatmaps obtained from SAM across different time-steps on SNNs show results similar to Grad-CAM on ANNs. The region of interest in SAM is highlighted in a discriminative fashion. This supports our assertion that SAM is an effective visualization tool for SNNs. Moreover, the results imply that ISI and temporal dynamics can yield interpretability for deep SNNs. So far, no studies have analysed the underlying information learnt in different layers of an SNN. It has always been assumed that SNNs, like ANNs, learn features in a generic-to-specific manner as we go deeper. For the first time, we visualize the explanations at intermediate layers of SNNs to support this assumption. Interestingly, with surrogate learning, the SAM visualization (Fig. 4c) shows that shallow layers of SNNs represent low-level structure and deep layers focus on semantic information. For example, layer 4 highlights the edges or blobs of the lion, such as the eyes and nose. On the other hand, layer 8 highlights the full face of the lion. More visualization results are provided in Supplementary Figs. 2–7.
Further, we conduct ablation studies to understand the effect of the hyperparameter \(\gamma \) in Eq. 4 on SAM. The \(\gamma \) value decides the steepness of the exponential kernel function in TSCS. A kernel with high \(\gamma \) takes into account only very recent spike history, whereas low \(\gamma \) considers a longer spike history. In Fig. 5a, we visualize the localization error with respect to \(\gamma \) for different layers in the VGG11 SNN for the conversion and surrogate gradient training methods. For both methods, \(\gamma = 0\) shows the highest localization error since the kernel does not filter redundant and irrelevant long-ISI spikes. Another interesting observation is that the localization error increases for large \(\gamma \) values (e.g., 1.0). This is because a high \(\gamma \) considers only very recent spikes and largely ignores the spike history, which limits reliable visualization.
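The two extremes of \(\gamma \) can be read off from Eqs. 4 and 5. In the short derivation below, \(|P^k_{ij}|\) denotes the number of previous spikes and \(t'_{\max }\) the most recent previous firing time; both symbols are introduced here for illustration only.

$$\begin{aligned} \gamma = 0:&\quad T(t, t') = 1 \;\Rightarrow \; N^k_{ij,t} = |P^k_{ij}|, \\ \gamma \gg 1:&\quad N^k_{ij,t} \approx \exp (-\gamma |t - t'_{\max }|). \end{aligned}$$

With \(\gamma = 0\) the NCS reduces to a plain spike count that ignores timing, whereas for large \(\gamma \) it is dominated by the single most recent spike and the remaining history is effectively discarded.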
Comparison between surrogate gradient learning and conversion
We compare the SAM visualization results of surrogate gradient learning and ANN–SNN conversion in Fig. 4c, d. From the figure, we observe a trend in the heatmap visualization of surrogate gradient learning: zero activity at early time-steps, discriminative activity in the mid-range, and zero activity again towards the end. In contrast, conversion maintains similar heatmaps during the entire time period. This is related to the variation in spike activity at each time-step, as shown in Fig. 5b. Since surrogate gradient learning considers temporal dynamics during training [6,20], each layer passes the information (i.e., the number of spikes) continuously over time. On the other hand, conversion does not show any temporal propagation (see Supplementary Fig. 1 for a more detailed explanation). Moreover, we observe that surrogate gradient learning has more accurate (i.e., similar to Grad-CAM from the ANN) heatmaps highlighting the region of interest across all layers. Notably, the conversion method highlights only partial regions of the object (e.g., lemon) and in some cases (e.g., bird) the wrong region. This observation is supported by the localization error comparison in Fig. 5c. For all layers, surrogate gradient learning shows a lower localization error. Conversion methods do not account for any temporal dynamics during training [6]. We believe that this missing temporal dependence accounts for the lower interpretability. Thus, we assert that SNNs obtained with surrogate gradient learning (incorporating temporal dynamics) are more interpretable. Therefore, all visualization analyses in the following subsections focus on the surrogate gradient learning method.
Adversarial robustness of SNN
Unlike the human visual system, neural networks are vulnerable to adversarial attacks. These attacks are created by adding small yet structured perturbations to the input image [37]. Previous studies [36,41] have asserted that SNNs trained with surrogate gradients are more robust to adversarial inputs than ANNs. In order to show the effectiveness of SNNs under adversarial noise attack, we conduct a qualitative and quantitative comparison between Grad-CAM and SAM. We attack both the ANN and the SNN using the Fast Gradient Sign Method (FGSM) attack [37] and the SNN-crafted FGSM attack [36] with \(\epsilon = \frac{4}{255}\) (see “Methods” and Supplementary Note 3 for implementation details). In Fig. 6a, we observe that Grad-CAM shows a large change in visualization before and after attacking the ANN. In fact, the ANN after the attack starts focusing on random parts of the image and therefore misclassifies the adversarial inputs. On the other hand, SAM shows similar results before and after the attack. Interestingly, we observe that under adversarial attack SAM changes slightly at the earlier time-steps with respect to the clean input visualization. But, as time progresses, the visualizations for the adversarial and clean inputs look similar, highlighting suitable regions of interest. This implies that the temporal processing in SNNs enables compensation and correction of noise in the input. We surmise that accumulating temporal information in SNNs imparts robustness to the system. Further, we show the classification accuracy with respect to the attack intensity, and the normalized L1 distance between heatmaps of clean and adversarial images at \(\epsilon = \frac{4}{255}\), in Fig. 6b. The results show that the SNN is more robust than the ANN in terms of both accuracy and visualization (see Supplementary Fig. 9 for additional experiments).
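The two ingredients of this comparison, the FGSM perturbation and a distance between heatmaps, can be sketched as follows. The FGSM step uses the standard formulation (clean input plus \(\epsilon \) times the sign of the loss gradient, clipped to the valid pixel range). The distance is taken here as the mean absolute pixel difference, which is an assumption: the paper's exact normalization is given in its Methods and is not reproduced here. All function names and toy values are hypothetical, and the loss gradient is assumed to be supplied by the training framework.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def fgsm(image, grad, epsilon):
    """Standard FGSM step on image[i][j] in [0, 1], given the loss gradient grad[i][j]."""
    return [[min(1.0, max(0.0, x + epsilon * sign(g))) for x, g in zip(x_row, g_row)]
            for x_row, g_row in zip(image, grad)]


def normalized_l1(heatmap_a, heatmap_b):
    """Mean absolute pixel difference (assumed form of the normalized L1 distance)."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(heatmap_a, heatmap_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def max_l1_over_time(clean_maps, adv_maps):
    """For an SNN: report the largest per-time-step distance."""
    return max(normalized_l1(c, a) for c, a in zip(clean_maps, adv_maps))


epsilon = 4 / 255
adv = fgsm([[0.5, 0.0, 1.0]], [[2.0, -3.0, 0.7]], epsilon)

clean_maps = [[[0.0, 1.0]], [[1.0, 1.0]]]  # toy heatmaps for two time-steps
adv_maps = [[[0.5, 1.0]], [[1.0, 0.0]]]
distance = max_l1_over_time(clean_maps, adv_maps)
```

Taking the maximum over time-steps gives the SNN the least favorable comparison, since a single strongly perturbed heatmap determines its reported distance.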
It is a widely adopted intuition that surrogate learning is more interpretable since it inherits better temporal dynamics. Similarly, it is a widely accepted notion that temporal SNNs are more resilient to adversarial attacks than ANNs. However, with SAM, for the first time, we are able to prove and explain these intuitions using visual explanations. Thus, SAM is a gateway to interpretable neuromorphic computing. For instance, SAM can enable SNN deployment for secure and intelligent systems (e.g., military defense) where robustness and interpretability (to gain the user’s trust in the prediction made by the model) are paramount.
Sensory suppression behavior of SNN
Neuroscience studies have suggested that the human brain undergoes “sensory suppression” [42,43,44]. That is, the brain focuses on one of multiple objects when these objects are presented at the same time. Coincidentally, with SAM, we observe that SNNs also emulate sensory suppression when presented with multiple objects. To show this, we concatenate two randomly chosen images from the Tiny-ImageNet dataset and pass the concatenated image into the SNN trained with surrogate gradient learning. Interestingly, as shown in Fig. 6c, neurons compete in the earlier time-steps for attending to both objects and finally attend to only one object at later time-steps. Note that, for each image, the final prediction from the SNN matches the final heatmap shown by SAM. For each time-step, we also visualize the confidences of the two classes in the last layer (i.e., the classifier). The confidence for each object also varies with the area attended by the network. These results reveal the bio-plausible characteristics of SNNs and further establish SAM as a suitable interpretation tool (Supplementary Fig. 10 provides more examples).
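The multi-object experiment depends on the network accepting an input twice as tall as the training images. The caption of Fig. 6 attributes this to global average pooling after the convolutional feature extractor. The toy sketch below illustrates why: pooling collapses any spatial size to one value per channel, so the classifier input keeps the same length. Images are plain lists of pixel rows, and the function names are hypothetical.

```python
def concat_vertical(top, bottom):
    """Stack two images (lists of pixel rows of equal width), top over bottom."""
    if len(top[0]) != len(bottom[0]):
        raise ValueError("images must have the same width")
    return top + bottom


def global_average_pool(feature_maps):
    """Collapse feature_maps[k][i][j] to one average value per channel."""
    return [sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
            for fmap in feature_maps]


image_a = [[1.0, 1.0], [1.0, 1.0]]
image_b = [[3.0, 3.0], [3.0, 3.0]]
stacked = concat_vertical(image_a, image_b)  # 4 x 2 instead of 2 x 2

pooled_single = global_average_pool([image_a])
pooled_stacked = global_average_pool([stacked])
```

Both pooled outputs have one entry per channel, even though the stacked input has twice as many rows.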