
7.1 Evaluation metrics

This section discusses the metrics used to evaluate the model. The model evaluation (Algorithm 3) yields a Confusion Matrix (CM) that outlines the performance. A sample CM is presented in Table 5. Each row of the CM represents a class; True Positive (TP) is the number of attack instances correctly classified as attack; True Negative (TN) is the number of normal instances correctly classified as normal; False Positive (FP) is the number of normal instances wrongly classified as attack; False Negative (FN) is the number of attack instances wrongly classified as normal.

Table 5 Sample Confusion Matrix


The overall accuracy is calculated as shown in (5). The True Positive Rate (TPR) and False Negative Rate (FNR) for each class are given by (6) and (7) respectively; finally, the True Negative Rate (TNR) and False Positive Rate (FPR) are calculated using (8) and (9) respectively.

$$\text{Overall Accuracy} = \frac{TN + \sum_{i=1}^{4} TP_{ii}}{TN + \sum_{i=1}^{4}\sum_{j=1}^{4} TP_{ij} + \sum_{i=1}^{4} FP_{i} + \sum_{i=1}^{4} FN_{i}}$$

(5)

$$TPR_{i} = \frac{TP_{ii}}{FN_{i} + \sum_{j=1}^{4} TP_{ij}}$$

(6)

$$FNR_{i} = \frac{FN_{i}}{FN_{i} + \sum_{j=1}^{4} TP_{ij}}$$

(7)

$$TNR = \frac{TN}{TN + \sum_{i=1}^{4} FP_{i}}$$

(8)

$$FPR = \frac{\sum_{i=1}^{4} FP_{i}}{TN + \sum_{i=1}^{4} FP_{i}}$$

(9)
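For clarity, the sketch below shows how equations (5)-(9) can be computed from a confusion matrix laid out as in Table 5, assuming a NumPy array whose first row and column correspond to the Normal class and whose remaining four rows and columns correspond to the attack classes (rows are actual labels, columns are predictions); the layout assumption and the helper name are illustrative only.

```python
import numpy as np

def one_shot_metrics(cm: np.ndarray):
    """Equations (5)-(9) for a 5x5 CM: index 0 is Normal, indices 1-4 are attacks."""
    tn = cm[0, 0]           # normal correctly classified as normal
    fp = cm[0, 1:]          # FP_i: normal instances classified as attack i
    fn = cm[1:, 0]          # FN_i: attack i instances classified as normal
    tp = cm[1:, 1:]         # TP_ij: attack i instances classified as attack j

    overall_acc = (tn + np.trace(tp)) / (tn + tp.sum() + fp.sum() + fn.sum())  # (5)
    tpr = np.diag(tp) / (fn + tp.sum(axis=1))                                  # (6)
    fnr = fn / (fn + tp.sum(axis=1))                                           # (7)
    tnr = tn / (tn + fp.sum())                                                 # (8)
    fpr = fp.sum() / (tn + fp.sum())                                           # (9)
    return overall_acc, tpr, fnr, tnr, fpr
```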

7.2 Results

7.2.1 One excluded class

The optimised architectures of the ANNs used as the building blocks for the twin networks, including the number of hidden layers and neurons, are listed below. In each architecture, the first number is the size of the input layer, the last number is the size of the output layer of the Siamese Network before the similarity calculation, the numbers in between are the hidden layer sizes, and Dr(p) denotes a Dropout layer with rate p.

  • CICIDS2017: 31:25:Dr(0.1):20:Dr(0.05):15

  • NSL-KDD and KDD Cup’99: 118:98:Dr(0.1):79:Dr(0.1):59:Dr(0.1):39:Dr(0.1):20

The following lists the optimised hyper-parameters (a sketch of the resulting twin network is given after the list):

  • Activation function: ReLU

  • L2 regularisation: 0.001

  • Optimiser: Adam

  • Number of Epochs: 2000
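As an illustration, a minimal sketch of the CICIDS2017 twin network built with the architecture and hyper-parameters listed above is given below. Keras/TensorFlow is assumed, and the similarity head and loss shown here (absolute difference of the twin embeddings followed by a sigmoid unit trained with binary cross-entropy) are common one-shot choices used as assumptions, not necessarily the paper's exact formulation.

```python
import tensorflow as tf
from tensorflow.keras import Input, Model, layers, regularizers

def build_base_network(input_dim: int = 31) -> Model:
    """CICIDS2017 base network: 31:25:Dr(0.1):20:Dr(0.05):15, ReLU, L2 = 0.001."""
    reg = regularizers.l2(0.001)
    inputs = Input(shape=(input_dim,))
    x = layers.Dense(25, activation="relu", kernel_regularizer=reg)(inputs)
    x = layers.Dropout(0.1)(x)
    x = layers.Dense(20, activation="relu", kernel_regularizer=reg)(x)
    x = layers.Dropout(0.05)(x)
    embedding = layers.Dense(15, activation="relu", kernel_regularizer=reg)(x)
    return Model(inputs, embedding)

# The two twins share weights: the same base network embeds both inputs.
left, right = Input(shape=(31,)), Input(shape=(31,))
base = build_base_network(31)
diff = layers.Lambda(lambda t: tf.abs(t[0] - t[1]))([base(left), base(right)])
similarity = layers.Dense(1, activation="sigmoid")(diff)   # assumed similarity head
siamese = Model([left, right], similarity)
siamese.compile(optimizer="adam", loss="binary_crossentropy")
# siamese.fit([pairs_a, pairs_b], pair_labels, epochs=2000)  # 2000 epochs, as listed above
```

The NSL-KDD and KDD Cup’99 base network follows the same pattern with layer sizes 118:98:79:59:39:20 and Dr(0.1) after each hidden layer.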

The evaluation assesses how accurately the proposed network can classify both the classes used in training and new attack classes without the need for retraining. The model leverages similarity-based learning, and the new attack class is represented using a single sample to mimic the labelling process of new attacks.

For each dataset, multiple experiments are conducted. Specifically, K experiments are evaluated, where N is the number of classes and K the number of attack classes (K = N − 1, since there is a single normal class), in order to evaluate the performance of the Siamese Network when a different set of attack classes is used for training and evaluation. In each experiment, a separate attack class (e) is excluded from training, one at a time. The CM and the overall model accuracy are presented for each experiment.
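As an illustration of this protocol, the short sketch below enumerates the leave-one-attack-out experiments for the CICIDS2017 classes used in this section; the class names and dictionary layout are examples only.

```python
# Hypothetical sketch of the leave-one-attack-out protocol described above.
attack_classes = ["DoS (Hulk)", "DoS (Slowloris)", "FTP", "SSH"]  # CICIDS2017 example

experiments = []
for excluded in attack_classes:  # one experiment per attack class, K in total
    training_classes = ["Normal"] + [c for c in attack_classes if c != excluded]
    experiments.append({"train_on": training_classes, "new_attack": excluded})
```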

To evaluate the impact of the number of labelled samples (j) of the new attack class e on performance, results are reported in terms of overall accuracy, the new attack True Positive Rate (TPR) and False Negative Rate (FNR), and the normal class True Negative Rate (TNR) and False Positive Rate (FPR), using j instances for majority voting, where j ∈ {1, 5, 10, 15, 20, 25, 30}. The CMs use j = 5.
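The sketch below illustrates how a test instance can be labelled using j labelled representatives per class and majority voting, as described above. The exact pairing and voting scheme, as well as the names `siamese` and `representatives`, are assumptions for illustration rather than the paper's exact algorithm; it assumes the trained Siamese network outputs a similarity score for a pair of instances.

```python
import numpy as np

def classify_by_majority_vote(siamese, x, representatives, j=5):
    """representatives: dict mapping class name -> array of shape (>= j, n_features)."""
    votes = []
    for r in range(j):
        # similarity of x to the r-th labelled representative of each class
        scores = {
            cls: float(siamese.predict([x[None, :], reps[r][None, :]], verbose=0)[0, 0])
            for cls, reps in representatives.items()
        }
        votes.append(max(scores, key=scores.get))  # most similar class in this round
    return max(set(votes), key=votes.count)        # majority vote over the j rounds
```

With j = 1 a single (possibly unrepresentative) labelled instance decides the class, which explains the poorer performance reported below for one labelled instance.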

The CMs of the CICIDS2017 one-shot evaluation are presented in Table 6 (SSH excluded from training) and Table 7 (FTP excluded from training); the overall accuracy is 81.28% and 82.5% respectively. The results demonstrate the network's capability to adapt to the emergence of a new cyber-attack after training. It is important to note that the new attack class performance is 73.03% and 70.03% for SSH and FTP respectively. Moreover, the added class demonstrates low FNRs, specifically 8% and 15% for FTP and SSH respectively.

Table 6 CICIDS2017 One-Shot Confusion Matrix (SSH not in Training)


Table 7 CICIDS2017 One-Shot Confusion Matrix (FTP not in Training)


Additionally, for context, when a multi-class classification is performed using ANNs with all classes included in both training and testing, recent research reports SSH and FTP recall of 98% and 77% respectively (Hossain et al., 2020), while another study reports TPRs of 0% and 3.1% respectively (Vinayakumar et al., 2019). A one-to-one comparison is not practical, since in the proposed model classes are excluded from training, but these multi-class classification results provide context and show that the proposed model's results are in line with the literature. Furthermore, the evaluation of the model is not subject to the class-imbalance issue, as classes are equally represented in both training and testing batches.

Furthermore, inspection of Tables 8 and 9 shows that using five labelled instances of the new attack class results in an increase in both the overall accuracy and the TPR, together with a drop in the FNR. Using only one labelled instance yields comparatively poorer performance owing to the randomness of instance selection, which can result in either a good or a bad class representative. Using five random labelled instances, however, boosts performance, reinforcing the importance of having distinctive class representatives.

Table 8 CICIDS2017 One-Shot Accuracy (SSH not in Training) Using Different j Votes


Table 9 CICIDS2017 One-Shot Accuracy (FTP not in Training) Using Different j Votes


The remainder of the CICIDS2017 results are characterised by similar behaviour; the full evaluation tables are listed in Appendix A for transparency and reproducibility. DoS (Hulk) results are presented in Tables 16 and 17, where the TPR rises from 50.97% when using one pair to 72.82% when using 30 pairs. DoS (Slowloris) results are presented in Tables 18 and 19, where the TPR rises from 91.07% when using one pair to 95.18% when using 30 pairs.

The CMs of the KDD Cup’99 and NSL-KDD one-shot evaluations, excluding the DoS attack from training, are presented in Tables 10 and 11 respectively; the overall accuracies are 76.67% and 77.99%. It is important to note, however, that the FNRs for the new class (i.e. DoS) are 26.38% for the KDD Cup’99 and 9.87% for the NSL-KDD. In addition to the observations arising from the CICIDS2017 evaluation, these results highlight two further elements: (a) the Siamese Network did not find a high similarity between the new attack and the normal instances; (b) the new attack class TPR for the NSL-KDD is significantly higher than for the KDD Cup’99 (78.87% compared to 40.28%), because the NSL-KDD is an enhanced version of the KDD Cup’99 (filtered, with duplicate instances removed). Since the new class is not used in the training phase and the similarity is calculated from only a few instances, better-representative instances (i.e. the NSL-KDD instances) improve performance. The results confirm that new labelled instances need to be appropriate class representatives (Tables 12 and 13).

Table 10 KDD One-Shot Confusion Matrix (DoS Not in Training)


Table 11 NSL-KDD One-Shot Confusion Matrix (DoS Not in Training)


Table 12 KDD One-Shot Accuracy (DoS not in Training) Using Different j Votes


Likewise, for completeness, the remaining NSL-KDD and KDD Cup’99 results, which demonstrate similar performance, are listed as follows: excluding Probe, in Tables 20, 21, 26 and 27; excluding R2L, in Tables 24, 25, 30 and 31; and excluding U2R, in Tables 22, 23, 28 and 29.

7.2.2 Two excluded classes

A second experiment is conducted to further assess the performance of the model. Unlike the results in Section 7.2.1, three classes are used to train the network and two classes are excluded from training. The experiment is aimed at evaluating the robustness of the trained network in discriminating more than one new class without the need for re-training, in the scenario where only a few instances of each new class are available and until sufficient instances are gathered. The goal is to correctly classify and label new attacks, not just to discriminate them from benign/normal behaviour. When attacks are correctly classified, effective attack-specific countermeasures can be deployed.

Table 13 NSL-KDD One-Shot Accuracy (DoS not in Training) Using Different j Votes


Table 14 presents the confusion matrix when DoS (Hulk) and FTP are excluded from training. The detection accuracy is 69.13% and 86.42% for the DoS (Hulk) and FTP classes respectively; the FNRs of the new classes are 11.93% and 8%. It is important to note that the TPR increases and the FNR decreases as more instances of each class are used, as evident in Table 15, reaching an FNR of 9.6% and 7.78% and a TPR of 72.85% and 83.58% for the DoS (Hulk) and FTP attacks respectively.

Table 14 CICIDS2017 One-Shot Confusion Matrix (DoS (Hulk) & FTP Not in Training)


Table 15 CICIDS2017 One-Shot Accuracy (DoS (Hulk) & FTP not in Training) Using Different j Votes
