Constrained transformer network for ECG signal processing and arrhythmia classification

Problem statement and data processing

Before discussing the methodology, we clarify the diagnosis problem addressed in this paper and describe the characteristics of the data.

Problem statement

The ECG data used in this paper were acquired from a cardiology challenge [25]; they were collected from 11 hospitals and cover a total of 6877 individuals. The data have been de-identified and comprise 3699 records for men and 3178 for women. Signal durations range from 6 to 60 s, with an average of 15.79 s. The data were recorded with a 12-lead ECG at a sampling frequency of 500 Hz. The 12 waveforms of an ECG signal sample are presented in Fig. 1.

Each sample has a label for its category. There are nine categories in total: one normal rhythm and eight types of abnormality. The data category descriptions are shown in Table 1. The main problem to be solved in this work can be formulated as follows: given 12-lead ECG signal data, the data are segmented through a time window and fed into a model for learning, and the classification scores for the nine categories are finally obtained using the classification model.

Table 1 Data category description


Data pre-processing

Noise is inevitable when collecting ECG signals; it includes baseline drift and high-frequency noise. There are many ways to de-noise ECG signals, such as designing high-pass or median filters to eliminate baseline drift. In this paper, we apply the difference method and the wavelet transform to improve the quality of the signal. For the abnormal values that appear in the ECG signals, we found that they have considerably larger voltage values than normal signals, so we use the difference method to remove them. First, we set threshold values after traversing the complete ECG signals, and replace each abnormal value with the threshold value whenever its voltage exceeds the threshold. This yields ECG data with no abnormal values. For the ECG signals containing noise, we performed a six-level wavelet decomposition using the bior2.6 wavelet function to obtain the detail and approximation coefficients of each layer. EMG interference noise is concentrated in the high-frequency components of the first decomposition layers, while baseline-offset noise is concentrated in the low-frequency components of the sixth layer. Therefore, we set all the detail coefficients of the first and second layers to 0, set the approximation coefficients of the sixth layer to 0, and finally reconstruct the signal layer by layer. After reconstruction, we obtain an ECG signal without outliers and noise. The combination of the difference method and the wavelet transform eliminates both noise interference and outliers.
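
For illustration, the following is a minimal sketch of this pipeline in Python using NumPy and PyWavelets; the symmetric clipping threshold is an assumed placeholder, since the text does not state how the threshold values were chosen.

```python
import numpy as np
import pywt

def denoise_lead(sig, threshold=3.0):
    """Outlier removal by thresholding, then 6-level bior2.6 wavelet
    denoising, following the procedure described above."""
    # Difference-method step: voltages whose magnitude exceeds the
    # threshold are replaced by the threshold value (assumed symmetric).
    sig = np.clip(sig, -threshold, threshold)

    # Six-level decomposition: coeffs = [cA6, cD6, cD5, ..., cD1].
    coeffs = pywt.wavedec(sig, 'bior2.6', level=6)

    # Baseline drift: zero the sixth-layer approximation coefficients.
    coeffs[0] = np.zeros_like(coeffs[0])
    # EMG noise: zero the first- and second-layer detail coefficients.
    coeffs[-1] = np.zeros_like(coeffs[-1])
    coeffs[-2] = np.zeros_like(coeffs[-2])

    # Reconstruct the cleaned signal layer by layer.
    return pywt.waverec(coeffs, 'bior2.6')[:len(sig)]
```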

Since the lengths of the ECG signals are not equal, we split them into fixed-length segments according to a given window size and step size. The window size is chosen so that each window contains a complete regular heartbeat. All experimental parameters are given in the “Experimental settings” section.
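
A minimal sketch of this segmentation is given below; the window and step values are placeholders for the parameters listed in “Experimental settings”.

```python
import numpy as np

def segment(record, win, step):
    """Split a (leads, samples) ECG record into fixed-length windows."""
    n = record.shape[-1]
    starts = range(0, n - win + 1, step)
    return np.stack([record[..., s:s + win] for s in starts])

# Example: a 12-lead, 10-s record at 500 Hz with assumed window and
# step sizes; the result has shape (3, 12, 2500).
record = np.random.randn(12, 5000)
segments = segment(record, win=2500, step=1250)
```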

Model architecture

A new end-to-end model for ECG classification was designed that combines the advantages of CNNs and transformer networks. The architecture of the proposed model, which is designed to handle variable-length 12-lead ECG data, is shown in Fig. 2. An ECG record is divided into equal-length signal segments according to the window size and step size given in the pre-processing stage. The 12-lead data are then passed to the CNN to capture the hidden deep features in the ECG signal.

Fig. 2 Architecture of the proposed model

Next, a linear network structure is used to further capture feature information, which is then sent to the transformer network in the form \(X_{cnn}=[x[0],x[1],\ldots ,x[n]]\). The transformer network outputs the embedding vector \(X_{embed}\) of the input ECG signal, which is finally fed into the classification layer to obtain the class probabilities of the ECG signal.

As shown in Fig. 2, the proposed model consists of four components: (1) the link constraints, (2) feature-extraction units, (3) transformer network, and (4) classification layers.

Link constraints

To improve the quality of the embedding features for the downstream task, the following assumption about the embedding features is made.

If the correlation coefficient between the embedding features of two samples is large (\(\max = 1\)), indicating positive correlation, the classifier will predict with a high probability that they belong to the same category. If the correlation coefficient is small (\(\min = -1\)), the classifier will predict different categories with a high probability.

Based on the above assumption, the correlation coefficient between samples of the same class is pushed toward larger values by \(\min \left\| X_{embed}^i-X_{embed}^j\right\| _2^2\). In contrast, the correlation coefficient between samples of different classes is pushed toward smaller values by \(\min \left\| X_{embed}^i+X_{embed}^j\right\| _2^2\). In the extreme case, \(X_{embed}^i=X_{embed}^j\) when the correlation coefficient equals 1, and \(X_{embed}^i=-X_{embed}^j\) when the correlation coefficient equals \(-1\).

Fig. 3 Schematic diagram of link constraints

Borrowing the idea of [26], link constraints are added to the loss function. There are two types of links between samples: Must-links and No-links. For the classification task, links between samples of the same class are Must-links and links between samples of different classes are No-links. Figure 3 shows that the embedding vectors of two samples are similar when they share a Must-link, so the embedding vectors can better support downstream tasks such as classification. The constraint is essentially a regularization term, and its formula is:

$$\begin{aligned} Loss = \min \left( f(\beta )+\lambda I(\beta )\right) \end{aligned}$$

(1)

$$\begin{aligned} I(\beta ) =\sum _{p,q}\frac{1}{2}\left\| \beta ^{(p)}-e_{(pq)} \beta ^{(q)}\right\| _{2}^{2} \end{aligned}$$

(2)

$$\begin{aligned} e_{(pq)}= {\left\{ \begin{array}{ll} 1 &{} \hbox {a Must-link between } p \hbox { and } q \\ -1 &{} \hbox {a No-link between } p \hbox { and } q \end{array}\right. } \end{aligned}$$

(3)

Based on Eqs. (1)–(3), the link constraints make the embedding vectors of the same class closer and push those of different classes farther apart. In the experiment, the embedding vector \(\beta\) is \(X_{embed}\), and the function \(f\) is the classifier network (a linear network) that follows the transformer network. Moreover, like other regularization terms, such as L1 and L2, the link constraints are applied only during training. The specific process of the link constraints is detailed in Algorithm 1.

Algorithm 1 Link constraints
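
The following PyTorch sketch is our reading of Eqs. (1)–(3) and Algorithm 1 over a mini-batch, not the authors' released code; averaging over pairs rather than summing is an assumed scaling choice.

```python
import torch

def link_constraint(x_embed, labels):
    """I(beta) of Eq. (2): e_pq = +1 for a Must-link (same class)
    and -1 for a No-link (different classes)."""
    # Pairwise link matrix e_pq built from the batch labels.
    e = (labels[:, None] == labels[None, :]).float() * 2.0 - 1.0
    # Pairwise terms 0.5 * || beta_p - e_pq * beta_q ||_2^2.
    diff = x_embed[:, None, :] - e[:, :, None] * x_embed[None, :, :]
    return 0.5 * diff.pow(2).sum(dim=-1).mean()

# Total training loss of Eq. (1), with an assumed weight lam:
# loss = cross_entropy(classifier(x_embed), labels) \
#        + lam * link_constraint(x_embed, labels)
```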

Although the outputs \(X_{cnn}\) of the CNN [27] could in principle be used as \(\beta\), \(X_{cnn}\) sometimes carries temporal information, i.e., the first element may hold information from the early part of the sequence and the last element information from the late part, so CNN outputs cannot be used directly as embedding vectors. Several additional layers are therefore needed to disperse this temporal ordering, and the outputs of a BiLSTM or a transformer are usually taken as the embedding vectors.

Feature extraction

CNNs have shown outstanding performance in image-classification tasks due to their translation invariance and ability to capture local features [28, 29]. The essence of a convolution kernel is a filter, which makes it especially suitable for feature extraction from ECG signals. A CNN with seven convolution layers, which have different kernel sizes to capture various features, was designed in the present study. Each convolution layer is composed of a convolution filter, a batch-normalization layer [30], an activation layer, and a pooling layer. The parameters of the CNN's layers are shown in Fig. 4.

Fig. 4 CNN layer parameters
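
As a sketch of one such convolution layer (filter, batch normalization, activation, pooling) and of stacking seven of them, the PyTorch snippet below may help; the kernel sizes and channel counts are placeholders for the actual values in Fig. 4.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, kernel):
    """One convolution layer: filter -> batch norm -> activation -> pooling."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel, padding=kernel // 2),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(2),
    )

# Seven stacked layers over the 12-lead input, with assumed channel
# widths and varying kernel sizes to capture features at several scales.
kernels = [15, 11, 9, 7, 5, 3, 3]
cnn = nn.Sequential(*[
    conv_block(12 if i == 0 else 64, 64, k) for i, k in enumerate(kernels)
])
```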

Fig. 5 Structure of the transformer-network encoder

Transformer layers

The transformer network [31] was developed based on the attention mechanism and is composed of an encoder and a decoder. In the ECG signal classification problem, only the encoder part is used; its structure is shown in Fig. 5. The transformer network contains a stack of eight identical layers, each with two sub-layers. The first sub-layer is multi-head attention and the second is a simple fully connected feed-forward network. The two sub-layers are connected by a residual structure followed by a layer-normalization layer, so the output of each sub-layer can be expressed as \(out = LayerNorm(x+Sublayer(x))\), where each sub-layer is constructed independently. To facilitate the residual connections between layers, the sub-layers in the model have a fixed output of 256 dimensions. These sub-layers are described as follows.

  • Scaled dot-product attention. The inputs of the attention function, Q, K, and V, represent the query, key, and value, respectively. The attention weights are calculated according to the similarity between the query and the key, and the attention context is obtained by applying these weights to the values. The model uses scaled dot-product attention, which is calculated as follows (a code sketch of the full encoder is given after this list):

    $$\begin{aligned} Attention(Q,K,V) = softmax\left( \frac{QK^{T}}{\sqrt{d_{k}}}\right) V \end{aligned}$$

    (4)

  • Multi-head attention. The multi-head attention mechanism projects Q, K, and V through h different linear transformations and concatenates the different attention results. Q, K, and V have the same values in the self-attention mechanism. The formula is expressed as follows:

    $$\begin{aligned} MultiHead(Q, K, V) = Concat(head_{1},\ldots ,head_{h}) \end{aligned}$$

    (5)

    $$\begin{aligned} head_{i} =Attention(QW_{i}^{Q},KW_{i}^{K},VW_{i}^{V}) \end{aligned}$$

    (6)

    where \(MultiHead(Q, K, V)\) is the concatenation of the \(head_{i}\).

  • Position-wise feed-forward networks. In addition to the attention sub-layer, each layer of the encoder contains a fully connected feed-forward network: a two-layer linear transformation with a ReLU activation function:

    $$\begin{aligned} FFN(x)= max(0,xW_{1}+b_{1})W_{2}+b_{2} \end{aligned}$$

    (7)

    While the linear transformations are the same across different positions, they use different parameters from layer to layer. The input size of the model is 256 and the size of the hidden layer is 1024.

  • Positional encoding. To make use of the order of the sequence, a “positional encoding,” i.e., the relative or absolute position within the sequence, is added to the input embedding at the bottom of the encoder. The positional-encoding (PE) dimension is \(d_{model}=256\), the same as that of the input embedding:

    $$\begin{aligned} PE_{(pos,2i)}=sin(pos/10000^{2i/d_{model}}) \end{aligned}$$

    (8)

    $$\begin{aligned} PE_{(pos,2i+1)}=cos(pos/10000^{2i/d_{model}}) \end{aligned}$$

    (9)

    where pos is the position and i the dimension.
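
The sketch below assembles the encoder described above from PyTorch's built-in transformer layers, using the stated values \(d_{model}=256\), a feed-forward size of 1024, and eight stacked layers; the number of attention heads h is not specified in the text, so nhead=8 is an assumption.

```python
import torch
import torch.nn as nn

def positional_encoding(length, d_model=256):
    """Sinusoidal encoding of Eqs. (8)-(9): sine on even dimensions,
    cosine on odd dimensions."""
    pos = torch.arange(length, dtype=torch.float32).unsqueeze(1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(length, d_model)
    pe[:, 0::2] = torch.sin(angle)  # Eq. (8)
    pe[:, 1::2] = torch.cos(angle)  # Eq. (9)
    return pe

layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, dim_feedforward=1024)
encoder = nn.TransformerEncoder(layer, num_layers=8)

# X_cnn arrives as (sequence, batch, features); add the positional
# encoding, then take the encoder output as the embedding X_embed.
x_cnn = torch.randn(100, 1, 256)
x_embed = encoder(x_cnn + positional_encoding(100).unsqueeze(1))
```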

Classification layers

The transformer network is connected to the classification layer for multi-class classification. The classification layer is composed of linear layers and activation layers, and it outputs, for each patient, the probability of each type of heart disease.
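
A minimal sketch of such a head follows; the hidden width of 128 is an assumption, as the exact layer sizes are not stated above.

```python
import torch.nn as nn

# Linear and activation layers mapping the 256-dimensional X_embed to
# probabilities over the nine categories of Table 1.
classifier = nn.Sequential(
    nn.Linear(256, 128),  # hidden width is an assumed placeholder
    nn.ReLU(),
    nn.Linear(128, 9),
    nn.Softmax(dim=-1),   # omit when training with CrossEntropyLoss
)
```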