Siamese NN Recipes with Keras
Practical Siamese neural network recipes with Keras and BERT for semantic similarity tasks
I have been enjoying Siamese networks for different NLU tasks at my work for quite some time. In this article, I'll share quick recipes with Keras, featuring GloVe vectors or BERT as the text vectorizer. We'll focus on semantic similarity calculations. Semantic similarity is basically the task of determining whether two pieces of text are related in meaning; it is usually calculated between a pair of text segments, and that is what I'll show here: how to compare two texts.
A Siamese network is a neural network with two or more inputs (typically the number of inputs is two; otherwise one has to define three-way distance functions). We encode the input texts, then feed the encoded vectors to a distance layer, and finally run a classification layer on top of the distance layer. The distance can be cosine distance, L1 distance, exponential negative Manhattan distance or any other distance function. Here's a Siamese network as a black box:
Siamese NN from a high level: we feed two inputs and the NN outputs whether they're semantically related
The input text can be encoded with an LSTM, the Universal Sentence Encoder or BERT. With word embeddings and an LSTM encoder, the architecture looks like the following figure:
Siamese architecture in detail. We first encode the input sentences with LSTM/BERT, then we feed the encoded vector pair to a distance layer.
Here’s the Keras recipe with LSTM and a pretrained Embedding layer:
from keras import backend as K
from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Bidirectional, Lambda, Dense
from keras.initializers import Constant

# Two inputs, one for each sentence of the pair
first_sent_in = Input(shape=(MAX_LEN,))
second_sent_in = Input(shape=(MAX_LEN,))

# Shared embedding layer, initialized with the pretrained GloVe matrix
embedding_layer = Embedding(input_dim=n_words+1, output_dim=embed_size, embeddings_initializer=Constant(embedding_matrix), input_length=MAX_LEN, trainable=True, mask_zero=True)
first_sent_embedding = embedding_layer(first_sent_in)
second_sent_embedding = embedding_layer(second_sent_in)

# Shared BiLSTM encoder
lstm = Bidirectional(LSTM(units=256, return_sequences=False))
first_sent_encoded = lstm(first_sent_embedding)
second_sent_encoded = lstm(second_sent_embedding)

# Element-wise L1-based distance between the two encoded sentences
l1_norm = lambda x: 1 - K.abs(x[0] - x[1])
merged = Lambda(function=l1_norm, output_shape=lambda x: x[0], name='L1_distance')([first_sent_encoded, second_sent_encoded])
predictions = Dense(1, activation='sigmoid', name='classification_layer')(merged)

model = Model([first_sent_in, second_sent_in], predictions)
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

model.fit([fsents, ssents], labels, validation_split=0.1, epochs=20, shuffle=True, batch_size=256)
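In the fit call, fsents and ssents are padded sequences of word indices for the first and second sentences of each pair, and labels is a binary array. Here's a minimal sketch of how they and the GloVe-based embedding_matrix could be prepared; first_texts, second_texts and glove_index (a dict from word to GloVe vector, loaded elsewhere) are my placeholder names, not part of the recipe above:

import numpy as np
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

MAX_LEN = 200
embed_size = 100

# One shared tokenizer over both sides of the pairs
tokenizer = Tokenizer()
tokenizer.fit_on_texts(first_texts + second_texts)
n_words = len(tokenizer.word_index)

# Words -> indices, padded/truncated to MAX_LEN
fsents = pad_sequences(tokenizer.texts_to_sequences(first_texts), maxlen=MAX_LEN)
ssents = pad_sequences(tokenizer.texts_to_sequences(second_texts), maxlen=MAX_LEN)

# Pretrained embedding matrix: row i holds the GloVe vector of the word with index i
embedding_matrix = np.zeros((n_words + 1, embed_size))
for word, i in tokenizer.word_index.items():
    if word in glove_index:
        embedding_matrix[i] = glove_index[word]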
The model summary (printed by model.summary() above) should look like this:
Layer (type) Output Shape Param # Connected to
==================================================================================================
input_1 (InputLayer) [(None, 200)] 0
__________________________________________________________________________________________________
input_2 (InputLayer) [(None, 200)] 0
__________________________________________________________________________________________________
embedding (Embedding) (None, 200, 100) 1633800 input_1[0][0]
input_2[0][0]
__________________________________________________________________________________________________
bidirectional (Bidirectional) (None, 512) 731136 embedding[0][0]
embedding[1][0]
__________________________________________________________________________________________________
L1_distance (Lambda) (None, 512) 0 bidirectional[0][0]
bidirectional[1][0]
__________________________________________________________________________________________________
classification_layer (Dense) (None, 1) 1026 L1_distance[0][0]
==================================================================================================
Total params: 2,365,962
Trainable params: 2,365,962
Non-trainable params: 0
__________________________________________________________________________________________________
The two input layers take the two texts to be compared. Then we feed the input words to the Embedding layer to get a word embedding for each input word. After that, we feed the embedding vectors of the first sentence and of the second sentence to the shared LSTM layer separately, and get a dense representation for each text (the variables first_sent_encoded and second_sent_encoded).

Now comes the tricky part, the merge layer. The merge layer takes the dense representations of the first and second texts and computes the distance between them. If you look at the fourth layer of the model summary, you see that L1_distance (technically a Keras Lambda layer) accepts two inputs, both of which are outputs of the bidirectional LSTM layer. The result is a 512-dimensional vector, and we feed this vector to the classifier. The classifier outputs a single value between 0 and 1, interpreted as similar or not similar, because I'm doing binary classification. At the classification layer, I squash the 512-dim distance vector with a sigmoid (because I like this activation function a lot :)), and I compile the model with binary_crossentropy because, again, it's a binary classification task.
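As a quick sanity check, here's a toy example (with 4-dimensional vectors instead of 512) of what the L1_distance Lambda computes; the vectors and the numpy code are purely for illustration:

import numpy as np

# Toy stand-ins for the two BiLSTM encodings (4 dims instead of 512 for readability)
first_encoded = np.array([0.9, -0.2, 0.5, 0.1])
second_encoded = np.array([0.8, -0.1, -0.5, 0.1])

# The same computation as the l1_norm lambda above: 1 - |a - b|, element-wise
merged = 1 - np.abs(first_encoded - second_encoded)
print(merged)  # [0.9 0.9 0.  1. ] -- dimensions that agree stay near 1, disagreeing ones drop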
In this recipe I used L1 distance to calculate the distance between the encoded vectors. You can use cosine distance or any other distance, as in the sketch below. I particularly like L1 distance because it's not that smooth as a function. The same applies to the sigmoid function: it provides the nonlinearity that neural networks for language need.
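For example, here's a minimal sketch of how the L1_distance layer above could be swapped for cosine similarity or the exponential negative Manhattan distance; the lambda names are mine, everything else follows the recipe above:

from keras import backend as K
from keras.layers import Lambda

# Cosine similarity between the two encodings -> one score per pair, shape (batch, 1)
cosine_sim = lambda x: K.sum(K.l2_normalize(x[0], axis=-1) * K.l2_normalize(x[1], axis=-1),
                             axis=-1, keepdims=True)

# Exponential negative Manhattan distance -> also one score per pair, shape (batch, 1)
exp_neg_manhattan = lambda x: K.exp(-K.sum(K.abs(x[0] - x[1]), axis=-1, keepdims=True))

# Drop-in replacement for the L1_distance layer in the recipe above
merged = Lambda(cosine_sim, output_shape=(1,), name='cosine_distance')(
    [first_sent_encoded, second_sent_encoded])

Note that these two produce a single similarity score per pair rather than a 512-dimensional vector, so the Dense classification layer on top effectively just learns a threshold.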
The recipe with BERT is just a bit different: we take out the Embedding + LSTM layers and put a BERT layer in their place (BERT vectors already capture more than enough sequential information!):
import keras
from keras.models import Model
from keras.layers import Input, Dense, concatenate

fsent_inputs = Input(shape=(MAX_L,), dtype="int32")   # token ids of the first sentence
fsent_encoded = bert_model(fsent_inputs)
fsent_encoded = fsent_encoded[1]                      # pooled [CLS] output

ssent_inputs = Input(shape=(MAX_L,), dtype="int32")   # token ids of the second sentence
ssent_encoded = bert_model(ssent_inputs)
ssent_encoded = ssent_encoded[1]                      # pooled [CLS] output

# Concatenate the two sentence vectors and classify
merged = concatenate([fsent_encoded, ssent_encoded])
predictions = Dense(1, activation='sigmoid', name='classification_layer')(merged)

model = Model([fsent_inputs, ssent_inputs], predictions)
adam = keras.optimizers.Adam(learning_rate=2e-6, epsilon=1e-08)
model.compile(loss="binary_crossentropy", metrics=["accuracy"], optimizer=adam)
Here, we again feed the pair of text inputs to the BERT layer separately and get their encodings fsent_encoded and ssent_encoded. We use the pooled [CLS] token embedding, which captures a summary representation of the whole sentence. (The BERT layer has two outputs: the first is a (MAX_L, 768) tensor with a vector for each token of the input sentence, and the second is the pooled vector of the [CLS] token. We use the pooled output by calling fsent_encoded = fsent_encoded[1] and ssent_encoded = ssent_encoded[1].) The optimizer is again Adam, but with a somewhat different learning rate: we lower it to keep BERT from behaving too aggressively and overfitting, because BERT overfits quickly if we don't prevent it. The loss is again binary cross-entropy because we're doing a binary classification task. Basically, I replaced the Embedding + LSTM layers with a BERT layer; the rest of the architecture is the same.
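For reference, here's a minimal sketch of how bert_model and the token id arrays fed to it could be prepared with the Hugging Face transformers library. This assumes TFBertModel with return_dict=False (so that indexing [1] returns the pooled output); first_texts, second_texts and the training hyperparameters in the fit call are my placeholders, not part of the recipe above:

import numpy as np
from transformers import BertTokenizer, TFBertModel

MAX_L = 150

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# return_dict=False so the layer returns a (token_vectors, pooled_output) tuple
bert_model = TFBertModel.from_pretrained("bert-base-uncased", return_dict=False)

def encode(texts):
    # Raw sentences -> padded/truncated arrays of token ids of length MAX_L
    enc = tokenizer(texts, padding="max_length", truncation=True,
                    max_length=MAX_L, return_tensors="np")
    return enc["input_ids"]

fsents = encode(first_texts)
ssents = encode(second_texts)

# Only the token ids are fed here, as in the recipe above (no attention mask input)
model.fit([fsents, ssents], np.array(labels), validation_split=0.1, epochs=3, batch_size=16)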