How to predict Quora Question Pairs using Siamese Manhattan LSTM

The article is about Manhattan LSTM (MaLSTM) — a Siamese deep network and its appliance to Kaggle’s Quora Pairs competition . I will do my best to explain the network and go through the Keras code (if you are only here for the code, scroll down 🙂 Full code on Github

Other parameters such as batch size, epochs, and the gradient clipping norm value are chosen by me.

This is where I will diverge a little from the original paper. For the sake of simplicity, I do not use a specific weight initialization scheme and do not pretrain it on a different task.

The optimizer of choice in the article is the Adadelta optimizer , which can be read about in this article . We also use gradient clipping to avoid the exploding gradient problem. You may find a nice explanation of the gradient clipping in this video from the Udacity deep learning course.

and since we have an exponent of a negative the output (the prediction in our case) will be between 0 and 1.

By now we have the two vectors that hold the semantic meaning of each question. We put them through the defined similarity function (below)

We have two embedded matrices that represent a candidate of two similar questions. Then we feed them into the LSTM (practically, there is only one) and the final state of the LSTM for each question is a 50-dimensional vector. It is trained to capture semantic meaning of the question. In figure 1, this vector is denoted by the letter h. If you don’t entirely understand LSTMs, I suggest reading this wonderful post .

Inputs to the network are zero-padded sequences of word indices. These inputs are vectors of fixed length, where the first zeros are being ignored and the nonzeros are indices that uniquely identify words. Those vectors are then fed into the embedding layer. This layer looks up the corresponding embedding for each word and encapsulates all them into a matrix. This matrix represents the given text as a series of embeddings. I use Google’s word2vec embedding, same as in the original paper. The process is depicted in figure 2.

In MaLSTM the identical sub-network is all the way from the embedding up to the last LSTM hidden state. Word embedding is a modern way to represent words in deep learning models. More about it can be found in this nice blog post . Essentially it’s a method to give words semantic meaning in a vector representation.

So first of all, what is a “Siamese network”? Siamese networks are networks that have two or more identical sub-networks in them. Siamese networks seem to perform well on similarity tasks and have been used for tasks like sentence semantic similarity, recognizing forged signatures and many more.

(I will be using Keras , so some technical details are related to the implementation)

Few months ago I came across a very nice article called Siamese Recurrent Architectures for Learning Sentence Similarity .It offers a pretty straightforward approach to the common problem of sentence similarity. Named MaLSTM (“Ma” for Manhattan distance), its architecture is depicted in figure 1 (diagram excludes the sentence preprocessing part). Notice that since this is a Siamese network, it is easier to train because it shares weights on both sides.

In the past few years, deep learning is all the fuss in the tech industry. To keep up on things I like to get my hands dirty implementing interesting network architectures I come across in article readings.

CODE

Preprocessing

Figure 3 First lines of the raw training dataset

train_df = pd.read_csv(TRAIN_CSV)
test_df = pd.read_csv(TEST_CSV)

vocabulary = dict()

inverse_vocabulary = ['<unk>']
# '<unk>' will never be used, it is only a placeholder for the
# [0, 0, ....0] embedding

word2vec = KeyedVectors.load_word2vec_format(EMBEDDING_FILE,binary=True)

questions_cols = ['question1', 'question2']

# Iterate over the questions only of both training and test datasets
for dataset in [train_df, test_df]:
for index, row in dataset.iterrows():

# Iterate through the text of both questions of the row
for question in questions_cols:

q2n = [] # q2n -> question numbers representation
for word in text_to_word_list(row[question]):

# Check for unwanted words
if word in stops and word not in word2vec.vocab:
continue

if
word not in vocabulary:
vocabulary[word] = len(inverse_vocabulary)
q2n.append(len(inverse_vocabulary))
inverse_vocabulary.append(word)
else:
q2n.append(vocabulary[word])

# Replace questions with lists of word indices
dataset.set_value(index, question, q2n)

Figure 4 First lines of the converted to word indices training dataset.

Embedding matrix

embedding_dim = 300

# This will be the embedding matrix
embeddings = 1 * np.random.randn(len(vocabulary) + 1, embedding_dim)
embeddings[0] = 0 # So that the padding will be ignored


# Build the embedding matrix
for word, index in vocabulary.items():
if word in word2vec.vocab:
embeddings[index] = word2vec.word_vec(word)

Data preparation

# Split to train validation
validation_size = 40000
training_size = len(train_df) - validation_size

X = train_df[questions_cols]
Y = train_df['is_duplicate']

X_train, X_validation, Y_train, Y_validation = train_test_split(X, Y, test_size=validation_size)

# Split to dicts
X_train = {'left': X_train.question1, 'right': X_train.question2}
X_validation = {'left': X_validation.question1, 'right': X_validation.question2}
X_test = {'left': test_df.question1, 'right': test_df.question2}

# Convert labels to their numpy representations
Y_train = Y_train.values
Y_validation = Y_validation.values

# Zero padding
for dataset, side in itertools.product([X_train, X_validation], ['left', 'right']):
dataset[side] = pad_sequences(dataset[side], maxlen=max_seq_length)

Model

def exponent_neg_manhattan_distance(left, right):
return K.exp(-K.sum(K.abs(left-right), axis=1, keepdims=True))

# The visible layer
left_input = Input(shape=(max_seq_length,), dtype='int32')
right_input = Input(shape=(max_seq_length,), dtype='int32')

embedding_layer = Embedding(len(embeddings), embedding_dim, weights=[embeddings], input_length=max_seq_length, trainable=False)

# Embedded version of the inputs
encoded_left = embedding_layer(left_input)
encoded_right = embedding_layer(right_input)

# Since this is a siamese network, both sides share the same LSTM
shared_lstm = LSTM(n_hidden)

left_output = shared_lstm(encoded_left)
right_output = shared_lstm(encoded_right)

# Calculates the distance as defined by the MaLSTM model
malstm_distance = Lambda(function=lambda x: exponent_neg_manhattan_distance(x[0], x[1]),output_shape=lambda x: (x[0][0], 1))([left_output, right_output])

# Pack it all up into a model
malstm = Model([left_input, right_input], [malstm_distance])

# Adadelta optimizer, with gradient clipping by norm
optimizer = Adadelta(clipnorm=gradient_clipping_norm)

malstm.compile(loss='mean_squared_error', optimizer=optimizer, metrics=['accuracy'])

Training and results

malstm_trained = malstm.fit([X_train['left'], X_train['right']], Y_train, batch_size=batch_size, nb_epoch=n_epoch,
validation_data=([X_validation['left'], X_validation['right']], Y_validation),
callbacks=[checkpointer])

Improvements