
Face Recognition using Siamese Network

This project detects and recognizes human faces in video streams. The input can be either a video file or a real-time feed from a webcam. MTCNN and Haar Cascades are used to detect and crop faces. A Siamese Network compares two faces and classifies whether they belong to the same person. The distance between face encodings generated by the Encoder network (Inception-ResNet-v1) serves as the metric for judging the similarity of two faces. The Encoder network is trained using Triplet Loss, which requires efficient Triplet Mining.

A detailed description of this project, along with the results, can be found in the Project Description and Results section.

Getting Started

Prerequisites

Running this project on your local system requires the following packages to be installed :

numpy
matplotlib
PIL
mtcnn
cv2
keras

They can be installed from the Python Package Index using pip as follows :

 pip install numpy
 pip install matplotlib
 pip install Pillow
 pip install mtcnn
 pip install opencv-python
 pip install Keras

You can also use Google Colab in a Web Browser without needing to install the mentioned packages.

Note: This project was implemented and tested with TensorFlow v1 and might not be compatible with the changes introduced in v2. On Colab, you can select the TensorFlow version using %tensorflow_version 1.x. Run a cell containing this command before importing tensorflow or any other package that has tensorflow as a dependency.

Usage

This project is implemented as an interactive Jupyter Notebook. You just need to open the notebook on your local system or on Google Colab and execute the code cells in sequential order. The function of each code cell is properly explained with the help of comments.

Before executing Face_Recognition :

Create a folder named Face_database in the root directory.
Place images of known persons whom you want to recognize in this folder.
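
The two steps above can be sketched in Python as follows. This is a minimal sketch, not the notebook's own code: the helper name list_known_faces, the set of accepted extensions, and the convention that a file's name (without its extension) is the person's name are all assumptions made for illustration.

```python
from pathlib import Path

# Assumed set of image extensions; extend as needed.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}


def list_known_faces(database_dir="Face_database"):
    """Create the folder if it is missing and map person name -> image path.

    Hypothetical convention: 'alice.jpg' is treated as a photo of 'alice'.
    """
    db = Path(database_dir)
    db.mkdir(parents=True, exist_ok=True)
    return {
        path.stem: path
        for path in sorted(db.iterdir())
        if path.is_file() and path.suffix.lower() in IMAGE_EXTENSIONS
    }
```

Calling list_known_faces() once before running the notebook creates an empty Face_database folder if needed and shows which images would be picked up.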

Before starting, also make sure that the paths to the various files and folders in the notebook are updated to match your working environment. If you are using Google Colab, then :

Mount Google Drive using :

 from google.colab import drive
 drive.mount('/content/drive')

Update file/folder locations as '/content/drive/path_to_file_or_folder'.

Tools Used

NumPy : Used for storing and manipulating high dimensional arrays.
Matplotlib : Used for plotting.
PIL : Used for loading image files.
MTCNN : Used for detecting and cropping faces.
OpenCV : Used for loading Haar Cascades and manipulating video streams.
Keras : Used for designing and training the Encoder model.
Google Colab : Used as the development environment for running heavy computations on its backend GPUs/TPUs and for editing the Jupyter Notebook.

Contributing

You are welcome to contribute :

Fork it (https://github.com/rohanrao619/Face_Recognition_using_Siamese_Network/fork)
Create new branch : git checkout -b new_feature
Commit your changes : git commit -am 'Added new_feature'
Push to the branch : git push origin new_feature
Submit a pull request !

Future Work

Liveness Detection
Speed Optimization

License

This project is licensed under the MIT License. See the LICENSE file for details.

Project Description and Results

Face Detection

Multi-task Cascaded Convolutional Networks (MTCNN) is used for face detection during the training process because of its high accuracy. Haar Cascades is preferred for real-time applications because MTCNN is computationally expensive. This choice trades a small amount of accuracy for speed.

Encoder Network

Inception ResNet v1 is used as the Encoder network for generating face encodings in this project. It expects 160x160x3 RGB images, with pixel values normalized across all 3 channels, and generates 128-dimensional face encodings.
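
As a rough illustration of the input normalization, the sketch below standardizes pixel values to zero mean and unit variance. Reading "normalized" as per-image standardization is an assumption (the notebook may normalize differently), as are the helper name standardize and the use of a flat list of values in place of a 160x160x3 array.

```python
import math


def standardize(pixels):
    """Scale a non-empty flat list of pixel values to zero mean, unit variance."""
    n = len(pixels)
    mean = sum(pixels) / n
    variance = sum((p - mean) ** 2 for p in pixels) / n
    # Lower bound on the standard deviation avoids division by zero
    # for constant (e.g. all-black) images.
    std = max(math.sqrt(variance), 1.0 / math.sqrt(n))
    return [(p - mean) / std for p in pixels]
```

In practice the same arithmetic would be applied to the whole image array in one vectorized step.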

The detailed architecture of the Encoder network can be found here.

Training the Encoder Network

Encoder network is trained using the following trainer model :

This trainer model is fed with a batch of Triplets. A Triplet is a set of 3 faces (Anchor, Positive and Negative). Anchor and Positive are faces of the same person, whereas Negative is the face of another person. This Trainer model tries to minimize the Triplet Loss (calculated using face encodings generated by the Encoder network for this batch).

Triplet Loss

Triplet loss for a batch of Triplets is calculated as :
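
One form consistent with the definitions in this section and with the triplet categories under Triplet Mining is given here. Summing (rather than averaging) over the batch and using unsquared distances are assumptions; the project may differ on either point.

```latex
\mathcal{L} = \sum_{i=1}^{m} \max\Big(
    \mathrm{distance}\big(\mathrm{anchor}^{(i)}, \mathrm{positive}^{(i)}\big)
  - \mathrm{distance}\big(\mathrm{anchor}^{(i)}, \mathrm{negative}^{(i)}\big)
  + \alpha,\; 0 \Big)
```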

Here m denotes the number of triplets in the batch, and anchor, positive and negative with superscript i are the face encodings (generated by the Encoder network) for the i-th triplet in the batch.

Alpha is the minimum margin by which the two distances should be separated.

distance is the Euclidean distance between two 128-dimensional encodings. Each encoding is L2-normalized before this distance is calculated.
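
The calculation described in this section can be sketched in plain Python. This is an illustration under stated assumptions, not the notebook's Keras implementation: the function names are hypothetical, the per-triplet terms are summed over the batch (the original may average), and the default margin of 0.5 is taken from the Results section.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def distance(u, v):
    """Euclidean distance between two encodings after L2 normalization."""
    u, v = l2_normalize(u), l2_normalize(v)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(triplets, alpha=0.5):
    """Sum of hinge terms over a batch of (anchor, positive, negative) encodings."""
    return sum(
        max(distance(a, p) - distance(a, n) + alpha, 0.0)
        for a, p, n in triplets
    )
```

A triplet whose negative is already farther from the anchor than the positive by more than alpha contributes nothing to the sum; any other triplet contributes a positive term.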

Triplet Mining

There are three kinds of triplets :

Easy Triplet : distance(anchor,positive) + alpha < distance(anchor,negative)
Semi-hard Triplet : distance(anchor,positive) < distance(anchor,negative) < distance(anchor,positive) + alpha
Hard Triplet : distance(anchor,negative) < distance(anchor,positive)

Easy triplets have a loss of 0, so they contribute nothing to training the Encoder network. Hard and Semi-hard triplets are therefore preferred, and Triplet Mining is required to find the triplets that have the greatest impact on the training process.
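
The three categories can be expressed as a small helper that takes the two distances of a triplet. This is a sketch with hypothetical function names. The definitions above use strict inequalities, so placing the boundary cases (exact equality) in the semi-hard category is an assumption made here.

```python
def triplet_kind(d_ap, d_an, alpha=0.5):
    """Classify a triplet from distance(anchor, positive) and distance(anchor, negative)."""
    if d_ap + alpha < d_an:
        return "easy"
    if d_an < d_ap:
        return "hard"
    # Remaining case: d_ap <= d_an <= d_ap + alpha (boundaries included by assumption).
    return "semi-hard"


def triplet_term(d_ap, d_an, alpha=0.5):
    """Contribution of a single triplet to the triplet loss."""
    return max(d_ap - d_an + alpha, 0.0)
```

For an easy triplet triplet_term returns 0, while hard triplets always yield a term larger than alpha, which is why mining concentrates on the hard and semi-hard cases.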

Siamese Network

As shown, two faces are compared to determine whether they belong to the same person. A given input face is checked against all the faces in the Face_database (faces of known persons). The person whose face has the minimum distance to the input face is identified as the target, provided that distance < threshold. The threshold value needs to be tuned for the application.
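
The lookup described here amounts to a nearest-neighbour search with a cutoff. The sketch below assumes the database is a dictionary mapping each person's name to one L2-normalized encoding; the names identify and euclidean and this data layout are illustrative, not taken from the notebook.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two encodings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def identify(encoding, database, threshold):
    """Return (name, distance) for the closest known face.

    The name is None when no known face is closer than the threshold,
    which includes the case of an empty database.
    """
    best_name, best_dist = None, float("inf")
    for name, known in database.items():
        d = euclidean(encoding, known)
        if d < best_dist:
            best_name, best_dist = name, d
    if best_dist < threshold:
        return best_name, best_dist
    return None, best_dist
```

Returning the distance along with the name makes it easier to tune the threshold: too small a value rejects known persons, too large a value matches strangers to someone in the database.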

Results

For illustration, the Encoder network was trained using triplets generated from a small subset of the Labeled Faces in the Wild (LFW) dataset. The subset consists of 200 images of 20 celebrities (10 images per person). You can see this dataset here.

Batches of size 100 with 75 hard and 25 random triplets (to respect the dataset distribution) were drawn. A Margin of 0.5 was used. The model was trained for 100 epochs using Adam Optimizer with a learning rate of 0.0005, producing the following result :

The trained Encoder network produced decent results, considering that it was trained on such a small dataset. A few triplets with their positive and negative distances (calculated using the trained Encoder network) :

Webcam Demo

As you may have guessed by now, training an Encoder network requires a huge dataset and a lot of computational power, which is not practical on a typical local machine. So, for real-time face recognition on video streams, I used pre-trained weights for the Inception ResNet v1 Encoder network provided by David Sandberg.

As can be seen, our Face Recognizer model is pretty good at detecting multiple faces and recognizing them. But it recognizes only me and not Dr. Andrew Ng. Why? Because he is not in the Face_database!