Have you ever wondered how facial recognition works?
With text or number-based tasks in machine learning, we can teach our machine what is important; for example, if I wanted to predict your income, I could teach the machine to learn from important "features", like your education, age, and career. What about images? What "features" can a machine learn to recognize your face?
This analysis uses the Yale Face Database to generate and examine features that can predict whose face is in a given picture. The goal is to build an intuition for RBM output on image datasets.
You can fork this project from my GitHub repo here.
The Yale Face Database contains 165 grayscale images in GIF format of 15 individuals. There are 11 images per subject, one per different facial expression or configuration: center-light, w/glasses, happy, left-light, w/no glasses, normal, right-light, sad, sleepy, surprised, and wink.
Credit also goes to the creators of this normalized version of the dataset. The Centered versions of the images are used in the analysis below. Here are a few selected examples:
Simple - pixels! More specifically, computers see a huge matrix of numbers that describe the color content of each pixel in an image. These images are grayscale, so each pixel is assigned a single number that describes how bright the pixel is:
If the image is colored, it's only a little more complicated: each pixel gets 4 numbers representing the RGBA color channels. These 4 numbers describe how much Red, Green, Blue, and Alpha (transparency) compose that pixel. Most operations done on images are therefore just linear algebra computations.
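As a toy illustration (the pixel values here are made up), a grayscale image is just a 2D array of brightness values, a color image adds a channel axis, and "operations on images" really are linear algebra:

```python
import numpy as np

# A made-up 2x3 grayscale "image": one brightness value per pixel
gray = np.array([[0.0, 0.5, 1.0],
                 [0.2, 0.8, 0.3]])

# A made-up 2x2 RGBA image: four channel values per pixel
rgba = np.zeros((2, 2, 4), dtype=np.uint8)
rgba[0, 0] = [255, 0, 0, 255]   # top-left pixel: opaque pure red

# Brightening the image is just scalar multiplication (clipped to stay valid)
brighter = np.clip(gray * 1.2, 0.0, 1.0)
```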
Again, the goal is to have some features we can teach a machine to look for. Since a machine just sees a matrix, we need some way for it to learn in broad strokes - we don't want to do anything pixel-by-pixel. This is where Deep Learning comes in.
A Restricted Boltzmann Machine (RBM) is a Neural Network with only 2 layers: one visible and one hidden.
The visible layer holds the inputs; in this case, the images. If training is successful, the hidden layer will ultimately hold information about useful features.
For a deeper dive into how RBMs work, I like this video; for now, here's a simpler way to think about it. The input layer sends our image data to the hidden layer, and the hidden layer tries to describe it back to the input layer. The catch is that the hidden layer is "restricted": the nodes in that layer cannot communicate with each other; each can only communicate with the input layer's nodes.
Like other neural nets, training involves some number of these back-and-forth passes, called epochs. Each epoch, the input layer uses some loss function to describe how poorly the hidden layer is doing and suggests improvements to each node.
The hope is that after enough training passes, each of the nodes in the hidden layer will have "focused" on certain image characteristics that contain information about what's in the image.
You could think about it this way: perhaps one node could learn to focus on eye color, one on nose shape, and so on. That wouldn't be too far off conceptually, except that the features found by nodes are unlikely to be so intuitively human, as we'll see soon.
Once trained, the hidden layer will hold features that can be used to make predictions about brand new images. Let's see a Python implementation.
We can use skimage.io to read the images into memory:
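A sketch of the loading step (the folder name and file pattern are assumptions about where you've saved the centered images; adjust them to your local layout):

```python
import glob
from skimage import io

# Assumed local path to the centered Yale images; change to match your setup
paths = sorted(glob.glob("centered/*.gif"))

# Each entry becomes a 2D numpy array of pixel brightness values
imgs = [io.imread(p) for p in paths]
```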
"imgs" is now a list of arrays that each represent an image.
Before feeding these into an RBM, consider resizing them. Right now they are 231x195, or about 45 thousand pixels each. If we shrink them to 77x65, we reduce that to about 5 thousand pixels, which can significantly improve the speed of training. Finally, we want to flatten each 77x65 image into a single 1x5005 row, so the RBM receives a flat input layer:
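The resize-and-flatten step might look like this (using stand-in random arrays so the sketch runs on its own; in the real analysis, `imgs` comes from the loading step above):

```python
import numpy as np
from skimage.transform import resize

# Stand-in for the loaded images (the real ones are 231x195 grayscale arrays)
imgs = [np.random.rand(231, 195) for _ in range(3)]

# Shrink each image to 77x65, then flatten it into one 5005-element row
X = np.array([resize(im, (77, 65), anti_aliasing=True).ravel() for im in imgs])
print(X.shape)  # one row per image, 77 * 65 = 5005 columns each
```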
Next, we can set up a Pipeline that feeds the data through an RBM, and then the output can be used by a Logistic Regression to classify the different faces.
We'll use a learning rate of 0.01. Just like any other neural net, this refers to how quickly the nodes react to "suggestions" from the loss function. In this case we'll create 150 nodes in the hidden layer, which means we'll get 150 features to use to predict who's in each picture.
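A sketch of that setup with scikit-learn (`max_iter` is an assumption to keep the classifier from hitting its iteration limit, not a tuned value):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# 150 hidden nodes -> 150 learned features; learning rate 0.01 as discussed
rbm = BernoulliRBM(n_components=150, learning_rate=0.01, random_state=0)
logistic = LogisticRegression(max_iter=1000)

# The RBM's features feed straight into the classifier
pipeline = Pipeline([("rbm", rbm), ("logistic", logistic)])
```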
Before we actually feed this pipeline to our images, we also need to feed it labels, as Logistic Regression is a supervised learning technique. The Yale images are named in a certain order, so you can create a target variable this way:
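Since there are 15 subjects with 11 images each, and sorted filenames group each subject's images together, one way to build the labels is (a sketch assuming that sort order):

```python
# 15 subjects x 11 expressions; sorted filenames put each subject's
# 11 images together, so the labels repeat in blocks of 11
y = [subject for subject in range(15) for _ in range(11)]
```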
We're ready to train the pipeline:
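Training is a single `fit` call. Here's a minimal runnable sketch with small synthetic stand-ins; in the real analysis, `X` is the 165x5005 pixel matrix and `y` the subject labels built above:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Synthetic stand-ins so this sketch runs quickly on its own
rng = np.random.RandomState(0)
X = rng.rand(165, 50)                           # BernoulliRBM expects values in [0, 1]
y = [s for s in range(15) for _ in range(11)]   # 15 subjects x 11 images

pipeline = Pipeline([
    ("rbm", BernoulliRBM(n_components=150, learning_rate=0.01, random_state=0)),
    ("logistic", LogisticRegression(max_iter=1000)),
])
pipeline.fit(X, y)  # trains the RBM, then the classifier on its features
```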
Now we have a model. Before we take a closer look at what the RBM has done, we can predict the training data. (Overfitting applies here, so as always, testing a model on the same data it trained on is poor practice. This project is really about understanding RBMs and facial recognition, so we'll take the shortcut.)
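Prediction and the report are one line each (repeating the quick synthetic setup so this snippet also runs on its own; on the real data, reuse the fitted pipeline):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

# Stand-in data; in the real analysis X/y are the images and labels above
rng = np.random.RandomState(0)
X = rng.rand(165, 50)
y = [s for s in range(15) for _ in range(11)]
pipeline = Pipeline([
    ("rbm", BernoulliRBM(n_components=150, learning_rate=0.01, random_state=0)),
    ("logistic", LogisticRegression(max_iter=1000)),
]).fit(X, y)

preds = pipeline.predict(X)             # predicting the same data we trained on
print(classification_report(y, preds))  # per-subject precision/recall summary
```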
You can see the full output of this report here, but this model correctly identifies the person in 96% of the training images.
With the model fitted, we can access the hidden layer through the RBM's components_ attribute. As a reminder, these will be computer representations of images: big matrices. The below code accesses all 150 components and converts them to images we can examine.
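Each row of `components_` has one weight per input pixel, so it can be reshaped back to the 77x65 image dimensions for viewing. A sketch (fitting on random stand-in data here; in the real analysis you'd use the pipeline's fitted "rbm" step):

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

# Stand-in fit on random data sized like the real input (165 images x 5005 pixels)
rng = np.random.RandomState(0)
rbm = BernoulliRBM(n_components=150, random_state=0).fit(rng.rand(165, 5005))

comps = rbm.components_                     # shape (150, 5005): one row per hidden node
component_images = comps.reshape(-1, 77, 65)  # back to image dimensions
```

Each slice of `component_images` can then be displayed with, for example, matplotlib's `imshow` using a grayscale colormap.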
We can take a closer look at a few of these:
For some hidden components, like the left and center examples above, it's intuitive to see how they store information about image content and how they contribute to reconstructions of the original (during training) or to predictions. Others, like the right example, are less intuitive. This is NOT an indication of usefulness: even if a hidden component doesn't explain a pattern to the human eye, it can still contain information that a machine can process, and can combine with other hidden components to make accurate predictions.