In this entry, we will develop a deep learning algorithm - a sub-field of machine learning inspired by the structure of the human brain (neural networks) - to classify images of single digits (0-9).
As we may have experienced, there has been a lot of progress in computer vision systems that can recognize images, such as Facebook's tagging system, speech recognition (ever use Google Home or Alexa?), and even medical image analysis of cancer cells or injuries in X-ray images. A lot of these technologies are based on deep learning, a machine learning technique that teaches machines to go through a learning process similar to that of humans (or as close as it can be). For this entry, it is basically like I am teaching toddlers to recognize numbers, but they learn at a much faster rate because they are machines (well, I don't have to feed them, and they don't cry when I order them to go through like 1,000 math questions).
The adjective "deep" comes from the fact that the structure of this algorithm comprises multiple layers of networks between the input and the output, and that we can hardly explain what is going on in the middle of the process. We only know whether the computer learns or not; for that reason, this machine learning approach is also known as a black-box approach.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import seaborn as sns

np.random.seed(2)

from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
import itertools

from keras.utils.np_utils import to_categorical # convert to one-hot encoding
from keras.models import Sequential
from keras.layers import Dense, Dropout, Flatten, Conv2D, MaxPool2D
from tensorflow.keras.optimizers import RMSprop
from keras.preprocessing.image import ImageDataGenerator
from keras.callbacks import ReduceLROnPlateau

sns.set(style='white', context='notebook', palette='deep')
We will load the training and testing data sets to train the machine and test whether it has learned as intended. Normally, data for machine learning come in the form of text or numbers, but if our algorithm is complex enough, it can even process images or sounds.
I have also plotted the distribution of all digits (0 to 9) in the training images below. The training data set has 42,000 images and the testing data set has 28,000 images.
# Load the data
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

Y_train = train["label"]

# Drop 'label' column
X_train = train.drop(labels = ["label"], axis = 1)

g = sns.countplot(x = Y_train)
plt.show()
# Check the data
Y_train.value_counts()
1    4684
7    4401
3    4351
9    4188
2    4177
6    4137
0    4132
4    4072
8    4063
5    3795
Name: label, dtype: int64
# Check for missing values
X_train.isnull().any().describe()
count       784
unique        1
top       False
freq        784
dtype: object
test.isnull().any().describe()
count       784
unique        1
top       False
freq        784
dtype: object

# Normalize the data
X_train = X_train / 255.0
test = test / 255.0
# Reshape images in 3 dimensions (height = 28px, width = 28px, channel = 1)
X_train = X_train.values.reshape(-1, 28, 28, 1)
test = test.values.reshape(-1, 28, 28, 1)

# Encode labels to one-hot vectors (ex: 2 -> [0,0,1,0,0,0,0,0,0,0])
Y_train = to_categorical(Y_train, num_classes = 10)
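If you want to confirm that the reshaping and encoding did what we expect, a quick sanity check like the one below (my addition, not part of the original pipeline) can help:

# Optional sanity check on the preprocessing above
print(X_train.shape) # (42000, 28, 28, 1) - each image is a 28x28x1 array
print(Y_train.shape) # (42000, 10) - each label is now a one-hot vector
print(Y_train[0])    # a vector of zeros with a single 1 at the true digit's index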
We will also take a portion of the training data set out for validation purposes. We will call it the validation set. The validation set is usually used to estimate how well the model has been trained. Basically, it is like a mock exam for the machine before the real test with the testing set, so that teachers will know which kid (or in our case, algorithm) is the brightest among the cohort.
We will also print a sample image below. It is a simple image of a handwritten digit.
# Set the random seed
random_seed = 2

# Split the train and the validation set for the fitting
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state = random_seed)

# Some examples
g = plt.imshow(X_train[0][:,:,0])
plt.show()
Now comes the main part. We will build a five-layer convolutional neural network (CNN) - a class of artificial neural network that takes after the human brain - to perform digit recognition.
Here, we will add one layer at a time, starting from the input layer (remember the picture above). It is like peeling an onion, but instead of peeling, you are sticking each layer together to form one (I know it is not a good analogy).
Then, we will add a convolutional (Conv2D) layer as the first layer, with 32 filters to filter the number image. The second layer also has 32 filters, and the last two convolutional layers have 64 filters each. Each filter transforms a part of the image for the machine to memorize.
ATTN NERDS: The second important layer in a CNN is the pooling (MaxPool2D) layer, which has the machine look at neighboring pixels in a picture and pick the maximal value, so that it has more clues to classify which image contains which number. This technique can be used to reduce overfitting (where the machine learns the training data too closely and is unable to apply what it learned to the actual test) and computational cost.
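To make the pooling idea concrete, here is a minimal NumPy sketch (separate from the model code, using a made-up feature map) of what MaxPool2D with pool_size=(2,2) does:

# A made-up 4x4 feature map to illustrate 2x2 max pooling
feature_map = np.array([[1, 3, 2, 0],
                        [4, 2, 1, 1],
                        [0, 1, 5, 2],
                        [2, 0, 3, 4]])

# Take the maximum of each non-overlapping 2x2 block,
# halving the height and width of the map
pooled = feature_map.reshape(2, 2, 2, 2).max(axis = (1, 3))
print(pooled)
# [[4 2]
#  [2 5]]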
ATTN NERDS: We will then use the relu activation function to add non-linearity to the network, and use the Flatten layer to convert the final feature maps into one single 1D vector. At the end, two fully-connected (Dense) layers are added as an artificial neural network (ANN) classifier with the softmax activation function to compute the probability distribution over the classes.
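For the curious, here is a small NumPy sketch (with made-up numbers) of what relu and softmax do to a vector of raw scores:

# Hypothetical raw scores (logits) for one image from a final layer of 10 units
logits = np.array([-1.2, 0.5, 3.1, 0.0, -0.3, 1.8, -2.0, 0.9, 0.2, -0.7])

# relu zeroes out negative values (used in the hidden layers)
relu_out = np.maximum(0, logits)

# softmax turns the 10 scores into probabilities that sum to 1 (used at the output)
probs = np.exp(logits) / np.sum(np.exp(logits))
print(probs.round(3)) # the largest entry (index 2 here) is the predicted digit
print(probs.sum())    # 1.0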
# My CNN architecture is In -> [[Conv2D->relu]*2 -> MaxPool2D -> Dropout]*2 -> Flatten -> Dense -> Dropout -> Out
model = Sequential()

model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same',
                 activation = 'relu', input_shape = (28,28,1)))
model.add(Conv2D(filters = 32, kernel_size = (5,5), padding = 'Same',
                 activation = 'relu'))
model.add(MaxPool2D(pool_size = (2,2)))
model.add(Dropout(0.25))

model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same',
                 activation = 'relu'))
model.add(Conv2D(filters = 64, kernel_size = (3,3), padding = 'Same',
                 activation = 'relu'))
model.add(MaxPool2D(pool_size = (2,2), strides = (2,2)))
model.add(Dropout(0.25))

model.add(Flatten())
model.add(Dense(256, activation = "relu"))
model.add(Dropout(0.5))
model.add(Dense(10, activation = "softmax"))
# Summarize the model
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                   Output Shape              Param #
=================================================================
conv2d (Conv2D)                (None, 28, 28, 32)        832
conv2d_1 (Conv2D)              (None, 28, 28, 32)        25632
max_pooling2d (MaxPooling2D)   (None, 14, 14, 32)        0
dropout (Dropout)              (None, 14, 14, 32)        0
conv2d_2 (Conv2D)              (None, 14, 14, 64)        18496
conv2d_3 (Conv2D)              (None, 14, 14, 64)        36928
max_pooling2d_1 (MaxPooling2D) (None, 7, 7, 64)          0
dropout_1 (Dropout)            (None, 7, 7, 64)          0
flatten (Flatten)              (None, 3136)              0
dense (Dense)                  (None, 256)               803072
dropout_2 (Dropout)            (None, 256)               0
dense_1 (Dense)                (None, 10)                2570
=================================================================
Total params: 887,530
Trainable params: 887,530
Non-trainable params: 0
_________________________________________________________________
After the model is constructed, we need to set up how we will optimize the algorithm as well as how we will evaluate its results. We define the loss function, which measures how poorly our model performs on images with known labels, with the categorical_crossentropy method.
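For a rough sense of what this loss computes, here is a hand-worked sketch (with made-up probabilities) of categorical cross-entropy for a single image:

# Categorical cross-entropy by hand for one hypothetical image
y_true = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0]) # the image is a "2"
y_pred = np.array([0.01, 0.02, 0.91, 0.01, 0.01,
                   0.01, 0.01, 0.01, 0.005, 0.005]) # model's predicted probabilities

# Only the probability assigned to the true class contributes
loss = -np.sum(y_true * np.log(y_pred))
print(loss) # ~0.094; a confident correct prediction gives a loss near 0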
Here, we are setting epochs to 2 to have the machine go through the entire data set exactly TWO times. It's like ordering your kids to do their homework, then erasing it all and having them do it again entirely after they are done (I know, it's not good parenting). We also set batch_size to 86, which means 86 images will be presented to the machine at a time. We cannot have the machine read 40,000 images at once - or you could, if you have a god-level CPU (maybe NASA or MIT can provide you one).
We could have the machine go through the data more than two times (say, 30) for 99% accuracy, but that would take an hour or two to train. For demonstration purposes, I will request only two training rounds.
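As a quick sanity check on these numbers (a back-of-the-envelope calculation on my part): after the 90/10 validation split earlier, we have 37,800 training images, so one epoch at a batch size of 86 is 439 batches, matching the 439/439 steps in the training log further down.

# One epoch = the number of batches needed to cover the training set once
steps_per_epoch = 37800 // 86 # 42,000 images minus the 10% validation split
print(steps_per_epoch)        # 439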
#%% Set the optimizer and annealer
# Define the optimizer
optimizer = RMSprop(learning_rate = 0.001, rho = 0.9, epsilon = 1e-08, decay = 0.0)

# Compile the model
model.compile(optimizer = optimizer, loss = "categorical_crossentropy", metrics = ["accuracy"])

# Set a learning rate annealer
learning_rate_reduction = ReduceLROnPlateau(monitor = 'val_loss',
                                            patience = 3,
                                            verbose = 1,
                                            factor = 0.5,
                                            min_lr = 0.00001)

epochs = 2 # Turn epochs to 30 to get 0.9967 accuracy
batch_size = 86
To avoid overfitting, which makes the machine learn more than it needs to, we will augment the image data by altering the training images a bit to reproduce the variations that occur when someone writes a digit. You know, when you want your kids to learn well, you make the homework a little bit challenging.
By applying just a couple of these transformations to our training data, we create a more robust model and get higher accuracy in the result.
# With data augmentation to prevent overfitting (accuracy 0.99286)
datagen = ImageDataGenerator(
        featurewise_center = False,            # set input mean to 0 over the dataset
        samplewise_center = False,             # set each sample mean to 0
        featurewise_std_normalization = False, # divide inputs by std of the dataset
        samplewise_std_normalization = False,  # divide each input by its std
        zca_whitening = False,                 # apply ZCA whitening
        rotation_range = 10,                   # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1,                      # randomly zoom image
        width_shift_range = 0.1,               # randomly shift images horizontally (fraction of total width)
        height_shift_range = 0.1,              # randomly shift images vertically (fraction of total height)
        horizontal_flip = False,               # randomly flip images
        vertical_flip = False)                 # randomly flip images

datagen.fit(X_train)
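If you want to see what the augmentation actually produces, a quick optional preview like this (my addition, not in the original notebook) shows a few transformed copies of one training image:

# Preview five augmented versions of the first training image
sample = X_train[:1] # keep the batch dimension: shape (1, 28, 28, 1)
fig, axes = plt.subplots(1, 5, figsize = (10, 2))
for ax_i, batch in zip(axes, datagen.flow(sample, batch_size = 1)):
    ax_i.imshow(batch[0][:, :, 0], cmap = 'gray')
    ax_i.axis('off')
plt.show()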
# Fit the model
history = model.fit(datagen.flow(X_train, Y_train, batch_size = batch_size),
                    epochs = epochs, validation_data = (X_val, Y_val),
                    verbose = 2, steps_per_epoch = X_train.shape[0] // batch_size,
                    callbacks = [learning_rate_reduction])
Epoch 1/2
439/439 - 127s - loss: 0.4209 - accuracy: 0.8659 - val_loss: 0.0737 - val_accuracy: 0.9774 - lr: 0.0010 - 127s/epoch - 290ms/step
Epoch 2/2
439/439 - 130s - loss: 0.1301 - accuracy: 0.9622 - val_loss: 0.0418 - val_accuracy: 0.9862 - lr: 0.0010 - 130s/epoch - 297ms/step
The loss (or error) is going down while the accuracy is going up. With more training rounds (epochs), the accuracy would go even higher.

# Plot the loss and accuracy curves for training and validation
fig, ax = plt.subplots(2, 1)
ax[0].plot(history.history['loss'], color = 'b', label = "Training loss")
ax[0].plot(history.history['val_loss'], color = 'r', label = "Validation loss", axes = ax[0])
legend = ax[0].legend(loc = 'best', shadow = True)

ax[1].plot(history.history['accuracy'], color = 'b', label = "Training accuracy")
ax[1].plot(history.history['val_accuracy'], color = 'r', label = "Validation accuracy")
legend = ax[1].legend(loc = 'best', shadow = True)
plt.show()
#%% Confusion matrix
def plot_confusion_matrix(cm, classes,
                          normalize = False,
                          title = 'Confusion matrix',
                          cmap = plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    plt.imshow(cm, interpolation = 'nearest', cmap = cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation = 45)
    plt.yticks(tick_marks, classes)

    if normalize:
        cm = cm.astype('float') / cm.sum(axis = 1)[:, np.newaxis]

    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment = "center",
                 color = "white" if cm[i, j] > thresh else "black")

    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')
# Predict the values from the validation dataset
Y_pred = model.predict(X_val)
# Convert prediction probabilities to class labels
Y_pred_classes = np.argmax(Y_pred, axis = 1)
# Convert one-hot validation labels back to class labels
Y_true = np.argmax(Y_val, axis = 1)
# Compute the confusion matrix
confusion_mtx = confusion_matrix(Y_true, Y_pred_classes)
# Plot the confusion matrix
plot_confusion_matrix(confusion_mtx, classes = range(10))
plt.show()
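The confusion matrix can also be complemented with per-class precision and recall. This is an optional addition on my part, using scikit-learn's classification_report:

# Per-class precision, recall, and F1 score on the validation set
from sklearn.metrics import classification_report
print(classification_report(Y_true, Y_pred_classes, digits = 3))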
#%% Check error results
# Display some error results

# Errors are differences between predicted labels and true labels
errors = (Y_pred_classes - Y_true != 0)

Y_pred_classes_errors = Y_pred_classes[errors]
Y_pred_errors = Y_pred[errors]
Y_true_errors = Y_true[errors]
X_val_errors = X_val[errors]

def display_errors(errors_index, img_errors, pred_errors, obs_errors):
    """ This function shows 6 images with their predicted and real labels"""
    n = 0
    nrows = 2
    ncols = 3
    fig, ax = plt.subplots(nrows, ncols, sharex = True, sharey = True, figsize = (10,10))
    for row in range(nrows):
        for col in range(ncols):
            error = errors_index[n]
            ax[row, col].imshow((img_errors[error]).reshape((28,28)))
            ax[row, col].set_title("Predicted label: {}\nTrue label: {}".format(pred_errors[error], obs_errors[error]))
            n += 1
# Probabilities of the wrongly predicted numbers
Y_pred_errors_prob = np.max(Y_pred_errors, axis = 1)

# Predicted probabilities of the true values in the error set
true_prob_errors = np.diagonal(np.take(Y_pred_errors, Y_true_errors, axis = 1))

# Difference between the probability of the predicted label and the true label
delta_pred_true_errors = Y_pred_errors_prob - true_prob_errors

# Sorted list of the delta prob errors
sorted_delta_errors = np.argsort(delta_pred_true_errors)

# Top 6 errors
most_important_errors = sorted_delta_errors[-6:]

# Show the top 6 errors
display_errors(most_important_errors, X_val_errors, Y_pred_classes_errors, Y_true_errors)
plt.show()
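As a final step, the trained model can label the 28,000 unlabeled test images. Below is a sketch of how one might export the predictions to a CSV file; the file name and column headers ("ImageId", "Label") are my assumptions, following the common Kaggle submission format:

# Predict digits for the test set and save them to a CSV file
results = model.predict(test)
results = np.argmax(results, axis = 1)

submission = pd.DataFrame({"ImageId": range(1, len(results) + 1),
                           "Label": results})
submission.to_csv("submission.csv", index = False)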
Deep learning is a very powerful approach for developing advanced tools that go beyond the capability of traditional machine learning in analyzing unstructured data such as images, videos, and sounds - including music. With robust tools developed from this algorithm, we could significantly reduce the workload of human labor.
The thing is, there are numerous downsides to this algorithm as well. We need a lot of data to develop an accurate model - and I mean a lot, as in ten thousand samples or more. Even with that much data, the accuracy could still drop as we, humans, change our way of living (like how we changed from watching cable TV to Netflix); then, we would need more data to retrain the model, or even a new machine architecture to predict the new features we are interested in.
The explainability (the ability to be explained) of the model is clear as mud (TL: it is NOT clear). I mean, yes, we know that the machine can learn to classify number images, but we don't actually know the step-by-step procedure of how it works on the inside. For typical machine learning tasks such as logistic regression, there is a mathematical formula or two for those who are curious about what makes the machine tick, but that doesn't seem to be the case for deep learning. See explainable AI for how experts remedy this problem.
The explainability drawback is especially critical for us academics. While we seek results, we also care to explain how we obtained those results. That is why we tend to receive a lot of pushback when we use complex models that we cannot explain; for this reason, a simpler model is preferred if it can accomplish the task with similar, if not the same, performance. Explainability is usually (a little bit) preferred over accuracy in our case.
Nevertheless, it is also important to know (roughly) how the algorithm works, so that we are aware of what makes Facebook able to automatically tag our friends or show us advertisements about the things we talked about, and what makes Shazam able to identify music and movies - and so we can find the right tool should we need to accomplish similar tasks. That is why I wrote this entry to share what I know about this topic. Thank you very much for reading. Have a good one!