Hello there, visitor! This is the project page for my first TensorFlow model, a digit classifier. My awesome teammate Tigran Avetisyan and I worked on this project throughout the summer after my freshman year, hoping to sharpen our data and ML skills. We came up with a convolutional neural network that could classify digits correctly 99.6% of the time. Along the way, we implemented a custom layer and a custom callback which do some exciting stuff. I hope you enjoy the read; you can find links to the GitHub repository, the notebook, and the dataset we collected for this project at the bottom of this page.
As illustrated in the image below, MSXCN is composed of four parts: a Preprocessing block,
a Convolutional chain, a Singularity Extracting chain, and a Cognitive block.
The Preprocessing block prepares the data for consumption. During training, the data
passes through five consecutive layers: Reshape, RandomWidth, RandomTranslation, RandomZoom,
and Resizing. During evaluation or production, however, the data only passes through the Reshape layer.
The Reshape layer reshapes the 2D tensor of shape (n, 784) into a 4D tensor of shape (n, 28, 28, 1), n being
the number of observations per batch. The RandomWidth layer scales the image along the x-axis
by a random amount. The RandomTranslation layer translates the image to a random location.
The RandomZoom layer zooms the image by a random amount. The Resizing layer resizes the image
back to 28x28. The data is then sent off to be processed by the Convolutional and Singularity
Extracting chains.
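To make the Reshape step concrete, here is a minimal sketch in plain Python. It is not the Keras layer itself; it only mirrors the shape change described above, and the helper names reshape_batch and shape_of are hypothetical names chosen for this illustration.

```python
def reshape_batch(flat_batch, height=28, width=28):
    """Turn n flat rows of height*width values into nested lists of shape (n, height, width, 1)."""
    reshaped = []
    for row in flat_batch:
        if len(row) != height * width:
            raise ValueError(f"expected {height * width} values, got {len(row)}")
        # Each pixel becomes a one-element list: the trailing channel axis.
        image = [[[row[r * width + c]] for c in range(width)] for r in range(height)]
        reshaped.append(image)
    return reshaped


def shape_of(nested):
    """Report the shape of a regular nested list, outermost axis first."""
    shape = []
    while isinstance(nested, list):
        shape.append(len(nested))
        nested = nested[0]
    return tuple(shape)


# Two fake "images", each stored as a flat row of 784 pixel values.
batch = [list(range(784)) for _ in range(2)]
images = reshape_batch(batch)
print(shape_of(images))  # (2, 28, 28, 1)
```

The random augmentation layers are left out of this sketch on purpose, since they only run during training.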
The Convolutional Chain is designed to learn and detect patterns that correspond to each
digit. As illustrated below, it is built from 11 building blocks: 3 convolutional blocks, 3
batch normalization layers, 3 dropout layers, a global average pooling layer, and a redundant
yet useful flattening layer.
The dropout layer randomly discards some of the connections between layers during training,
thus countering overfitting. The batch normalization layer standardizes its inputs. The
standardization of inputs helps the model learn faster as it does not have to spend numerous
iterations adjusting each of the weights to account for extreme input values. The global
average pooling layer averages each feature map over its spatial dimensions into a single
value, which significantly reduces computational needs.
The flatten layer ensures that all output from the chain is flattened.
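The standardization and global average pooling steps described above can be sketched in plain Python. This is a simplified illustration, not the Keras layers: it omits the learned scale and shift that a real batch normalization layer also applies, and the epsilon value and the channels-first layout of the feature maps are assumptions made for readability.

```python
import math


def standardize(values, epsilon=1e-3):
    """Shift values to zero mean and scale them to (roughly) unit variance."""
    mean = sum(values) / len(values)
    variance = sum((v - mean) ** 2 for v in values) / len(values)
    # epsilon (an assumed value) avoids division by zero for constant inputs.
    return [(v - mean) / math.sqrt(variance + epsilon) for v in values]


def global_average_pool(feature_maps):
    """Collapse each 2D feature map into one number: its average."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


# One extreme value (100.0) no longer dominates after standardization.
activations = [1.0, 2.0, 3.0, 100.0]
normalized = standardize(activations)

# Two 2x2 feature maps become two numbers.
maps = [
    [[1.0, 3.0], [5.0, 7.0]],
    [[0.0, 0.0], [0.0, 4.0]],
]
pooled = global_average_pool(maps)
print(pooled)  # [4.0, 1.0]
```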
The Convolutional Block is a series of sequentially connected convolutional, batch
normalization, and max-pooling layers. During the instantiation of a block, we supply arguments
that determine which layers it generates and connects. The depth parameter sets how many
"Convolution-Batch Norm" pairs the block generates, with only the last convolutional layer left
without a batch norm partner. If the pool argument is True, it generates a max-pooling layer of
second degree; otherwise, it increases the strides of the last convolutional layer to 2. Below
is a convolutional block with depth 2 and pooling enabled.
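One possible reading of these rules is sketched below as a function that only lists the layers a block would contain. The function name plan_conv_block is hypothetical, and two interpretations are assumed rather than confirmed by the project code: that depth counts the convolutional layers (so depth 2 yields two convolutions and one batch norm), and that "second degree" means a pool size of 2.

```python
def plan_conv_block(depth, pool):
    """List the layers of a block under the assumptions stated above."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    layers = []
    for index in range(depth):
        is_last = index == depth - 1
        # Without pooling, the last convolution down-samples via strides of 2.
        strides = 2 if (is_last and not pool) else 1
        layers.append(f"Conv2D(strides={strides})")
        # Every convolution except the last gets a batch norm partner.
        if not is_last:
            layers.append("BatchNormalization")
    if pool:
        # Assumed meaning of "max-pooling layer of second degree".
        layers.append("MaxPooling2D(pool_size=2)")
    return layers


with_pool = plan_conv_block(depth=2, pool=True)
without_pool = plan_conv_block(depth=2, pool=False)
print(with_pool)
print(without_pool)
```

Either way the block halves the spatial resolution once: through the pooling layer when pool is True, or through the strided last convolution when it is False.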
The convolutional layer applies a convolution operation (basically a dot product between a
kernel and each local patch of the input) on the input matrix. The weights of the kernels are
learned during training, which allows the model to find useful repeating spatial patterns in
the inputs. The animation below shows how the kernel (dark 3x3 area) passes above each input
pixel (blue squares), applying a convolution operation to nearby pixels, thus computing the
output (green matrix).
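The sliding dot product can be written out in a few lines of plain Python. This is a sketch of the operation only, assuming a single channel, a stride of 1, and no padding (so the output is smaller than the input); as in most deep learning libraries, the kernel is not flipped. The hand-picked kernel stands in for weights that the model would learn during training.

```python
def convolve2d(image, kernel):
    """Slide a square kernel over the image and take a dot product at each position."""
    k = len(kernel)
    out_height = len(image) - k + 1
    out_width = len(image[0]) - k + 1
    output = []
    for r in range(out_height):
        row = []
        for c in range(out_width):
            # Dot product between the kernel and the k x k patch at (r, c).
            total = sum(
                image[r + i][c + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            row.append(total)
        output.append(row)
    return output


# A 5x5 image containing a vertical stroke, like the stem of a "1".
image = [
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
]
# A hand-written kernel that responds to vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]
result = convolve2d(image, vertical_edge)
print(result)  # [[3, 0, -3], [3, 0, -3], [3, 0, -3]]
```

The output is positive where the stroke sits on the right of the kernel window, negative where it sits on the left, and zero where it is centered.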
The max-pooling layer helps us reduce computational needs by down-sampling the input.
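As a small illustration of that down-sampling, here is a plain Python sketch of max-pooling over non-overlapping windows. The window size of 2 is an assumption for the example, and the function name max_pool2d is hypothetical.

```python
def max_pool2d(image, size=2):
    """Keep only the largest value in each non-overlapping size x size window."""
    pooled = []
    for r in range(0, len(image) - size + 1, size):
        row = []
        for c in range(0, len(image[0]) - size + 1, size):
            row.append(
                max(image[r + i][c + j] for i in range(size) for j in range(size))
            )
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 1, 2, 3],
    [1, 1, 4, 0],
]
pooled_map = max_pool2d(feature_map)
print(pooled_map)  # [[4, 8], [9, 4]]
```

The 4x4 input shrinks to 2x2, so every later layer has a quarter as many values to process.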
The Singularity eXtractor layer accentuates non-uniform feature localities that
a convolutional block might otherwise miss. An example of this would be the array [10,
10, 2, 10, 10], where the feature in the middle differs significantly from the rest. The
singularity extractor accentuates this difference and outputs an array that looks like
this: [0.13, 0.08, 0.94, 0.08, 0.13]. The formula
below defines the Singularity Extracting operation of kernel size 3x3.
The Cognitive Block, as shown below, consists of concatenate, dense, batch norm,
and dropout layers. The concatenate layer merges the outputs of the Convolutional and
Singularity Extracting chains. The dense layers are just a bunch of linear functions
with a touch of non-linearity (in this case, a leaky rectified linear unit, tanh, and
softmax).
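For reference, the three non-linearities named above can be sketched in plain Python. The leaky ReLU slope alpha used here is an assumed value for illustration; the value used in the model is not stated on this page.

```python
import math


def leaky_relu(x, alpha=0.3):
    """Pass positive values through; scale negative values by a small slope."""
    return x if x > 0 else alpha * x


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    largest = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(v - largest) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(leaky_relu(3.0), leaky_relu(-2.0))  # positive kept, negative shrunk
print(math.tanh(0.0))                     # tanh squashes values into (-1, 1)

# The final softmax layer turns one score per digit into class probabilities.
logits = [2.0, 1.0, 0.1]
probabilities = softmax(logits)
predicted_class = probabilities.index(max(probabilities))
```

In a classifier like this one, the softmax output has one entry per digit, and the largest entry is taken as the prediction.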
Two datasets were used to train and validate the model, but only one to test it. The
larger dataset was the MNIST digits dataset containing 70,000 handwritten digits
in total. Of these, 28,000 observations were used to test; the rest were merged with the
second dataset. The second dataset is the Digits Mini Dataset containing 5,500 digits
drawn on the canvas you saw before (the black square). We collected and published this
dataset; it is available for free on Kaggle (link below). The joint dataset was then
split into 90% training and 10% validation sets.
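Working through the numbers above gives the approximate size of each set. The totals come straight from the text; the exact training and validation counts are an assumption, since they depend on how the split rounds.

```python
mnist_total = 70_000   # all MNIST digits
mnist_test = 28_000    # held out for testing
digits_mini = 5_500    # Digits Mini Dataset

# The remaining MNIST digits are merged with the Digits Mini Dataset.
joint_size = (mnist_total - mnist_test) + digits_mini

# 90% / 10% split of the joint dataset (integer arithmetic, assumed rounding).
train_size = joint_size * 90 // 100
validation_size = joint_size - train_size

print(joint_size, train_size, validation_size)  # 47500 42750 4750
```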
The model was optimized using an RMSprop optimizer. A ReduceLROnPlateau callback
was used to help the model converge faster by dividing the learning rate by 3
after every 3 epochs of no improvement. A ModelCheckpoint callback was used to save the
model every time it reached a new extremum. The custom-written GateOfLearning callback
was used to kick the model out of local extrema, hoping it would converge to a better
one. In the illustration below, we can see how the model can sometimes get stuck in local
extrema. The goal of GateOfLearning is to kick the model into the air and hope it lands
in better extrema. This goal is often not met.
Moreover, GateOfLearning can cause underfitting or overfitting on many occasions. It can also
cause the learning rate to diverge to infinity (which is guarded against by raising an exception)
when the patience and factor values overpower the optimizer and ReduceLROnPlateau combined.
However, in the right hands and with a pint of luck, it might push the model to a global
extremum.
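The learning-rate reduction rule mentioned earlier (divide by 3 after 3 epochs of no improvement) can be simulated in plain Python. This is a simplified sketch of the idea, not the Keras callback: it ignores options such as a minimum improvement threshold or a cooldown period, and the loss values and starting learning rate are made up for the example. In Keras terms, the rule would roughly correspond to factor=1/3 and patience=3. GateOfLearning is not sketched here because its exact rule is not described on this page.

```python
def simulate_reduce_on_plateau(losses, initial_lr, factor=1 / 3, patience=3):
    """Return the learning rate in effect after each epoch."""
    learning_rate = initial_lr
    best_loss = float("inf")
    epochs_without_improvement = 0
    history = []
    for loss in losses:
        if loss < best_loss:
            best_loss = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                # Plateau detected: shrink the learning rate and start counting again.
                learning_rate *= factor
                epochs_without_improvement = 0
        history.append(learning_rate)
    return history


# Made-up validation losses: improvement stalls for three epochs in a row.
losses = [0.9, 0.5, 0.5, 0.5, 0.5, 0.4]
initial_lr = 0.003
lr_history = simulate_reduce_on_plateau(losses, initial_lr)
print(lr_history)
```

The learning rate stays at its starting value for the first four epochs, drops to a third of it once the third stalled epoch is seen, and stays there when the loss improves again.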
Click here to visit the GitHub repository
Click here to visit the Kaggle notebook
Click here to see the Digits Mini Dataset that was collected for this project