Large CNNs...a brief on their evolution

The evolution of CNNs is a noteworthy story. The image-classification algorithms of yesteryear could not reliably find the correct label even among their five best guesses, with top-5 error rates higher than 25%. With the entrance of CNNs into this race, things took a different turn - a staggering reduction to 3.57% in just 4 years, with each year bringing a more evolved and more efficient CNN.

AlexNet was the 1st in the family of modern CNNs. It has 8 learned layers (5 convolutional and 3 fully connected), and its successor, ZFNet, is similarly built; the main differences are a smaller first-layer filter (7 X 7 instead of 11 X 11) and a smaller stride. VGGNet improved accuracy further by using the same small 3 X 3 filter throughout the network, yet complexity increased due to its huge number of parameters (~138M). Other factors that added to the layer complexity were computational cost, number of layers, and filter sizes. This led to the genesis of GoogLeNet, which deploys a smart way of reducing that complexity: the Inception module.
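The ~138M figure for VGGNet can be sanity-checked with a quick back-of-envelope calculation, assuming the standard VGG-16 configuration (13 convolutional layers of 3 X 3 filters followed by 3 fully connected layers, 224 X 224 input):

```python
# Back-of-envelope parameter count for VGG-16 (a sketch; the layer
# configuration below is the standard published one).

def conv_params(c_in, c_out, k=3):
    # weights + biases for a k x k convolution
    return k * k * c_in * c_out + c_out

def fc_params(n_in, n_out):
    # weights + biases for a fully connected layer
    return n_in * n_out + n_out

# (in_channels, out_channels) for each of VGG-16's 13 conv layers
convs = [(3, 64), (64, 64),
         (64, 128), (128, 128),
         (128, 256), (256, 256), (256, 256),
         (256, 512), (512, 512), (512, 512),
         (512, 512), (512, 512), (512, 512)]

conv_total = sum(conv_params(i, o) for i, o in convs)

# after five 2x2 poolings, a 224x224 input leaves a 7x7x512 feature map
fc_total = (fc_params(7 * 7 * 512, 4096)
            + fc_params(4096, 4096)
            + fc_params(4096, 1000))

total = conv_total + fc_total
print(f"conv: {conv_total:,}  fc: {fc_total:,}  total: {total:,}")
```

Notably, the fully connected layers alone account for the bulk of the total, which is what the later architectures attack.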

In an Inception module, the spatial dimensions (height and width) of the layers are preserved by:

  1. Using a 1 X 1 filter to reduce the depth of the output layer, with the added advantage of reduced computation
  2. If a higher-dimension filter (say 3 X 3) is used, choosing suitable padding (1) and stride (1) so that the output retains the input's spatial dimensions
  3. Pooling with a proper stride and padding (say, padding of 1 for a 3 X 3 max-pooling filter; the stride here is always 1)
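The three cases above all follow from the standard output-size formula for convolutions and pooling, out = floor((in + 2*pad - k) / stride) + 1. A small sketch (the 28 X 28 feature-map size is just an illustrative choice):

```python
# Output spatial size for a convolution or pooling operation:
# out = floor((in + 2*pad - k) / stride) + 1
def out_size(n, k, pad, stride):
    return (n + 2 * pad - k) // stride + 1

# 1x1 conv: spatial size is trivially preserved (only depth changes)
assert out_size(28, k=1, pad=0, stride=1) == 28
# 3x3 conv with padding 1 and stride 1 preserves a 28x28 map
assert out_size(28, k=3, pad=1, stride=1) == 28
# 3x3 max pooling with padding 1 and stride 1 also preserves it
assert out_size(28, k=3, pad=1, stride=1) == 28
# a 5x5 branch would need padding 2 to keep 28x28
assert out_size(28, k=5, pad=2, stride=1) == 28
print("all branches keep 28x28")
```

Because every branch produces the same height and width, their outputs can be concatenated along the depth axis, which is exactly what the Inception module does.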

The last CNN layer in a GoogLeNet network is an average-pooling layer that collapses each 7 X 7 feature map down to 1 X 1. This significantly reduces the number of parameters going into the 1st layer of the fully connected network, compared with VGGNet.
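The parameter saving from that pooling step is easy to quantify. A sketch comparing VGGNet's flatten-then-FC classifier with GoogLeNet's average-pool-then-FC classifier (feature-map sizes follow the published architectures: 7 X 7 X 512 for VGGNet, 7 X 7 X 1024 for GoogLeNet, 1000 ImageNet classes):

```python
def fc_params(n_in, n_out):
    # weights + biases for a fully connected layer
    return n_in * n_out + n_out

# VGGNet: flatten the final 7x7x512 volume into a 4096-unit FC layer
vgg_first_fc = fc_params(7 * 7 * 512, 4096)

# GoogLeNet: average-pool the 7x7x1024 volume down to 1x1x1024,
# then a single FC layer straight to the 1000 classes
googlenet_fc = fc_params(1024, 1000)

print(f"VGGNet first FC: {vgg_first_fc:,}")   # ~102.8M parameters
print(f"GoogLeNet FC:    {googlenet_fc:,}")   # ~1.0M parameters
```

Roughly a hundredfold reduction in the classifier head alone.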

The next improved CNN in line is ResNet. It can be seen as a deeper VGG-style network in which each stack of additional layers is given an identity shortcut: the stack's input is added, unchanged, to its output. These shortcuts preserve the gradients from running thin (vanishing) during back-propagation, and because the identity path carries no weights, they add no extra parameters to the network. The residual connections led to a significant reduction in error on all the image-related tasks – classification, localization, detection and segmentation (i.e. identifying the contours of the objects in an image).
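A minimal NumPy sketch of the residual idea: the block computes F(x) + x, so the identity path contributes no parameters and gives gradients a direct route back. (Toy fully connected "layers" stand in here for ResNet's convolutional stacks; the weights and sizes are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(x, w):
    # ReLU(x @ W): one toy weight layer
    return np.maximum(0.0, x @ w)

def residual_block(x, w1, w2):
    f = layer(x, w1) @ w2            # F(x): the two-layer transform
    return np.maximum(0.0, f + x)    # add the identity shortcut, then ReLU

d = 8
x = rng.normal(size=(1, d))
# near-zero weights, so the residual branch F(x) is almost zero
w1 = rng.normal(size=(d, d)) * 0.01
w2 = rng.normal(size=(d, d)) * 0.01

y = residual_block(x, w1, w2)
# with F(x) ~ 0 the block approximates the identity (up to the ReLU),
# which is why stacking more such blocks cannot easily hurt the network
print(np.allclose(y, np.maximum(0.0, x), atol=1e-2))
```

This is the intuition behind why adding residual blocks does not degrade a network the way adding plain layers can: the block can always fall back to (approximately) the identity mapping.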
