Models That Shaped Deep Learning: 2015-2016

ResNet

As deep neural networks are both time-consuming to train and difficult to optimize at great depth, a team at Microsoft Research introduced a residual learning framework to ease the training of networks that are substantially deeper than those used previously. This research was published in the 2015 paper titled Deep Residual Learning for Image Recognition. And so, the famous ResNet (short for “Residual Network”) was born.

When training deep networks, there comes a point where an increase in depth causes accuracy to saturate, then degrade rapidly. This is called the “degradation problem.” This highlights that not all neural network architectures are equally easy to optimize.

ResNet uses a technique called “residual mapping” to combat this issue. Instead of hoping that every few stacked layers directly fit a desired underlying mapping, the Residual Network explicitly lets these layers fit a residual mapping. Below is the building block of a Residual network.

// Building block of a Residual Network

The formulation F(x) + x can be realized by feedforward neural networks with shortcut connections, where F(x) is the residual mapping learned by the stacked layers and x is the input carried over by the identity shortcut.
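To make the shortcut concrete, below is a minimal sketch of such a residual block, assuming PyTorch; the layer sizes and the BatchNorm/ReLU placement are illustrative rather than taken verbatim from the paper.

import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    # Two stacked 3x3 convolutions learn the residual F(x); the shortcut adds x back.
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))  # F(x)
        return self.relu(residual + x)  # F(x) + x via the identity shortcut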

ResNets address these problems: they are easy to optimize and gain accuracy from increased depth, producing results better than earlier networks. ResNet was first trained and evaluated on ImageNet, which contains over 1.2 million training images belonging to 1,000 different classes.

ResNet Architecture

Compared to conventional neural network architectures, ResNets are relatively easy to understand. Below is an image of a VGG network, a plain 34-layer network, and a 34-layer residual network. In the plain network, for the same output feature map size, the layers have the same number of filters; if the feature map size is halved, the number of filters is doubled so as to preserve the time complexity per layer.

// ResNet Architecture Visualization

Meanwhile, as we can see, the 34-layer network has far fewer filters and lower complexity than VGG. Shortcut connections are added to turn the plain network into its residual counterpart. When the dimensions match, the shortcut performs identity mapping; when the dimensions increase, the identity shortcut is padded with extra zero entries, which introduces no additional parameters. Alternatively, a projection shortcut, written as F(x, {Wᵢ}) + Wₛx, uses 1 × 1 convolutions (Wₛ) to match dimensions.

// Table of ResNet Architectures

Each ResNet block is either two layers deep (used in smaller networks like ResNet-18 and ResNet-34) or three layers deep (the bottleneck blocks used in ResNet-50, 101, and 152).
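For reference, here is a rough sketch of the three-layer bottleneck block together with the projection shortcut described above, again assuming PyTorch; the channel arguments are placeholders, not the exact values from the paper.

import torch.nn as nn

class BottleneckBlock(nn.Module):
    # 1x1 -> 3x3 -> 1x1 convolutions; a 1x1 projection shortcut (W_s) matches dimensions.
    def __init__(self, in_channels, mid_channels, out_channels, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, mid_channels, 1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, mid_channels, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_channels), nn.ReLU(inplace=True),
            nn.Conv2d(mid_channels, out_channels, 1, bias=False),
            nn.BatchNorm2d(out_channels),
        )
        if stride != 1 or in_channels != out_channels:
            # Projection shortcut: W_s x, realized by a strided 1x1 convolution.
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )
        else:
            self.shortcut = nn.Identity()  # identity shortcut, no extra parameters
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.shortcut(x))  # F(x, {W_i}) + W_s x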

ResNet Training and Results

The samples from the ImageNet dataset are cropped to 224 × 224 and normalized by per-pixel mean subtraction. Stochastic gradient descent is used for optimization with a mini-batch size of 256. The learning rate starts at 0.1 and is divided by 10 when the error plateaus, and the models are trained for up to 60 × 10⁴ iterations. The weight decay and momentum are set to 0.0001 and 0.9 respectively. Dropout layers are not used.
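These settings translate directly into code. The snippet below is a hedged PyTorch sketch of the optimizer and learning-rate schedule; the tiny placeholder model merely stands in for a ResNet built from the blocks shown earlier.

import torch
import torch.nn as nn

model = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.ReLU())  # stand-in for a ResNet

optimizer = torch.optim.SGD(
    model.parameters(),
    lr=0.1,             # initial learning rate of 0.1
    momentum=0.9,       # momentum of 0.9
    weight_decay=1e-4,  # weight decay of 0.0001
)
# Divide the learning rate by 10 whenever the validation error plateaus.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.1)

# Inside the training loop (mini-batch size 256):
#   loss.backward(); optimizer.step(); scheduler.step(validation_error)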

ResNet performs extremely well with deeper architectures. Below is an image showing the error rates of 18-layer and 34-layer networks. The graph on the left shows plain networks, while the graph on the right shows their ResNet equivalents. Thin curves represent the training error, and bold curves represent the validation error.

// Graphs of ResNet Training

Below is the table showing the Top-1 error (%, 10-crop testing) on ImageNet validation.

// ImageNet Validation Error Table

ResNet has played a significant role in defining the field of deep learning as we know it today.

Wide ResNet

The Wide Residual Network is a more recent improvement on the original Deep Residual Networks. Rather than relying on increasing the depth of a network to improve its accuracy, the authors showed that a network could be made shallower and wider without compromising its performance. This approach was presented in the paper Wide Residual Networks, published in 2016.

Wide ResNet Architecture

A Wide ResNet has a group of ResNet blocks stacked together, where each ResNet block follows the BatchNormalization-ReLU-Conv structure. This structure is depicted as follows:

// Wide ResNet Architecture Diagram
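A minimal sketch of such a block is shown below, assuming PyTorch. The pre-activation BatchNorm-ReLU-Conv order, the optional dropout between the two convolutions, and the widening factor follow the description above; the exact channel counts are illustrative.

import torch.nn as nn

class WideResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1, dropout=0.3):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_channels)
        self.conv1 = nn.Conv2d(in_channels, out_channels, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.dropout = nn.Dropout(dropout)
        self.conv2 = nn.Conv2d(out_channels, out_channels, 3, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = (
            nn.Conv2d(in_channels, out_channels, 1, stride=stride, bias=False)
            if stride != 1 or in_channels != out_channels
            else nn.Identity()
        )

    def forward(self, x):
        out = self.conv1(self.relu(self.bn1(x)))          # BN-ReLU-Conv
        out = self.conv2(self.dropout(self.relu(self.bn2(out))))
        return out + self.shortcut(x)

# Widening: a WRN multiplies the base channel counts (e.g. 16, 32, 64) by a factor k,
# so WRN-28-10 uses 160, 320, and 640 channels across its three groups.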

Wide ResNet Training and Results

Wide ResNet was trained on CIFAR-10. The following configuration choices resulted in the lowest error rates:

  • Convolution type: B(3, 3), i.e., a block with two 3 × 3 convolutions
  • Convolution layers per residual block: 2
  • Width of residual blocks: a depth of 28 with a widening factor of 10 (WRN-28-10) gave the lowest error
  • Dropout: adding dropout between the convolutions further reduced the error rate

The following table compares the complexity and performance of Wide ResNet with several other models, including the original ResNet, on both CIFAR-10 and CIFAR-100:

// CIFAR-10 and CIFAR-100 Results Table

Inception v3

Inception v3 mainly focuses on consuming less computational power by modifying the previous Inception architectures. This idea was proposed in the paper Rethinking the Inception Architecture for Computer Vision, published in 2015. It was co-authored by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, and Jonathon Shlens.

In comparison to VGGNet, Inception Networks (GoogLeNet/Inception v1) have proved to be more computationally efficient, both in terms of the number of parameters and the cost incurred (memory and other resources). However, if any changes are made to an Inception network, care needs to be taken to ensure that these computational advantages are not lost. Adapting an Inception network to different use cases thus becomes a problem, due to the uncertainty about the new network’s efficiency.

In an Inception v3 model, several techniques for optimizing the network have been suggested to loosen the constraints for easier model adaptation. The techniques include:

  • Factorized convolutions
  • Regularization
  • Dimension reduction
  • Parallelized computations

Inception v3 Architecture

The architecture of an Inception v3 network is progressively built, step-by-step, as explained below:

  1. Factorized Convolutions: Factorizing convolutions reduces the number of parameters in the network and, with it, the computational cost, while keeping a check on the network’s efficiency.
  2. Smaller Convolutions: Replacing bigger convolutions with smaller convolutions leads to faster training. For example, a 5 × 5 filter with 25 parameters can be replaced by two 3 × 3 filters, which have only 18 parameters in total.
  3. Asymmetric Convolutions: A 3 × 3 convolution can be replaced by a 1 × 3 convolution followed by a 3 × 1 convolution, reducing the parameters from 9 to 6 (both factorizations are sketched in code after the visualization below).
  4. Auxiliary Classifier: An auxiliary classifier is a small CNN inserted between layers during training. The loss it incurs is added to the main network loss.
  5. Grid Size Reduction: Grid size reduction is usually done with pooling operations. To avoid a representational bottleneck while keeping the computational cost in check, the paper proposes a more efficient reduction block that applies strided convolution and pooling in parallel and concatenates their outputs.

// Visualization of Inception v3 Techniques
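The factorizations from steps 2 and 3 can be written down in a few lines. The sketch below assumes PyTorch, and the channel count of 64 is purely illustrative.

import torch.nn as nn

channels = 64

# Two stacked 3x3 convolutions cover the same 5x5 receptive field with
# 2 * 3 * 3 = 18 weights per filter instead of 5 * 5 = 25.
factorized_5x5 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
)

# Asymmetric factorization: a 1x3 convolution followed by a 3x1 convolution
# (6 weights per filter instead of 9).
asymmetric_3x3 = nn.Sequential(
    nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1)),
    nn.ReLU(inplace=True),
    nn.Conv2d(channels, channels, kernel_size=(3, 1), padding=(1, 0)),
    nn.ReLU(inplace=True),
)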

Inception v3 Training and Results

Inception v3 was trained on ImageNet and compared with other contemporary models. As shown in the table below, when augmented with an auxiliary classifier, factorization of convolutions, RMSProp, and Label Smoothing, Inception v3 achieves the lowest error rates compared to its contemporaries.

// Inception v3 Training Results Table
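Label smoothing, one of the regularization tweaks listed above, is simple enough to sketch. The function below is an illustrative PyTorch-style implementation, not the paper’s code; epsilon = 0.1 is the value commonly used with Inception v3.

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, targets, num_classes=1000, epsilon=0.1):
    # Cross-entropy against a target distribution that mixes the one-hot label
    # with a uniform distribution over all classes.
    log_probs = F.log_softmax(logits, dim=-1)
    one_hot = F.one_hot(targets, num_classes).float()
    smoothed = (1.0 - epsilon) * one_hot + epsilon / num_classes
    return -(smoothed * log_probs).sum(dim=-1).mean()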

SqueezeNet

SqueezeNet is a smaller network designed as a more compact replacement for AlexNet. It has almost 50x fewer parameters than AlexNet, yet it performs 3x faster. This architecture was proposed by researchers at DeepScale, The University of California, Berkeley, and Stanford University in 2016. It was first published in their paper titled SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.

Below are the key ideas behind SqueezeNet:

  • Use 1 × 1 filters instead of 3 × 3
  • Decrease the number of input channels to 3 × 3 filters
  • Downsample late in the network so that convolution layers have large activation maps

SqueezeNet Architecture and Results

The SqueezeNet architecture is composed of “squeeze” and “expand” layers. A squeeze convolutional layer has only 1 × 1 filters. These are fed into an expand layer that has a mix of 1 × 1 and 3 × 3 convolution filters. This is shown below:

// SqueezeNet Fire Module Visualization

The authors of the paper use the term “fire module” to describe a squeeze layer and an expand layer together. An input image is first sent into a standalone convolutional layer. This layer is followed by eight fire modules, named “fire2” through “fire9”.
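A minimal sketch of a fire module, assuming PyTorch, is given below; the channel counts in the usage example follow the fire2 configuration reported in the paper.

import torch
import torch.nn as nn

class FireModule(nn.Module):
    def __init__(self, in_channels, squeeze_channels, expand1x1_channels, expand3x3_channels):
        super().__init__()
        self.squeeze = nn.Conv2d(in_channels, squeeze_channels, kernel_size=1)
        self.expand1x1 = nn.Conv2d(squeeze_channels, expand1x1_channels, kernel_size=1)
        self.expand3x3 = nn.Conv2d(squeeze_channels, expand3x3_channels, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        x = self.relu(self.squeeze(x))            # squeeze: 1x1 filters only
        return torch.cat(
            [self.relu(self.expand1x1(x)),        # expand: 1x1 filters
             self.relu(self.expand3x3(x))],       # expand: 3x3 filters
            dim=1,
        )

# Example: fire2 squeezes 96 input channels down to 16, then expands to 64 + 64.
fire2 = FireModule(96, 16, 64, 64)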

Below is an image showing how SqueezeNet compares with the original AlexNet:

// SqueezeNet vs AlexNet Comparison

SqueezeNet makes the deployment process easier due to its small size. Initially, this network was implemented in Caffe, but the model has since gained popularity and has been adopted on many different platforms.

Conclusion

The models discussed here—ResNet, Wide ResNet, Inception v3, and SqueezeNet—played a significant role in shaping the field of deep learning as we know it today. Each brought forward innovative ideas that improved both performance and computational efficiency, pushing the boundaries of what neural networks can achieve.
