VGGNet

Developer(s)	Visual Geometry Group
Initial release	September 4, 2014
Written in	Caffe
Type	Convolutional neural network; deep neural network
License	CC BY 4.0
Website	www.robots.ox.ac.uk/~vgg/research/very_deep/

The VGGNets are a series of convolutional neural networks (CNNs) developed by the Visual Geometry Group (VGG) at the University of Oxford.

The VGG family includes several configurations of different depths, each named "VGG" followed by its number of weight layers. The most common are VGG-16 (13 convolutional layers + 3 fully connected layers, 138M parameters) and VGG-19 (16 + 3, 144M parameters).^[1]

The VGG family was widely applied across computer vision.^[2] An ensemble of VGGNets achieved state-of-the-art results in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2014.^[1]^[3] VGG was used as a baseline in the ResNet paper for image classification,^[4] as the network in the Fast Region-based CNN for object detection, and as a base network in neural style transfer.^[5]

The series was historically important as an early influential model designed by composing generic modules, whereas AlexNet (2012) was designed "from scratch". It was also instrumental in changing the standard convolutional kernels in CNNs from large (up to 11-by-11 in AlexNet) to just 3-by-3, a decision that was only revised in ConvNeXt (2022).^[6]^[7]

VGGNets were largely superseded by Inception, ResNet, and DenseNet. RepVGG (2021) is an updated version of the architecture.^[8]

Architecture

The key architectural principle of VGG models is the consistent use of small $3\times 3$ convolutional filters throughout the network. This contrasts with earlier CNN architectures that employed larger filters, such as $11\times 11$ in AlexNet.^[7]

For example, two stacked $3\times 3$ convolutions have the same receptive field as a single $5\times 5$ convolution, but the latter uses $25\cdot c^{2}$ parameters while the former uses $18\cdot c^{2}$ parameters (where $c$ is the number of channels). The original publication showed that deep and narrow CNNs significantly outperform their shallow and wide counterparts.^[7]
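The arithmetic above can be checked with a short sketch. The helper names are illustrative, bias terms are ignored to match the $c^{2}$ counts in the text, and every convolution is assumed to have stride 1 with $c$ input and $c$ output channels.

```python
def conv_weights(kernel_size: int, channels: int) -> int:
    """Weights in one k-by-k convolution with `channels` inputs and outputs (no bias)."""
    return kernel_size * kernel_size * channels * channels


def stacked_receptive_field(kernel_size: int, num_layers: int) -> int:
    """Side length of the receptive field of stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


c = 64  # illustrative channel count, not taken from the text
two_3x3 = 2 * conv_weights(3, c)   # 18 * c**2
one_5x5 = conv_weights(5, c)       # 25 * c**2

print(stacked_receptive_field(3, 2), stacked_receptive_field(5, 1))  # 5 5
print(two_3x3, one_5x5)  # 73728 102400
```

The same receptive-field formula gives 7 for three stacked $3\times 3$ layers, so the saving grows with the size of the kernel being replaced.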

The VGG series of models are deep neural networks composed of generic modules:

Convolutional modules: $3\times 3$ convolutional layers with stride 1, each followed by a ReLU activation.
Max-pooling layers: After some convolutional modules, a max-pooling layer with a $2\times 2$ filter and a stride of 2 downsamples the feature maps. It halves both width and height but keeps the number of channels.
Fully connected layers: Three fully connected layers at the end of the network, with sizes 4096-4096-1000. The last one has 1000 outputs, corresponding to the 1000 classes in ImageNet.
Softmax layer: A softmax layer outputs the probability distribution over the classes.
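As a rough illustration of how these modules change the tensor shape, the sketch below traces only the shapes, not the computation. It assumes a 224-by-224 RGB input and padding of 1 on every $3\times 3$ convolution (so convolutions keep the spatial size); neither assumption is stated in the text above, and the function name is hypothetical.

```python
from typing import Sequence, Tuple


def vgg_feature_shape(input_size: int,
                      block_widths: Sequence[int]) -> Tuple[int, int, int]:
    """Return (channels, height, width) after the convolutional part.

    Each entry of `block_widths` is the output width of one group of
    convolutional modules, which is followed by one 2x2 max-pool.
    """
    channels, size = 3, input_size
    for width in block_widths:
        channels = width  # stride-1 padded convolutions change channels only
        size //= 2        # 2x2 max-pool with stride 2 halves height and width
    return channels, size, size


shape = vgg_feature_shape(224, [64, 128, 256, 512, 512])
print(shape)                           # (512, 7, 7)
print(shape[0] * shape[1] * shape[2])  # 25088 inputs to the first 4096-unit layer
```

Under these assumptions the first fully connected layer sees a flattened vector of 25088 values, which is why most of the network's parameters sit in the fully connected layers.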

VGG-16 (13 convolutional layers + 3 fully connected layers) and VGG-19 (16 + 3) correspond to configurations D and E in the original paper.^[10]

As an example, the 16 convolutional layers of VGG-19 are structured as follows:

${\begin{aligned}&3\to 64\to 64&\xrightarrow{\text{downsample}}\\&64\to 128\to 128&\xrightarrow{\text{downsample}}\\&128\to 256\to 256\to 256\to 256&\xrightarrow{\text{downsample}}\\&256\to 512\to 512\to 512\to 512&\xrightarrow{\text{downsample}}\\&512\to 512\to 512\to 512\to 512&\xrightarrow{\text{downsample}}\end{aligned}}$

where an arrow $c_{1}\to c_{2}$ denotes a $3\times 3$ convolution with $c_{1}$ input channels, $c_{2}$ output channels, and stride 1, followed by a ReLU activation, and $\xrightarrow{\text{downsample}}$ denotes $2\times 2$ max-pooling with stride 2.
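The VGG-19 layout can be written as plain data and expanded into an ordered layer list, as in this minimal sketch. The names `VGG19_BLOCKS` and `expand_layers` and the tuple format are illustrative choices, not part of any library.

```python
# Output widths of each convolution, grouped by block; every block ends in a max-pool.
VGG19_BLOCKS = [
    [64, 64],
    [128, 128],
    [256, 256, 256, 256],
    [512, 512, 512, 512],
    [512, 512, 512, 512],
]


def expand_layers(blocks, in_channels=3):
    """Expand block widths into (kind, in_channels, out_channels) tuples."""
    layers = []
    for widths in blocks:
        for out_channels in widths:
            layers.append(("conv3x3", in_channels, out_channels))
            in_channels = out_channels
        # Pooling keeps the channel count and only shrinks height and width.
        layers.append(("maxpool2x2", in_channels, in_channels))
    return layers


layers = expand_layers(VGG19_BLOCKS)
num_convs = sum(1 for kind, _, _ in layers if kind == "conv3x3")
num_pools = sum(1 for kind, _, _ in layers if kind == "maxpool2x2")
print(num_convs, num_pools)  # 16 5
print(layers[0], layers[2])  # ('conv3x3', 3, 64) ('maxpool2x2', 64, 64)
```

Describing the network as a list of block widths is what makes the family easy to vary: changing the number of entries per block yields the other configurations.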

Table of VGG models
Name	Number of convolutional layers	Number of fully connected layers	Parameter count
VGG-16	13	3	138M
VGG-19	16	3	144M
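The parameter counts in the table can be reproduced approximately with the sketch below. It relies on assumptions not stated in the table: a 224-by-224 input (giving a $7\times 7\times 512$ feature map before the fully connected layers), bias terms on every layer, and block depths of 2-2-3-3-3 for VGG-16 and 2-2-4-4-4 for VGG-19 with widths 64-128-256-512-512. The function name is hypothetical.

```python
def vgg_parameter_count(convs_per_block,
                        widths=(64, 128, 256, 512, 512),
                        input_size=224,
                        fc_sizes=(4096, 4096, 1000)):
    """Count weights and biases of a VGG-style network."""
    total = 0
    in_channels = 3
    size = input_size
    for num_convs, out_channels in zip(convs_per_block, widths):
        for _ in range(num_convs):
            # 3x3 kernel weights plus one bias per output channel
            total += 3 * 3 * in_channels * out_channels + out_channels
            in_channels = out_channels
        size //= 2  # 2x2 max-pool with stride 2 after each block
    in_features = in_channels * size * size
    for out_features in fc_sizes:
        total += in_features * out_features + out_features
        in_features = out_features
    return total


print(vgg_parameter_count((2, 2, 3, 3, 3)))  # 138357544, about 138M (VGG-16)
print(vgg_parameter_count((2, 2, 4, 4, 4)))  # 143667240, about 144M (VGG-19)
```

Under these assumptions the three extra convolutional layers of VGG-19 add only about 5.3M parameters, while the fully connected layers account for roughly 124M in both models.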

References

[1] Simonyan, Karen; Zisserman, Andrew (2015-04-10), Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv:1409.1556
[2] Dhillon, Anamika; Verma, Gyanendra K. (2020-06-01). "Convolutional neural network: a review of models, methodologies and applications to object detection". Progress in Artificial Intelligence. 9 (2): 85–112. doi:10.1007/s13748-019-00203-0. ISSN 2192-6360.
[3] "ILSVRC2014 Results". image-net.org. Retrieved 2024-09-06.
[4] He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2016). "Deep Residual Learning for Image Recognition". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR): 770–778. arXiv:1512.03385. Bibcode:2016cvpr.confE...1H. doi:10.1109/CVPR.2016.90. ISBN 978-1-4673-8851-1.
[5] Gatys, Leon A.; Ecker, Alexander S.; Bethge, Matthias (2016). Image Style Transfer Using Convolutional Neural Networks. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2414–2423.
[6] Liu, Zhuang; Mao, Hanzi; Wu, Chao-Yuan; Feichtenhofer, Christoph; Darrell, Trevor; Xie, Saining (2022). "A ConvNet for the 2020s". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 11976–11986. arXiv:2201.03545.
[7] Zhang, Aston; Lipton, Zachary; Li, Mu; Smola, Alexander J. (2024). "8.2. Networks Using Blocks (VGG)". Dive into deep learning. Cambridge University Press. ISBN 978-1-009-38943-3.
[8] Ding, Xiaohan; Zhang, Xiangyu; Ma, Ningning; Han, Jungong; Ding, Guiguang; Sun, Jian (2021). "RepVGG: Making VGG-Style ConvNets Great Again". Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR): 13733–13742. arXiv:2101.03697.
[9] Lin, Min; Chen, Qiang; Yan, Shuicheng (2013). "Network In Network". arXiv:1312.4400 [cs.NE].
[10] "Very Deep Convolutional Networks for Large-Scale Visual Recognition". Computer Vision group from the University of Oxford. Retrieved 2024-09-06.
