What is the benefit of using average pooling rather than max pooling?

    Why do we perform pooling? Answer: to reduce variance, reduce computational complexity (2×2 max or average pooling discards 75% of the activations), and extract low-level features from a neighbourhood. All of these descriptions, I think, suit max pooling better, don’t they?
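    A minimal NumPy sketch of that 75% figure. The `pool2x2` helper and the array values are made up for illustration and assume non-overlapping 2×2 windows on a single feature map:

```python
import numpy as np

def pool2x2(x, reduce_fn):
    """Non-overlapping 2x2 pooling over a 2D array with even height and width."""
    h, w = x.shape
    # Split the map into (h/2, w/2) blocks of size 2x2, then reduce each block.
    blocks = x.reshape(h // 2, 2, w // 2, 2)
    return reduce_fn(blocks, axis=(1, 3))

# Hypothetical 4x4 feature map (values chosen only for illustration).
feature_map = np.array([
    [1.0, 3.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 8.0],
    [5.0, 5.0, 1.0, 1.0],
    [5.0, 5.0, 1.0, 1.0],
])

max_pooled = pool2x2(feature_map, np.max)   # keeps the largest value per block
avg_pooled = pool2x2(feature_map, np.mean)  # keeps the block average

# 16 activations go in, 4 come out: 25% kept, 75% discarded.
kept_fraction = max_pooled.size / feature_map.size

print(max_pooled)
print(avg_pooled)
print(kept_fraction)
```

    Note how the top-right block (a single 8 among zeros) survives max pooling unchanged but is diluted to 2 by averaging, while the uniform bottom blocks come out the same either way.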

    Let’s have a look at this image:

    Max pooling extracts the most prominent features, such as edges, whereas average pooling extracts features more smoothly. For image data, you can see the difference. Although both are used for the same reason, I think max pooling is better at extracting extreme features. Average pooling sometimes fails to extract good features because it takes every value into account and outputs their average, which may or may not be important for tasks such as object detection.

    Note that average pooling takes every value into account and passes the result to the next layer, which means all values are used for feature mapping and for creating the output. That is a very generalized computation. If you don’t need all the inputs from the conv layer, average pooling will likely give you worse accuracy.

    But, of course, there are many classification tasks on GitHub where average pooling has been used and outperformed max pooling (although I’m not sure that is because of average pooling). So, again, it depends on the type of dataset (basically I’m talking about images and their pixel density).

    So, to answer your question, I don’t think average pooling has any significant advantage over max pooling. Maybe in some cases, where the variance within a pooling window is not significant, both kinds of pooling will give similar results. But in extreme cases, I would expect max pooling to give better results.

    Added: with dropout, none of this matters much, since dropout layers may zero out any single block. I have never seen a significant research paper comparing pooling layers, but there could be a few. You may search, read, and learn more. Best of luck.


    From my experience, average pooling prevents the network from learning image structures such as edges and textures. That said, a colleague of mine applied average pooling to image contrast learning and found that it worked significantly better than max pooling in his case.


    To answer why pooling layers are used, read this – Sripooja Mallam’s answer to Why is the pooling layer used in a convolution neural network?

    1. Pooling layers control the number of features the CNN model learns, which helps avoid overfitting.
    2. There are two types of pooling layers: the max pooling layer and the average pooling layer. As the names suggest, the max pooling layer picks the maximum values from the convolved feature maps, and the average pooling layer takes the average value of the features from the feature maps.
    3. This image explains when max pooling works and when average pooling works:

    I would add an additional argument: max pooling layers are worse at preserving localization. This is the flip side of achieving translation invariance, which they are (I assume) better at, and which is one of the main reasons they are used.

    When applying the max pooling operation, there is a very discrete type of operation taking place (is the feature found in the pooling neighborhood or not?).

    With the mean operation, things are softer. If you think about pooling over an activation that crosses several pooling neighborhoods, an average gives you a strong signal in the middle and a soft one at the edges, leaving you with more information about where the edges of the feature were localized (which is lost with max pooling).
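    To make the "strong in the middle, soft at the edges" point concrete, here is a small 1D sketch. The activation pattern and the window size of 4 are assumptions chosen so the feature straddles the window boundaries:

```python
import numpy as np

# Hypothetical 1D activation: a feature that is "on" from index 3 to index 8.
activation = np.zeros(12)
activation[3:9] = 1.0

# Non-overlapping pooling windows of size 4: indices 0-3, 4-7, 8-11.
windows = activation.reshape(-1, 4)

max_out = windows.max(axis=1)   # every window reports "feature present"
avg_out = windows.mean(axis=1)  # edge windows report only partial coverage

print(max_out)
print(avg_out)
```

    Max pooling returns the same value for all three windows, so you cannot tell that the feature only grazes the first and last one. The averages (0.25, 1.0, 0.25) still hint at where the feature starts and ends.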

    Also from Zhou et al. 2016:

    “We believe that while the max and average functions are rather similar, the use of average pooling encourages the network to identify the complete extent of the object. The basic intuition behind this is that the loss for average pooling benefits when the network identifies all discriminative regions of an object as compared to max pooling”


    I’ll try to put it in simple words. I’m hoping you know how average pooling and max pooling differ. Max pooling (with a 2×2 window) rejects a big chunk of the data and retains at most one quarter of it. Average pooling, on the other hand, does not discard values outright: every input contributes to the output, so it retains more information than max pooling. This is what is usually believed to lead to better results. But it depends on the scenario as well.


    First, we use pooling so that we can cover the entire image (with its receptive field) as quickly as possible (exponentially). Without it, the number of parameters would be very high, and so would the computation time. To achieve this, we generally downsample our images using pooling operations, which help the receptive field grow from local to global quickly.

    But yes, pooling can be replaced by a special type of convolution known as dilated convolution. [1] Dilated convolutions are convolutions with a fixed, predefined spacing between kernel elements.

    The normal convolutions that we generally use can be considered dilated convolutions with a dilation factor of 1 (figure 1a). Convolutions with a dilation factor of 2 have a spacing of two between neighbouring kernel elements (or pixels, shown as red dots in figure 1b), and when stacked on the dilation-1 layer their receptive field is 7×7, and so on. So we need not downsample our images and can still cover global context in our convolutions. This helps preserve the spatial information as well as the overall resolution, and (with suitable padding) generates output of exactly the same dimensions as the input images.
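    The receptive-field arithmetic can be checked with a few lines of Python. This is a sketch that assumes 3×3 kernels, stride 1, and layers stacked one after another; the function names are my own:

```python
def effective_kernel_size(kernel_size, dilation):
    """Span covered by one dilated kernel along one axis."""
    return dilation * (kernel_size - 1) + 1

def stacked_receptive_field(kernel_size, dilations):
    """Receptive field (one axis) after stacking stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        # Each layer widens the field by (kernel_size - 1) * dilation.
        rf += (kernel_size - 1) * d
    return rf

print(effective_kernel_size(3, 1))            # ordinary 3x3 convolution
print(effective_kernel_size(3, 2))            # a single dilation-2 kernel spans 5
print(stacked_receptive_field(3, [1]))        # one layer
print(stacked_receptive_field(3, [1, 2]))     # dilation 1 then 2
print(stacked_receptive_field(3, [1, 2, 4]))  # dilation 1, 2, then 4
```

    With dilations doubling at every layer, the receptive field goes 3, 7, 15, which is the exponential growth that pooling is otherwise used to achieve, while the number of weights per layer stays at 3×3.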

    They perform pretty well too in the case of semantic segmentation as shown in figure 2 of their paper.

    Advantages of using them:

    • Detection of fine-details by processing inputs in higher resolutions
    • Broader view of the input to capture more contextual information
    • Faster run-time with fewer parameters

    If you are interested further, you can also read their extended work on residual network (CVPR 2017). [2]


    To prove something like this statement, we would first have to formulate a rigorous definition of what “invariance” is. Note that there are different types of invariance (e.g. translation, rotation, scale) and it is important to specify which we are talking about (e.g. SIFT features are scale-invariant but not always invariant to changes in illumination or perspective).

    For now, let’s consider translation invariance, since that is a form of invariance gained by using pooling layers. We will define invariance to mean that the class label of the prediction does not change if the image is translated by a small amount.

    Let’s take a concrete and simplified example. Suppose your CNN is predicting if the image is of a cat face, and that the last layer of the network (before the final classification layer) produces a dense pixel-wise feature map of the probability of the cat being present at every pixel location. Suppose that the activation is particularly high at one of the pixels, but low everywhere else in the image. What should the final layer look like to provide translation invariance?

    Consider a max pooling layer whose pooling region is the entire input feature map. Notice that we could translate the image arbitrarily, so long as the high-activation pixel remained inside the input feature map, and the result of maxing over the entire feature map would still be the same high-probability prediction of the cat being present. This is the translation invariance that we wanted!

    Note that the invariance offered by pooling is more general than this example I gave. Specifically, we are not required to be at the final layer of the network (e.g. pooling over intermediate convolutional kernel activations), have pixel wise predictions (e.g. pooling over receptive fields instead of individual pixels), or constrain the size of the pooling region (e.g. local invariance rather than global invariance).

    To summarize, by pooling over a region it doesn’t matter where exactly within the pooling region the high activation is because max pooling will ignore everything except that highest activation input. Other forms of pooling (e.g. average pooling) offer similar invariance but allow the other non-maximum parts of the pooling region to contribute more to the output.
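    A small NumPy sketch of the summary above, under the assumption of a single 4×4 feature map with one high activation (the values and the `max_pool2x2` helper are illustrative only):

```python
import numpy as np

def max_pool2x2(x):
    """Non-overlapping 2x2 max pooling over a 2D array with even sides."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def single_peak(row, col, value=0.9, size=4):
    """Hypothetical feature map: zeros everywhere except one high activation."""
    fm = np.zeros((size, size))
    fm[row, col] = value
    return fm

base = single_peak(0, 0)
inside = single_peak(1, 1)   # peak moved, but still in the same 2x2 window
crossed = single_peak(1, 2)  # peak moved into the neighbouring window

# Local invariance: the pooled output is unchanged while the peak stays in its window.
same_local = np.array_equal(max_pool2x2(base), max_pool2x2(inside))
# It does change once the peak crosses a window boundary.
changed_local = not np.array_equal(max_pool2x2(base), max_pool2x2(crossed))
# Global invariance: the max over the whole map is the same in all three cases.
same_global = base.max() == inside.max() == crossed.max()

print(same_local, changed_local, same_global)
```

    This is the difference between local and global invariance mentioned above: a 2×2 window only forgives shifts that stay inside the window, while pooling over the whole map forgives any shift that keeps the activation in the map.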

    In the past, average pooling was used. It is one of the most obvious ways to perform sub-sampling. Max pooling is equally simple, but has shown better empirical results in practice; this doesn’t mean max pooling *always* works better than average pooling. It’s difficult to prove anything in deep learning. We are primarily guided by intuition and empirical results.

    Randomized versions of pooling have also been explored. See here: [1512.07108] Recent Advances in Convolutional Neural Networks

    EDIT: to really answer your question… I actually find max pooling very intuitive. In general, pooling works because it helps introduce “invariance” to small translations and deformations of the objects. This is accomplished by losing precision about where a feature occurs: after a 2×2 pooling we know whether a feature occurs in each of the 2×2 regions, but we no longer know exactly where it occurs within those regions. This way, the net learns to be robust to small distortions of the object. Taking the max value of the activations in a region is an obvious way to check for the presence of a feature in that region, which is exactly what pooling is meant to do. I can’t think of a more intuitive way to do that!

    Max pooling is a downsampling strategy in convolutional neural networks. Please see the following figure for a more comprehensive understanding (this figure is from my PhD thesis). [Quora somehow blurs the image]

    Here in the figure, we show the operation upon the pixel space. Alternatively we can do a similar operation on some other mathematical space. Also, one can change the operation of taking ‘Max’ to something else, say taking an ‘Average’ (This is what is done in Average Pooling).

    Generally, for pedagogical purposes, max pooling is depicted with non-overlapping regions. This sometimes leads to the assumption that max pooling is usually performed without overlaps. In practice, however, this is mostly not the case. In almost all of the famous CNN architectures, max pooling is performed with overlapping regions. [Kernel Size, Stride]: AlexNet = [3×3, 2]; GoogLeNet = [3×3, 2], [3×3, 1]; VGG_CNN_S = [3×3, 3], [2×2, 2]; VGG_CNN_M and variants = [3×3, 2]; VGG_CNN_F = [3×3, 2]. We have thus shown in the figure all max pooling variants across the famous CNN architectures ([3×3, 3] is similar in nature to [2×2, 2]).
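    Whether a [kernel size, stride] pair overlaps is easy to check: windows overlap exactly when the kernel is larger than the stride. The sketch below lists the 1D window positions for the configurations above. It assumes no padding and rounding down at the border (some frameworks round up instead), and the helper names are my own:

```python
def pooling_windows(length, kernel, stride):
    """(start, end) index pairs, end exclusive, for unpadded 1D pooling."""
    return [(s, s + kernel) for s in range(0, length - kernel + 1, stride)]

def overlaps(kernel, stride):
    """Neighbouring windows share inputs only when the kernel exceeds the stride."""
    return kernel > stride

# [kernel, stride] pairs from the list above, applied along one axis.
configs = [(3, 2), (3, 1), (3, 3), (2, 2)]
for kernel, stride in configs:
    print(kernel, stride, overlaps(kernel, stride), pooling_windows(9, kernel, stride))
```

    For [3, 2] on an input of length 7 the windows are (0, 3), (2, 5), (4, 7): indices 2 and 4 are each seen by two windows. For [3, 3] and [2, 2] every input is seen exactly once, which is why those two behave alike.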

    One can Google these configurations or refer to the deploy files in BVLC Caffe!

    The pooling overlaps are in fact necessary in CNNs. As Hinton pointed out, without overlaps the pooling operation may lose important information regarding the location of the object.

    Quoting the first paper from the Google search for “global average pooling”. http://arxiv.org/pdf/1312.4400.pdf

    Instead of adopting the traditional fully connected layers for classification in CNN, we directly output the spatial average of the feature maps from the last mlpconv layer as the confidence of categories via a global average pooling layer, and then the resulting vector is fed into the softmax layer. In traditional CNN, it is difficult to interpret how the category level information from the objective cost layer is passed back to the previous convolution layer due to the fully connected layers which act as a black box in between. In contrast, global average pooling is more meaningful and interpretable as it enforces correspondance between feature maps and categories, which is made possible by a stronger local modeling using the micro network. Furthermore, the fully connected layers are prone to overfitting and heavily depend on dropout regularization [4] [5], while global average pooling is itself a structural regularizer, which natively prevents overfitting for the overall structure.


    In this paper, we propose another strategy called global average pooling to replace the traditional fully connected layers in CNN. The idea is to generate one feature map for each corresponding category of the classification task in the last mlpconv layer. Instead of adding fully connected layers on top of the feature maps, we take the average of each feature map, and the resulting vector is fed directly into the softmax layer. One advantage of global average pooling over the fully connected layers is that it is more native to the convolution structure by enforcing correspondences between feature maps and categories. Thus the feature maps can be easily interpreted as categories confidence maps. Another advantage is that there is no parameter to optimize in the global average pooling thus overfitting is avoided at this layer. Futhermore, global average pooling sums out the spatial information, thus it is more robust to spatial translations of the input.

    We can see global average pooling as a structural regularizer that explicitly enforces feature maps to be confidence maps of concepts (categories). This is made possible by the mlpconv layers, as they makes better approximation to the confidence maps than GLMs.

    Take the case of classification with 10 categories (CIFAR10, MNIST).

    If you have a 3D tensor of shape 8,8,128 at the end of your last convolution, the traditional method flattens it into a 1D vector of size 8×8×128 = 8192. You then add one or several fully connected layers and, at the end, a softmax layer that reduces the size to the 10 classification categories and applies the softmax operator.

    With global average pooling, you instead have a 3D tensor of shape 8,8,10 and compute the average over each 8,8 slice. You end up with a 3D tensor of shape 1,1,10, which you reshape into a 1D vector of shape 10. You then apply a softmax operator without any operation in between. The tensor before the average pooling is supposed to have as many channels as your model has classification categories.

    The paper is not clear, but when they say “softmax layer” they mean the softmax operator only, not a fully connected layer with a softmax activation.

    Not y=softmax(W*flatten(GAP(x))+b) but y=softmax(flatten(GAP(x))).
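    In NumPy terms, the shapes work out as follows. This is only a sketch with random values standing in for real feature maps, and it assumes a channels-last layout (height, width, channels):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)

# Traditional head: flatten an 8x8x128 tensor before the fully connected layers.
last_conv = rng.normal(size=(8, 8, 128))
flat = last_conv.reshape(-1)  # 8 * 8 * 128 = 8192 inputs to the first FC layer

# GAP head: the last conv layer already has one channel per class (10 here).
class_maps = rng.normal(size=(8, 8, 10))
gap = class_maps.mean(axis=(0, 1))  # average each 8x8 slice -> shape (10,)
y = softmax(gap)                    # no weights W or bias b in between

print(flat.shape, gap.shape, y.sum())
```

    The GAP head has no trainable parameters at all, whereas a single fully connected layer from the 8192 flattened values to 10 classes would already need 81,920 weights plus 10 biases.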

    To decide whether you need max pooling or average pooling, you need to understand how pooling helps in a CNN. While the convolution operation helps in obtaining the feature maps, the pooling operation plays an important role in compressing them. Pooling gives a little shift invariance along with some invariance to distortion.

    Now, the two methods of pooling act slightly differently, though both try to achieve the same objective. Mean pooling takes the mean of a block. If you want all values in a region to have an impact, something like a mean pool could be helpful. However, in the particular case where a block has equivalent positive and negative values, the resulting activation could be small.

    Max pooling, in contrast, takes the max of a block. If you want only the maximum value within the block to be given importance, a max pool could be used. So it depends primarily on the kind of application you want. Max pooling helps pick out the most salient features. Since neither has a definite advantage over the other, experimentation is the more suitable way to find out.
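    The cancellation case is easy to reproduce. A tiny sketch, with a made-up 2×2 block of pre-activation values that are symmetric around zero:

```python
import numpy as np

# Hypothetical 2x2 block with equally strong positive and negative responses.
block = np.array([
    [3.0, -3.0],
    [-2.0, 2.0],
])

mean_pooled = block.mean()  # positives and negatives cancel out
max_pooled = block.max()    # only the strongest positive response survives

print(mean_pooled, max_pooled)
```

    Here the mean is 0 even though the block clearly contains strong responses, while the max keeps the 3. If the pooling comes after a ReLU the negative values are already gone, so how much this matters depends on where the pooling layer sits.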
