***Blake Elias** is a Researcher at the **New England Complex Systems Institute**. **Shawn Jain** is an **AI Resident at Microsoft Research**.*

Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window.

We present a pooling method for convolutional neural networks as an alternative to max-pooling or average pooling. Our method, softmax-weighted average pooling (SWAP), applies average-pooling, but re-weights the inputs by the softmax of each window. While the forward-pass values are nearly identical to those of max-pooling, SWAP’s backward pass has the property that all elements in the window receive a gradient update, rather than just the maximum one. We hypothesize that these richer, more accurate gradients can improve the learning dynamics. Here, we instantiate this idea and investigate learning behavior on the CIFAR-10 dataset. We find that SWAP neither allows us to increase learning rate nor yields improved model performance.
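SWAP is not a standard library operator; a minimal PyTorch sketch of the idea described above (re-weight each pooling window by its softmax, then sum) might look like the following, using `F.unfold` to extract the windows. The function name and signature here are illustrative, not from the paper's code.

```python
import torch
import torch.nn.functional as F

def swap_pool(x, kernel_size=2, stride=2):
    """Softmax-weighted average pooling (SWAP): a sketch.

    For each pooling window, weight the inputs by the softmax over the
    window and sum -- a smooth approximation to max-pooling in which
    every element of the window receives a gradient.
    """
    n, c, h, w = x.shape
    # Extract sliding windows: (N, C * k * k, L), L = number of windows
    windows = F.unfold(x, kernel_size=kernel_size, stride=stride)
    windows = windows.view(n, c, kernel_size * kernel_size, -1)
    weights = torch.softmax(windows, dim=2)   # softmax within each window
    pooled = (weights * windows).sum(dim=2)   # softmax-weighted average
    out_h = (h - kernel_size) // stride + 1
    out_w = (w - kernel_size) // stride + 1
    return pooled.view(n, c, out_h, out_w)
```

Because softmax concentrates its mass on the largest entry, the forward value sits close to (but slightly below) the window maximum, while the backward pass spreads gradient over the whole window.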

While watching James Martens’ lecture on optimization, from DeepMind / UCL’s Deep Learning course, we noted his point that as learning progresses, you must either lower the learning rate or increase batch size to ensure convergence. Either of these techniques results in a more accurate estimate of the gradient. This got us thinking about the need for accurate gradients. Separately, we had been doing an in-depth review of how backpropagation computes gradients for all types of layers. In doing this exercise for convolution and pooling, we noted that max-pooling only computes a gradient with respect to the maximum value in a window. This discards information — how can we make this better? Could we get a more accurate estimate of the gradient by using all the information?

Max-pooling discards gradient information — how can we make this better?

Max-Pooling is typically used in CNNs for vision tasks as a downsampling method. For example, AlexNet used 3×3 Max-Pooling. [cite]


In vision applications, max-pooling takes a feature map as input, and outputs a smaller feature map. If the input image is 4×4, a 2×2 max-pooling operator with a stride of 2 (no overlap) will output a 2×2 feature map. The 2×2 kernel of the max-pooling operator has 2×2 non-overlapping ‘positions’ on the input feature map. For each position, the maximum value in the 2×2 window is selected as the value in the output feature map. The other values are discarded.
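The 4×4 example above can be reproduced directly with PyTorch's built-in operator:

```python
import torch
import torch.nn.functional as F

# A 4x4 input feature map; a 2x2 max-pooling window with stride 2
# (no overlap) yields a 2x2 output holding the max of each window.
x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)
y = F.max_pool2d(x, kernel_size=2, stride=2)
print(y)  # tensor([[[[ 5.,  7.], [13., 15.]]]])
```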

The implicit assumption is that "bigger values are better," i.e. larger values are more important to the final output. This modelling decision is motivated by our intuition, though it may not always hold. [Ed.: Maybe the other values matter as well! In a near-tie situation, propagating gradients to the second-largest value could make it the largest value. This may change the trajectory the model takes as it learns. Updating the second-largest value as well could be the better learning trajectory to follow.]

You might be wondering: is this differentiable? After all, deep learning requires that every operation in the model be differentiable in order to compute gradients. In the purely mathematical sense, max-pooling is not a differentiable operation. In practice, in the backward pass, the position corresponding to the maximum simply copies the inbound gradient; all the non-maximum positions set their gradients to zero. PyTorch implements this as a custom CUDA kernel.
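The sparsity of this backward pass is easy to see with autograd:

```python
import torch
import torch.nn.functional as F

# Max-pooling routes the gradient only to the maximum of each window;
# every other input position receives exactly zero gradient.
x = torch.tensor([[[[1., 2.],
                    [3., 4.]]]], requires_grad=True)
y = F.max_pool2d(x, kernel_size=2)
y.backward()  # y has a single element, so backward() needs no argument
print(x.grad)
# tensor([[[[0., 0.],
#           [0., 1.]]]])
```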

In other words, Max-Pooling generates sparse gradients. And it works! From AlexNet [cite] to ResNet [cite] to Reinforcement Learning [cite cite], it’s widely used.

Many variants have been developed: Average-Pooling outputs the average, instead of the max, over the window; Dilated Max-Pooling makes the window non-contiguous, sampling the input in a checkerboard-like pattern instead.
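Both variants are available in PyTorch; on the same 4×4 input as before:

```python
import torch
import torch.nn.functional as F

x = torch.arange(16, dtype=torch.float32).view(1, 1, 4, 4)

# Average-Pooling: the mean over each 2x2 window.
avg = F.avg_pool2d(x, kernel_size=2)
print(avg)  # tensor([[[[ 2.5,  4.5], [10.5, 12.5]]]])

# Dilated Max-Pooling: a 2x2 window with dilation 2 samples a
# non-contiguous (checkerboard-like) set of input positions.
dil = F.max_pool2d(x, kernel_size=2, dilation=2)
```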

Controversially, Geoff Hinton doesn’t like Max-Pooling:

> The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster.
>
> If the pools do not overlap, pooling loses valuable information about where things are. We need this information to detect precise relationships between the parts of an object. Its [sic] true that if the pools overlap enough, the positions of features will be accurately preserved by "coarse coding" (see my paper on "distributed representations" in 1986 for an explanation of this effect). But I no longer believe that coarse coding is the best way to represent the poses of objects relative to the viewer (by pose I mean position, orientation, and scale).

[Source: Geoff Hinton on Reddit.]

Max-Pooling generates sparse gradients, and sparse gradients discard too much information. With better gradient estimates, could we take larger steps by increasing the learning rate, and therefore converge faster?

Although the outbound gradients generated by Max-Pool are sparse, this operation is typically used in a Conv → Max-Pool chain of operations. Notice that the trainable parameters (i.e., the filter values, **F**) are all in the Conv operator. Note also that:

**dL/dF = Conv(X, dL/dO)**, where:

- *dL/dF* are the gradients with respect to the convolutional filter,
- *dL/dO* is the outbound gradient from Max-Pool, and
- *X* is the input to Conv (forward).
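This relation can be checked numerically with autograd (shapes here are illustrative, and `F.conv2d` computes cross-correlation, matching the convention used by deep learning frameworks):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

# Illustrative shapes: 1-channel 5x5 input, 3x3 filter, no padding.
X = torch.randn(1, 1, 5, 5)
filt = torch.randn(1, 1, 3, 3, requires_grad=True)

O = F.conv2d(X, filt)           # forward pass: 3x3 output
dL_dO = torch.randn_like(O)     # stand-in for the gradient from Max-Pool
O.backward(dL_dO)

# dL/dF equals the (cross-)convolution of X with dL/dO:
dL_dF = F.conv2d(X, dL_dO.view(1, 1, 3, 3))
assert torch.allclose(filt.grad, dL_dF, atol=1e-5)
```

If `dL_dO` were the sparse gradient emitted by Max-Pool, the same identity would hold; every position of `filt.grad` is still populated, but it is computed from a mostly-zero matrix.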

As a result, all positions in the convolutional filter **F** get gradients. However, those gradients are computed from a sparse matrix *dL/dO* instead of a dense matrix. (The degree of sparsity depends on the Max-Pool window size.)

Forward: