
# [Solved]: Convolutional Neural Network Feature Engineering?

Problem Detail:

I'm working through the TensorFlow tutorial, and I see how you go from a 28x28 image, through zero-padding and a 5x5x32 convolution, to a 28x28x32 output, followed by max-pooling and so on. What confuses me is that the 32 outputs seem to be described as filters acting on the 5x5 patches. The usual description is that these are equivalent to filters used for feature engineering, but when you set up the network you don't define any filters. So how do the filters come about?

I also read the question "Intuition for convolution in image processing", and although it's informative, it doesn't describe how the filters come about.

This is the tutorial; the layers are described in the sections on the first and second convolutional layers: https://www.tensorflow.org/versions/r0.7/tutorials/mnist/pros/index.html

#### Answered By : Edo Cohen

I think the root of your confusion lies in the assumption that a 2D convolution receives a 2-dimensional array as input. This is not true: the input of a 2D convolutional layer is a 3-dimensional array. At the beginning you have an image, which can be greyscale or RGB. Consider an RGB image of 28x28: it is represented as a tensor of 3x28x28 (a greyscale image would be 1x28x28), or in general colors x width x height, since the number of color channels can exceed 3 (for opacity, for example). A 2D convolutional layer with patch size 5x5 is applied over each 5x5 patch of the image, across all colors. To sum up: slice the image into cubes of the patch size (which overlap according to the stride parameter), and each of these patches, with all its colors, is the input to the filter.

The process I've described above is the application of a single filter over the input. The result of "running" the filter over the input is a matrix (as the dot product of each patch of the image with the filter is a scalar). Now take 32 such filters stacked one on top of the other: the output will be 32 matrices, where each matrix is the application of a different filter over the input.
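To make the "one filter gives one matrix, 32 filters give 32 matrices" idea concrete, here is a minimal sketch in plain Python. The function names `apply_filter` and `conv_layer` are made up for illustration (they are not TensorFlow functions), the image and filters are assumed to be square, and the random weights only stand in for weights that a real network would learn.

```python
import random

def apply_filter(image, kernel, stride=1):
    """Slide one filter over an image stored as colors x rows x cols (nested lists).

    Each patch, taken with all of its colors, is reduced to one scalar by a
    dot product with the filter, so the result is a 2D matrix.
    """
    colors = len(image)
    size = len(image[0])       # assumes a square image
    k = len(kernel[0])         # assumes a square patch
    out_size = (size - k) // stride + 1
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            for c in range(colors):
                for di in range(k):
                    for dj in range(k):
                        total += image[c][i * stride + di][j * stride + dj] * kernel[c][di][dj]
            row.append(total)
        output.append(row)
    return output

def conv_layer(image, filters, stride=1):
    """Apply every filter to the same input: one output matrix per filter."""
    return [apply_filter(image, f, stride) for f in filters]

random.seed(0)
# A 1x28x28 greyscale "image" and 32 filters of shape 1x5x5 (random stand-ins).
image = [[[random.random() for _ in range(28)] for _ in range(28)]]
filters = [[[[random.uniform(-1, 1) for _ in range(5)] for _ in range(5)]]
           for _ in range(32)]

feature_maps = conv_layer(image, filters)
print(len(feature_maps), len(feature_maps[0]), len(feature_maps[0][0]))
```

Nothing in this sketch defines what any filter looks for; the filters are just arrays of numbers, which is why you never specify them by hand when setting up the network.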

As far as dimensions go, assume you apply 5x5x32 over a 1x28x28 image (without padding and with stride=1). The output size along each axis is $\frac{\text{input size}-\text{patch size}}{\text{stride}}+1$, where I assumed everything is symmetrical for both axes. So the output is $\frac{28-5}{1}+1=24$, which means the second layer is of size 32x24x24.
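The size calculation can be checked with a small helper. This is only a sketch: `output_size` is a hypothetical name, and the `padding` argument is an extension of the formula above (it adds `padding` zeros on each side of the input) included to show why the zero-padded tutorial layer keeps the 28x28 size mentioned in the question. Integer division is assumed for strides that do not divide evenly.

```python
def output_size(input_size, patch_size, stride=1, padding=0):
    """Output length along one axis of a convolution.

    With padding=0 this is the formula (input - patch) / stride + 1.
    """
    return (input_size + 2 * padding - patch_size) // stride + 1

no_padding = output_size(28, 5)               # the case discussed in the answer
zero_padded = output_size(28, 5, padding=2)   # two zeros added on every side

print(no_padding, zero_padded)
```

With a 5x5 patch, two pixels of zero-padding on each side are exactly what is needed for the output to stay at 28.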

You can either think of the second layer as an "image" of size 24x24 with 32 "colors" which is weird but some people do it, or you can think about everything in terms of 3 dimensional arrays and understand exactly how a filter is applied over each 3 dimensional array.

Why is this referred to as filters?

Consider the case where you apply a 2D convolutional layer with parameters 5x5x1 (and, for simplicity, greyscale images). What does this layer actually do? It computes the dot product of each 5x5 patch with a 2D weight matrix of size 5x5, and the results are organized into a 2-dimensional array, which is the output layer. The 5x5 weight matrix is called a "filter". Think of it as a Gaussian for a moment: the 5x5 weight matrix is applied over each patch of the image, thus blurring the image. This is applying a Gaussian filter over the image. Now, instead of a fixed Gaussian filter, you learn the weights during training and get a general filter. A layer of 5x5x32 is the same, but with 32 different learned filters.
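The Gaussian analogy can be sketched in a few lines of plain Python. The helper names `gaussian_kernel` and `filter2d` and the choice of `sigma=1.0` are assumptions made for this illustration; the point is only that a hand-designed blur filter and a learned filter go through exactly the same patch-by-patch dot product.

```python
import math

def gaussian_kernel(size=5, sigma=1.0):
    """Build a size x size Gaussian weight matrix, normalized to sum to 1."""
    center = size // 2
    raw = [[math.exp(-((r - center) ** 2 + (c - center) ** 2) / (2 * sigma ** 2))
            for c in range(size)] for r in range(size)]
    total = sum(sum(row) for row in raw)
    return [[value / total for value in row] for row in raw]

def filter2d(image, kernel):
    """Dot product of the kernel with every patch of a greyscale image (no padding, stride 1)."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [[sum(image[i + di][j + dj] * kernel[di][dj]
                 for di in range(k) for dj in range(k))
             for j in range(out_cols)] for i in range(out_rows)]

# A 9x9 black image with a single bright pixel in the middle.
image = [[0.0] * 9 for _ in range(9)]
image[4][4] = 1.0

# The single bright pixel is smeared over a 5x5 neighborhood: the image is blurred.
blurred = filter2d(image, gaussian_kernel())
for row in blurred:
    print(" ".join(f"{value:.3f}" for value in row))
```

In a convolutional layer, the 25 numbers returned by `gaussian_kernel()` would instead be free parameters adjusted during training, so each of the 32 filters ends up with whatever weights help the network most.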