Convolutional Neural Networks (CNNs) are a type of neural network architecture inspired by the visual cortex in mammalian brains. This area of the brain is optimized to process the three-dimensional data of the real world: height, width, and depth. Neurobiologists tend not to appreciate such comparisons, as the similarity to real biological systems is vague; neural networks are simplified models of biological architecture. In the computing world, however, these nature-inspired tools have surpassed the performance of existing techniques, and their similarity to biological neural systems fuels excitement among readers.
An Extra Dimension
A CNN’s ability to process three-dimensional data allows information to be represented in an extra dimension, essentially allowing finer granularity of representation and a greater range of relationships. Consider a list, a one-dimensional data store with rows, versus a table, a two-dimensional store with rows and columns. A list is useful; we can check off items from a list as we pick up our groceries, but we can’t gain much insight from analysing it. We need to add and organise information into columns. Even simple information such as grocery items and quantities is tabulated data. Taking the analogy beyond a normal shopping list, we can tabulate item nutrition: energy, saturated fat, sugars, salt, and so on. Tabulating data allows us to go far beyond the functionality of a simple list: we can establish relationships between items. We can cluster high-fat items or low-sugar items, for example.
List vs Table. The additional axis (dimension) allows greater insight into the same data.
The addition of a third dimension similarly increases the power of information representation. CNNs have found a place in text analysis for this reason: their ability to plot words in high-dimensional space can reveal meaning which is otherwise unseen. In this example, high-dimensional space can be seen as adding dimensions to words. An analogy can illuminate this: a person can be seen to have dimensions such as date of birth, place of birth, hometown, mother, father, siblings, and so on. If a data point representing a person and their associated information is plotted into high-dimensional space, clusters can be formed and insights derived. Families could be clustered, or ages could be clustered, and insight derived from those clusters.
High-dimensional space (Google Developers).
The gif above shows a visualization of data clustered in high-dimensional space. The MNIST dataset, a corpus of handwritten single digits, was analysed to computationally distinguish written numbers. In the image, you can see numbers are clearly clustered together by a machine learning algorithm. You can also see occasional errors where the formation of a number has been identified incorrectly; for example, the number 4 mistaken for the number 9.
The MNIST dataset in high-dimensional space (Google Developers).
A video from Google Developers explains the concept of high-dimensional space (go ahead, I’ll wait while you watch). The additional dimension allows the machine learning algorithm (the CNN) to learn which numbers are close to others and thus which are likely to be the same. This isn’t so different from how other machine learning algorithms operate, measuring the distance between data points to determine which category a data point belongs to.
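The distance-based idea above can be sketched in a few lines. The embeddings below are invented for illustration (a real system would learn them from data), but the mechanic is the same: the closest labelled point in the space suggests the likely label.

```python
import numpy as np

# Hypothetical 3-D embeddings for a few handwritten digits.
# In a real system these coordinates would come from a trained model.
embeddings = {
    "4a": np.array([0.9, 0.1, 0.2]),
    "4b": np.array([0.8, 0.2, 0.1]),
    "9a": np.array([0.7, 0.3, 0.2]),  # a 9 written much like a 4
    "1a": np.array([0.1, 0.9, 0.8]),
}

query = np.array([0.88, 0.12, 0.18])  # an unlabeled digit to classify

# Find the nearest labelled point by Euclidean distance.
nearest = min(embeddings, key=lambda k: np.linalg.norm(embeddings[k] - query))
print(nearest)  # -> "4a": the closest cluster suggests the digit is a 4
```

With more dimensions, the same distance calculation captures richer relationships, which is exactly why the extra dimensions help.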
Convolutional Neural Networks in Security
Researchers (Jeon et al) used the three dimensions of a convolutional neural network to learn the behaviour of Internet of Things (IoT) malware by encoding behavioural data recorded by sensors into images. Data was gathered from three areas: memory; ‘integrated behavioural features’ such as system calls, processes, and network information; and behaviour frequency, where the previously collected data was analysed for frequency. The sensor data was converted into data the CNN can understand, integers, and rescaled to ensure the best possible outcome (see Handling Non-numerical Data and Feature Scaling in Dataset Preparation). The final stage in the process was ‘channelization’, where the data is encoded into pixels in the separate Red, Green, and Blue (RGB) channels of an image and combined into a single RGB image.
The ‘channelization’ stage in Jeon et al's system.
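The channelization idea can be sketched as stacking three two-dimensional feature arrays into the colour channels of one image. This is only an illustration of the concept, not Jeon et al's exact pipeline: the feature names follow their three data areas, but the values and the 8x8 size are invented here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Three behavioural feature sets, each already converted to integers
# and scaled into the 0-255 pixel range (values invented for this sketch).
memory_features    = rng.integers(0, 256, size=(8, 8))  # red channel
behaviour_features = rng.integers(0, 256, size=(8, 8))  # green channel
frequency_features = rng.integers(0, 256, size=(8, 8))  # blue channel

# Stack along the last axis to form a single (8, 8, 3) RGB image.
rgb_image = np.stack(
    [memory_features, behaviour_features, frequency_features], axis=-1
).astype(np.uint8)

print(rgb_image.shape)  # (8, 8, 3): one image carrying all three feature sets
```

The resulting array has the same shape as any small RGB photograph, so it can be fed to a CNN unchanged.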
These complete RGB images now contain a great deal of sensor data representing the behaviour of the IoT systems they were collected from. Different patterns of behaviour create signatures which can be analysed by the CNN to determine which systems contain benign, non-malicious behaviour and which contain suspicious behaviour that is likely malware.
Benign behavioural visual signature (Jeon et al).
Malicious behavioural visual signature (Jeon et al).
Using this method, Jeon et al achieved the highest accuracy of the techniques they compared their own method against, 99.2%, and the lowest false positive rate, 0.63%. Jeon et al point out a weakness of the approach: it uses dynamic analysis alone to study the malicious behaviour, and they do not attempt to mitigate the ability of malware to evade analysis.
This example is a useful showcase of how data can be encoded into different forms appropriate for different machine learning systems. Jeon et al’s research article is open access if you would like to learn more: Dynamic Analysis for IoT Malware Detection with Convolutional Neural Network model.
Technical Details of Convolutional Neural Networks
Convolutional neural networks are similar to other neural networks; they are made up of individual neurons with learnable weights and biases, and they calculate the output of functions based on the purpose of that layer of the network. However, they have a number of technical features which differentiate them from other techniques. Below, we discuss the internals of CNNs to gain an understanding of how they work.
A convolutional neural network architecture.
Convolutional neural networks are optimized to process images. Regular neural networks cannot scale well to typical image sizes. Say an image is 300 pixels wide and 300 pixels high; to process this image, a single fully-connected neuron in our regular network would need 270,000 individual weights (300 × 300 × 3 = 270,000, the 3 accounting for the three colour channels, RGB). The fully-connected layers are a disadvantage here and create an enormous cost to using regular neural networks for image processing. CNNs reduce the complexity of image data using a technique we explore below called convolutions, where filters pass over the image and extract features from each area to pass on to the next layer.
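The weight count above is simple arithmetic, worth making explicit because it grows so quickly with image size:

```python
# Weights needed for ONE fully-connected neuron on an RGB image:
# every pixel in every colour channel gets its own weight.
def weights_per_neuron(width, height, channels=3):
    return width * height * channels

print(weights_per_neuron(300, 300))  # 270000, as in the text
print(weights_per_neuron(1920, 1080))  # millions, for a full-HD frame
```

And that is a single neuron; a layer of such neurons multiplies the cost again, which is why fully-connected layers alone are impractical for images.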
Left: the image humans perceive. Right: the image computers perceive.
Convolutional neural networks ‘see’ the image as a grid of values, representing the data at each pixel (as seen above). If you have ever used a hex editor to ‘view’ an image, you’ll have seen similar output.
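A toy example makes this concrete. The tiny array below is what the network actually receives: not light and dark pixels, just numbers (here 0 for black and 255 for white in a made-up 3x3 grayscale ‘image’).

```python
import numpy as np

# A 3x3 grayscale 'image' as the network sees it: a grid of pixel values.
image = np.array([
    [255,   0, 255],
    [  0, 255,   0],
    [255,   0, 255],
], dtype=np.uint8)

# Flattened, it is just the stream of values the text describes.
print(image.flatten())  # [255   0 255   0 255   0 255   0 255]
```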
The image below shows a convolutional neural network’s convolution layer with an input image size of 32x32 (much like the images in the CIFAR-10 dataset). Images are processed individually: sections of an image are scanned by a filter in the convolution layer, and the filter moves across the image, multiplying the filter’s values with the original values of each pixel and summing the results to produce a single output for each filter position. This creates a grid of values called a feature map, or activation map, representing the output for each filter position.
Left: Input image. Right: Feature map.
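The scan-multiply-sum process can be written out directly. The sketch below implements a plain 2-D convolution (no padding) over a toy 5x5 input with a simple all-ones filter; the input values and filter are invented for illustration.

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide `kernel` over `image`; at each position, multiply the filter
    values with the pixels beneath and sum them into one output value."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            feature_map[i, j] = np.sum(patch * kernel)
    return feature_map

image = np.arange(25).reshape(5, 5)   # a toy 5x5 input
kernel = np.ones((3, 3))              # a simple 3x3 filter
fm = convolve2d(image, kernel)
print(fm.shape)  # (3, 3): one output value per filter position
```

Each cell of the resulting 3x3 feature map corresponds to one position of the filter over the input, exactly the grid the text describes.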
The image below shows the result of the filter calculations. The image on the left is transformed into the feature map on the right. In this example, you can see the outline of a shape similar to the letter “H” from the input image as data in the grid, while the “blank” area is represented by zero values. Note that algorithms may not see this area as blank; the colour white still has a value to an algorithm.
The symbol from the left (H) represented in the example feature map on the right.
Typically, a CNN model looks similar to the figure above. In this example the network is designed to predict the contents of an image; features of the image are extracted by filters (also called kernels) applied during the convolution layers. We can think of these filters as viewing a small part of an image (or whatever is contained within the matrix) in much the same way you would use a camera to take a panorama, or multi-frame photograph. This is clarified in figure 2.7. The frame (the area in yellow) passes across the field of view (the area in green); instead of taking photographs, the filter is gathering features to pass on to the next layer in the network. We can see the filter incrementally moving across the matrix row by row. The number of pixels the filter moves each step as it passes across the matrix is called the stride. The stride can be used to adjust the sampling of a feature set: a small stride results in a dense feature set, with a large proportion of the matrix sampled; a large stride reduces density and samples a smaller proportion of the matrix. Large strides will result in some information being lost during the convolution layers, as that information is skipped. We aim to reduce the sample as much as we can before we encounter negative effects; down-sampling in this way is a common practice which makes our network less complex and easier to train. The negative effects of large strides can be mitigated using a technique called pooling.
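The effect of the stride on feature-map density follows from the standard output-size formula for a convolution without padding:

```python
# Output size of a convolution with no padding:
#   out = (input_size - filter_size) // stride + 1
def output_size(input_size, filter_size, stride):
    return (input_size - filter_size) // stride + 1

# A 32-pixel-wide input scanned by a 4-pixel-wide filter:
for stride in (1, 2, 4):
    print(stride, output_size(32, 4, stride))
# stride 1 -> 29 positions per row (dense sampling)
# stride 2 -> 15 positions per row
# stride 4 -> 8 positions per row (sparse, heavily down-sampled)
```

Quadrupling the stride here shrinks the feature map from 29x29 to 8x8, showing how quickly density, and potentially information, falls away.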
Pooling layers mitigate the loss of information from strides. Added after the convolution layer and activation layer (e.g. ReLU), the pooling layer’s operation is similar to a filter being applied, though typically much smaller, in 2x2 pixel grids. This small grid gathers a new set of data points for pooled feature maps. In our 2x2 example, it will reduce each dimension of the feature map by a factor of 2.
There are two algorithms commonly applied in pooling layers: max and average. The max algorithm takes the maximum value from the area under the filter; the average algorithm takes the average value from the area under the filter. This allows some of the information lost in the stride to be captured and passed to the next layer.
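Both variants can be sketched with one small function. The 4x4 feature-map values below are invented; the 2x2, non-overlapping windows match the example in the text and halve each dimension.

```python
import numpy as np

def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping size x size pooling; mode is 'max' or 'average'."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(h // size):
        for j in range(w // size):
            window = feature_map[i*size:(i+1)*size, j*size:(j+1)*size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fm = np.array([[1, 3, 2, 0],
               [4, 6, 5, 1],
               [7, 2, 8, 3],
               [0, 9, 4, 4]], dtype=float)

print(pool2d(fm, mode="max"))      # [[6. 5.] [9. 8.]]
print(pool2d(fm, mode="average"))  # [[3.5  2.  ] [4.5  4.75]]
```

Max pooling keeps the strongest response in each window, while average pooling summarizes the whole window; either way the 4x4 map becomes 2x2.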
On this page we have learned about the additional dimension used by convolutional neural networks, a useful way to understand CNNs as beginners. We learned the term ‘high-dimensional space’ and how extra dimensions can allow new analyses and insights to be discovered. We went into detail on how convolutional neural networks function: passing filters over the pixels of an image, striding over data to avoid the need to sample every pixel, and implementing pooling to mitigate the negative effects of skipping data during each stride.
- Jeon, J., Park, J. and Jeong, Y. (2020) Dynamic Analysis for IoT Malware Detection with Convolutional Neural Network model. IEEE. https://ieeexplore.ieee.org/document/9097224
- Google Developers (2016) A.I. Experiments: Visualizing High-Dimensional Space. YouTube. https://www.youtube.com/watch?v=wvsE8jm1GzE
- Stanford (2020a) Deep Visual-Semantic Alignments for Generating Image Descriptions. https://cs.stanford.edu/people/karpathy/deepimagesent/
- Stanford (2020b) CS231n: Convolutional Neural Networks for Visual Recognition. https://cs231n.github.io/
- Stanford (2020c) Convolutional Neural Networks (CNNs / ConvNets). https://cs231n.github.io/convolutional-networks/