Training Regression Models

Regression is used to predict house prices and stock prices, and in the security field it has been used to predict denial of service (DoS) severity and to forecast ‘cyber weather’, as we saw with Park et al. in the introduction. Our initial learning here won’t be as complex. This page serves as an introduction to training regression models.

This page is split into two sections, one for each of the two purposes regression can be used for: prediction and classification. We will discuss three regression techniques, each more powerful than the last - or more suited to specific tasks - and we will build an example of each.

Each section includes a link to a Google CoLab live environment for you to follow along with the code if you haven’t installed Anaconda / Jupyter Notebooks.

Regression Predictors

Linear Regression

Linear regression is a technique from the field of statistics; many aspects of machine learning are borrowed from other areas, including statistics and computer science. In machine learning, regression is a supervised technique (it takes labelled data) which processes data to find the relationship between an input value (x) and a predicted value (y), so that a line of best fit can be drawn through the data on a graph. A linear relationship is a proportional relationship between changes in related values: when one value increases so does the other, and when one decreases so does the other. Linear regression assumes there is a linear relationship between these values and attempts to predict a linear outcome - the best fit line is straight.

A popular example is the number of cricket (the insect) chirps per minute mapped to temperature; as temperature increases crickets chirp more frequently, their chirps slow down as temperature decreases. You can predict temperature from the frequency of cricket chirps, and you can predict the number of cricket chirps per minute from temperature.

A graph illustrating cricket chirps versus temperature.

An example of a linear graph, temperature versus cricket chirps.

Any straight line can be represented by the equation y = mx + b. In our example we are trying to predict the temperature from the number of chirps per minute: y is the predicted temperature we seek, m is the slope of the line, x is the number of chirps per minute, and b is where the line crosses the y-axis (the y-intercept). To see why this works, remember our aim is to draw the best fit line: the slope m and the y-intercept b together define exactly one line, so once we have learned them from the data we can draw the line and predict y for any x. We won’t use maths on this page again, but this will help should you read about linear regression anywhere else.
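A minimal sketch of the equation in code is shown below. The slope and intercept values here are made up purely for illustration; they are not fitted from any real cricket data.

# Predict temperature from chirps per minute using y = mx + b
m = 0.25      # hypothetical slope: degrees gained per extra chirp per minute
b = 40.0      # hypothetical y-intercept: the temperature at zero chirps
chirps = 80   # chirps per minute (x)

temperature = m * chirps + b   # y = mx + b
print(temperature)             # 60.0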

Despite linear regression’s strengths for some tasks, it has downsides for others. It assumes a linear relationship between data inputs, so it is not useful for non-linear data, and it is a relatively simple technique which cannot handle much complexity. However, it has found a place in many applications due to its simplicity and ease of understanding. As well as non-security applications such as forecasting house prices, sales and the stock market, linear regression techniques have been used to forecast cyber threat intelligence with high accuracy. We discussed in the introduction, Machine Learning Examples in Security, how Park et al. forecast ‘cyber weather’, accurately warning of mass worm attacks within large networks.

Ordinary Least Squares

Cost functions determine the optimal predicted values for us to create a line which best fits the input data. A cost function measures the difference between the actual and predicted values; this difference is called the “error”. The scikit-learn Linear Regression implementation uses the Ordinary Least Squares technique, where the aim is to minimise the sum of the squared distances from each data point to the regression line - that is, the sum of all squared errors. This can be better understood with the graph below.

A graph illustrating least squares; minimising the total area of the squares creates the optimal line.

An illustrative example of least squares applied to data points.

The graph shows the purpose of the ordinary least squares function: it finds the line which makes the total area of those squares as small as possible, creating a line which best fits the data points.
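To make the idea concrete, the short sketch below computes the quantity ordinary least squares minimises - the sum of squared errors - for one candidate line. The data points and the line’s coefficients are invented purely for illustration.

import numpy as np

# Illustrative data points (made up for this sketch)
x_points = np.array([1, 2, 3, 4, 5])
y_points = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# A candidate line y = m*x + b
m, b = 2.0, 0.1
predictions = m * x_points + b

# Ordinary least squares seeks the m and b that make this sum as small as possible
squared_errors = (y_points - predictions) ** 2
print(squared_errors.sum())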

Of course, this complexity is hidden from the user. All we as programmers do is pass parameters into the linear_model.LinearRegression() function, while staying aware of the limitations of the technique and the assumptions which linear regression makes.

Train a Linear Regression Model

Let’s train a linear regression model using scikit-learn’s linear_model.LinearRegression() function. We will keep things simple and use a prediction dataset built into scikit-learn, the Boston house prices dataset, which we will use to predict house price trends. In later sections we will focus on more security-specific tasks which require dataset wrangling and further considerations. Here we want to understand how the model is trained and gain some exposure to machine learning with scikit-learn and the ease of using machine learning libraries.

Follow along with the code for a linear regression model in a Google CoLab live environment.

Import packages

First, we need to import various packages (also called libraries) to allow us to manipulate, view and process the data: pandas for data manipulation, matplotlib for visualizing data, the Boston dataset and scikit-learn’s linear regression algorithm.

import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression

Get the data

Now we have access to those packages, we load the dataset (by assigning it to a variable) and create a pandas DataFrame from it. If you’ve forgotten DataFrames from the Datasets and Data Collection section, they are similar to tables and a powerful tool in pandas, allowing easy manipulation of data with less code and fewer headaches. The raw Boston dataset doesn’t include column headings - they’re squirrelled away in a separate variable - so we add them back here for increased understanding.

# Assign dataset to variable
boston = load_boston()
# Load data from dataset into Panda's DataFrame and assign DataFrame to variable
data = pd.DataFrame(boston.data)
# Add column names to DataFrame
data.columns = boston.feature_names
# Print DataFrame
data.head()

We can also visualize the data to better understand what we are working with. Below we visualize the price, the value we will later predict, as a histogram using matplotlib’s hist() function. You can try applying the various methods we discussed in Exploring Datasets here too.

# Plot histogram of price (boston.target)
plt.figure(figsize=(4, 3))
plt.hist(boston.target)
plt.xlabel('price ($1000s)')
plt.ylabel('count')
plt.tight_layout()
A histogram showing the distribution of Boston house prices. Prices resemble a mountain; most house prices are around 20,000 dollars.

A histogram showing the distribution of Boston house prices.

Assign values to axis

Linear regression needs the independent variables (the features, x) separated from the dependent variable we want to predict (the target, y); it’s likely any dataset you work on will need to be manipulated to make this split. To illustrate this, we’ll add the price to the DataFrame, then take it away as you would with another dataset. You may recognise the code - we discussed a similar process in the Dataset Preparation page.

# Add PRICE column filled with prices from boston.target
data['PRICE'] = boston.target
# Print DataFrame, now with PRICE column
data.head()

Now to remove that extra column.

# Drop the column named 'PRICE' (drop() returns a new DataFrame, so reassign it)
data = data.drop('PRICE', axis=1)

Moving on, we assign our features to x and the target value to y.

# Assign values to x and y
x = data
y = boston.target

Split the dataset

Now we must split our dataset into training and test datasets, keeping some data aside to test the accuracy of our model after training. Scikit-learn includes a function for this, train_test_split(), which we use on x and y to create training and test segments of each. Below we pass test_size and random_state parameters, keeping 30% of the dataset aside for testing. The random_state value seeds the shuffle applied to the dataset and, when set, provides consistency across function calls (each time you use the function). See the function’s documentation for more.

from sklearn.model_selection import train_test_split
# Split the dataset, keeping 30% aside for the test dataset
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 4)

Train a model

Invoking the LinearRegression() function, we train the algorithm by fitting the training data to it. lr stands for LinearRegression - just a short variable name.

# Assign LinearRegression algorithm to variable 
lr = LinearRegression()
# Train the model using the training sets 
lr.fit(x_train, y_train)

Predict prices

Now our model is trained, we apply it to the training data to form some predictions and visualize the results.

# Predict values using LinearRegression() lr
y_pred = lr.predict(x_train)

# Plot price predictions
plt.figure(figsize=(6, 4))
plt.scatter(y_train, y_pred)
plt.plot([0, 50], [0, 50], '--k')
plt.xlabel("Actucal price ($1000s)")
plt.ylabel("Predicted price ($1000s)")
plt.tight_layout()
A graph showing the predicted and actual prices of Boston houses.

A graph of the predicted and actual prices of Boston houses.

Evaluate

We can calculate metrics, such as the Mean Squared Error (MSE), using the scikit-learn metrics package. We will discuss further methods of evaluation in the Model Evaluation page ahead.

from sklearn import metrics
# Calculate and print MSE
print('MSE:', metrics.mean_squared_error(y_train, y_pred))
MSE: 19.07368870346903

Mean Squared Error

Mean Squared Error (MSE) is a method of estimating the accuracy of a predicted value; it measures the average squared difference between the predicted and actual values, and is a common way of measuring the accuracy of a linear regression model. We used scikit-learn’s MSE function, mean_squared_error() from the metrics package, above to compare the training data (actual values) and the predicted values to determine the error. MSE outputs a positive value: the lower the value, the smaller the difference between the predicted and actual values, and the better the technique you are measuring has performed.
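For a clearer picture of what the metric is doing, the same number can be computed directly from its definition. The short sketch below assumes the y_train and y_pred arrays from the code above and should agree with the metrics.mean_squared_error() result.

import numpy as np
# MSE is the mean of the squared differences between actual and predicted values
mse = np.mean((y_train - y_pred) ** 2)
print('MSE:', mse)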

Polynomial Regression

Polynomial regression is a technique which allows us to work with more complex data which does not fit a straight line. We can use a simple scikit-learn function, PolynomialFeatures(), to preprocess our data and then use the same LinearRegression() function as above on non-linear data. For example, the Boston housing dataset contains some non-linear relationships.

The graph below shows the relationship between the boston.target value, the median house value (MEDV), and the ‘lower status of the population’ (LSTAT), a measurement combining the proportion of adults without some high school education and the proportion of male workers classified as labourers.

Median house value (MEDV) vs. LSTAT, non-linear data.

As you can see, a straight line would not accurately represent the data. We implement polynomial regression below.

You can follow along in a Google CoLab live environment.

Import packages

import operator
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

Load dataset

Load the data in the same way as we did previously.

# Assign dataset to variable
boston = load_boston()
# Load data from dataset into Panda's DataFrame and assign DataFrame to variable
data = pd.DataFrame(boston.data)
# Add column names to DataFrame
data.columns = boston.feature_names
# Print DataFrame
data.head()

Assign data to axis

We assign the LSTAT column to x, and the MEDV data (boston.target) to y.

x = data['LSTAT']
y = boston.target

You can visualize the data on a scatter graph here.

# Plot scatter graph of x, y
plt.scatter(x,y, s=10)
# Axis labels
plt.xlabel("Lower Status of Pop (LSTAT)")
plt.ylabel("Mediam House Value (MEDV)")
# Show graph
plt.show()

Create polynomial features

Transform the data points into polynomial features using the scikit-learn function.

# Assign PolynomialFeatures() function to variable for use
poly = PolynomialFeatures(degree=2, include_bias=False)
# Transform and fit the x values. reshape converts the data to the 2D shape the function expects.
x_poly = poly.fit_transform(x.values.reshape(-1,1))
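With degree=2 and include_bias=False, each LSTAT value x is expanded into the pair [x, x²], which is what allows a straight-line model to fit a curve. You can inspect the transformed data to confirm this:

# Each row is now [LSTAT, LSTAT squared]
print(x_poly[:3])
# Two columns: the original feature and its square
print(x_poly.shape)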

Make predictions

Invoke LinearRegression() as pr (for polynomial regression) and train the model using fit(). Once trained, make predictions with predict().

# Assign the LinearRegression() function to variable for use
pr = LinearRegression()
# Train model
pr.fit(x_poly, y)
# Make predictions
y_pred = pr.predict(x_poly)

Visualize results

Below we visualize our results, fitting a curved best fit line onto the dataset. Some wrangling was required to get a single line rather than a mess of lines: matplotlib joins the plotted points in the order they are given, and our LSTAT values are not sorted, so the line zigzags back and forth across the graph. Below we sort the values and plot the resulting single line. Thanks to a blog post on Towards Data Science for the sorting fix.

# Plot data
plt.scatter(x, y, s=10)
# Graph labels
plt.xlabel("Lower Status of Pop (LSTAT)")
plt.ylabel("Mediam House Value (MEDV)")
# These lines courtesy of Towards Data Science post
sort_axis = operator.itemgetter(0)
sort = sorted(zip(x,y_pred), key=sort_axis)
x, y_pred = zip(*sort)
# Plot line
plt.plot(x, y_pred, color='#ee8866')
# Plot layout styles (default)
plt.show()

A graph showing polynomial regression fitting a non-linear best fit line.

Regression Classifiers

The regression techniques we have discussed so far are not suited to classification problems. Andrew Ng, a world-renowned Stanford AI scientist and course leader of the popular Coursera machine learning course, describes the issue with little math in an early Stanford YouTube video lecture. In short, linear regression is sensitive to outliers and to large differences between independent variables. Below we work with logistic regression, which is suitable for classification tasks.

Logistic Regression

The logistic function, also called the sigmoid function, forms the core of logistic regression. The technique performs the same job as linear regression but squashes its output into the range 0 to 1, represented by an S-shaped curve. These values can be mapped to probabilities; for example, 0.2 is 20% and 0.85 is 85%. It can be said that logistic regression produces the probability of an input sitting within a class - for example, is this email spam or not?

The logistic function, also known as the sigmoid function: an S-shaped curve between y-axis values 0 and 1.

Logistic function, also known as the sigmoid function.
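Below is a minimal sketch of the logistic (sigmoid) function itself; whatever value goes in, the output always lands between 0 and 1, which is what lets us read it as a probability.

import numpy as np

def sigmoid(z):
    # Squash any real number into the range (0, 1)
    return 1 / (1 + np.exp(-z))

print(sigmoid(-4), sigmoid(0), sigmoid(4))
# Roughly 0.018, 0.5 and 0.982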

We will use scikit-learn’s LogisticRegression() function, which implements the technique.

Train a Logistic Regression Model

We will work with a larger dataset for this example, moving on from the built-in datasets, and use the MNIST handwritten digit dataset: 70,000 sample images of handwritten digits which we are going to identify using machine learning. Scikit-learn has a handy function for getting this data, fetch_openml(). The function accesses open source machine learning datasets from www.openml.org, a large repository hosted by two Dutch universities, TU Eindhoven and Leiden University.

You can follow along in a Google CoLab live environment.

Get data

First, we import the various packages we will need.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

We load the data via fetch_openml(); you can see further information on the MNIST dataset on OpenML or the original authors’ page.

# Get MNIST dataset
mnist = fetch_openml('mnist_784')

Split the dataset

We split the dataset into training and test datasets. MNIST has 60,000 training points and 10,000 testing points; we use a smaller number to decrease training time. The reduction doesn’t negatively affect accuracy - this was tested through multiple runs at different sizes.

# Split dataset into 20K training and 10K test data
train_img, test_img, train_lbl, test_lbl = train_test_split(
 mnist.data, mnist.target, train_size=20000, test_size=10000, random_state=0)

View the data

We can view the MNIST data. We loop through the images and labels in the training dataset and plot them to subplots using matplotlib.

# View the dataset. Loop through image and label and create matplotlib sublots for each
plt.figure(figsize=(15,4))
for index, (image, label) in enumerate(zip(train_img[0:5], train_lbl[0:5])):
 plt.subplot(1, 5, index + 1)
 plt.imshow(np.reshape(image, (28,28)), cmap=plt.cm.gray)
 plt.title('Training: %s\n' % label, fontsize = 20)
Example handwritten digits from the MNIST dataset.

Examples of handwritten digits from the MNIST dataset.

Train the model

Now we can train our logistic regression classifier. We set a specific optimization algorithm (solver) - in this case this one seemed to work best - and a tolerance for the stopping criteria (tol); without relaxing tol the model failed to converge, that is, to find an optimal fit, within the iteration limit. We will discuss convergence in the section ahead, Evaluation & Tuning.

# Assign LogisticRegression() function to variable for use.
# Set to specific solver, and set tolerance for stopping criteria (tol)
lr = LogisticRegression(solver='saga', tol=0.1)
# Train model
lr.fit(train_img, train_lbl)

View results

We can view a number of predictions by printing out a range; a single prediction can be printed by passing a single-element slice, e.g. test_img[0:1].

# Predict and print predictions 0 - 10
lr.predict(test_img[0:10])
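To see how those predictions compare with reality, print the true labels for the same ten images:

# Print the actual labels for the first ten test images
print(test_lbl[0:10])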

Finally, we print the accuracy of the predictions versus the test data.

# Assign and print accuracy score
score = lr.score(test_img, test_lbl)
print(score)
0.9127

Our logistic regression model achieved 91% accuracy.

Summary

We have learned that linear regression is a supervised machine learning technique, using labelled data to predict values which have a linear relationship with the input data. Due to its simplicity, linear regression is not suitable for complex tasks, nor is it suited to classification problems, due to its sensitivity to outliers and to large differences between input values and the prediction output. We implemented a linear regression model in scikit-learn to predict the price of houses in Boston, learning a number of functions along the way.

We then learned a more flexible technique, polynomial regression, which allows us to find a line of best fit on datasets which are not suited to straight lines. The scikit-learn implementation uses the LinearRegression() and PolynomialFeatures() functions to perform this.

Finally, we learned about logistic regression, a technique which allows us to work with classification problems, which are typically not suited to techniques such as linear regression. We processed the MNIST dataset to classify hand-drawn digits.

On the next page, we learn about neural networks and how they are trained.


References

  1. Kaggle (2020a) Boston Housing: Housing Values in Suburbs of Boston. https://www.kaggle.com/c/boston-housing
  2. Kaggle (2020b) Predict number using Logistic Regression with 92% https://www.kaggle.com/pranjalsrv7/predict-number-using-logistic-regression-with-92
  3. LeCun, Y., Cortes, C., and Burges, C. (2002) The MNIST Dataset of handwritten digits. http://yann.lecun.com/exdb/mnist/
  4. OpenML (2014) mnist_784. https://www.openml.org/d/554
  5. Open Data StackExchange (2019) What does “lower status” mean in “Boston house prices dataset”? https://opendata.stackexchange.com/questions/15740/what-does-lower-status-mean-in-boston-house-prices-dataset
  6. Park, H., Sung-Oh, D., Lee, H., and Hoh, P. (2012) Cyber Weather Forecasting: Forecasting Unknown Internet Worms Using Randomness Analysis. Springer. https://doi.org/10.1007/978-3-642-30436-1_31
  7. Ray, S. (2019) A Quick Review of Machine Learning Algorithms. IEEE. https://ieeexplore.ieee.org/document/8862451
  8. Scikit-Learn (2020) sklearn.datasets.load_boston. Scikit-Learn Documentation. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_boston.html
  9. Scikit-Learn (2020) sklearn.datasets.fetch_openml. Scikit-Learn Documentation. https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_openml.html
  10. Scikit-Learn (2020) sklearn.linear_model.LogisticRegression. Scikit-Learn Documentation. https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
  11. Scikit-Learn (2020) Linear Models: Logistic Regression. Scikit-Learn Documentation. https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
