In this section we will explore a small sample of the UGR’16 dataset together. UGR’16 contains network traffic information and is designed to test modern Intrusion Detection Systems (IDS).
This page and the next go into detail about useful data manipulation utilities in the Python libraries Pandas and Scikit-learn. These libraries are bundled with Anaconda, so if you have followed the Environment Setup tutorial you will already have them installed.
- Pandas is a data analysis library for Python which provides a wide range of tools and functions for data analysis, including data preparation.
- Scikit-learn is a machine learning library which includes a number of preprocessing tools that we will learn to use below.
Get the Data
The UGR’16 dataset is huge, much like all of the datasets listed in Existing Datasets & Data Sources. The ‘July Week 5’ CSV file is 52 GB uncompressed, so I have provided the first 5,000 rows for us to look at. That sample is only 483 KB.
You can upload files into Jupyter Notebook using the ‘upload’ button on the right.
You can also download a dataset with code, which makes the step reusable and easy to automate. We will use this method throughout the course, and you will build up a series of these code ‘tools’ that will be useful for your own projects. The script below downloads the dataset into the working directory of Jupyter Notebook, so you can access it from within a notebook. The code keeps the repository URL and the filenames in separate variables, which makes it slightly easier to reuse.
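import requests

# Base URL of the repository that hosts the dataset
DOWNLOAD_REPO = "https://raw.githubusercontent.com/krisbolton/machine-learning-for-security/master/"
# Full URL of the file to download
DOWNLOAD_FILENAME = DOWNLOAD_REPO + "ugr16-july-week5-first5k.csv"
# Name the file is saved under locally
DATASET_FILENAME = "ugr16-july-week5-first5k.csv"

response = requests.get(DOWNLOAD_FILENAME)
response.raise_for_status()  # stop here if the request failed

# Write the downloaded bytes to a local file
with open(DATASET_FILENAME, "wb") as f:
    f.write(response.content)

print("Download complete.")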
DOWNLOAD_REPO is the URL of a repository containing datasets. DOWNLOAD_FILENAME combines that URL with the name of the file we want, giving the full download address. DATASET_FILENAME is the name the file is saved under locally. We then use the requests library to fetch the dataset, check for HTTP errors with .raise_for_status(), open a local file in binary write mode with open(), write the content of the response to it with write(), and finally print a message so we know when it’s done.
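The download steps described above can also be wrapped in a small reusable function. The sketch below is one possible arrangement and is not part of the original script: the function name fetch_dataset and its overwrite and chunk_size parameters are assumptions made for illustration. It skips the request when the file already exists locally and streams the response to disk in chunks instead of holding it all in memory.

```python
from pathlib import Path

import requests


def fetch_dataset(url, filename, overwrite=False, chunk_size=8192):
    """Download url to filename and return the local path.

    Hypothetical helper: skips the request when the file is already present.
    """
    path = Path(filename)
    if path.exists() and not overwrite:
        return path  # reuse the local copy, no network request is made

    # stream=True avoids loading the whole response into memory at once
    with requests.get(url, stream=True, timeout=30) as response:
        response.raise_for_status()  # raise an exception on HTTP errors
        with open(path, "wb") as f:
            for chunk in response.iter_content(chunk_size=chunk_size):
                f.write(chunk)
    return path
```

Calling fetch_dataset(DOWNLOAD_FILENAME, DATASET_FILENAME) would fetch the same file as the script above, and running the cell a second time would not download it again.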
Explore the Data
Now that we have the CSV file ‘ugr16-july-week5-first5k’ in Jupyter, we can use Pandas to read it and start exploring. A particularly useful piece of Pandas functionality is the dataframe, a data structure containing rows and columns.
Get an Overview of a Dataset
Let’s get a summary of the data within a dataset.
We import pandas under the alias pd (a convention), read the contents of the CSV file into the variable df (short for dataframe, another convention) and call the info() method on df, which provides a basic summary of the dataframe.
import pandas as pd

df = pd.read_csv("ugr16-july-week5-first5k.csv")
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4999 entries, 0 to 4998
Data columns (total 13 columns):
2016-07-27 13:43:21    4999 non-null object
48.380                 4999 non-null float64
188.8.131.52           4999 non-null object
184.108.40.206         4999 non-null object
53                     4999 non-null int64
53.1                   4999 non-null int64
UDP                    4999 non-null object
.A....                 4999 non-null object
0                      4999 non-null int64
0.1                    4999 non-null int64
2                      4999 non-null int64
209                    4999 non-null int64
background             4999 non-null object
dtypes: float64(1), int64(6), object(6)
memory usage: 507.8+ KB
info() lists each column of the dataset as a row in its output, alongside some general information: the number of entries, the number of columns (13) and the memory usage. The index runs from 0 to 4998, which is 4,999 entries rather than 5,000, because read_csv() treats the first row of the file as column headings by default. The dataset creators did not include headings, so the first record appears in the heading position instead. Names such as 53.1 and 0.1 appear because pandas adds a suffix to duplicated headings. For each column, info() shows the heading, then the number of non-null values (4999 non-null means no values are missing), then the data type. More information can be found in the pandas documentation for the info() method.
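The info() output above reports 4999 entries for a file of 5,000 rows. The reason can be reproduced on a tiny in-memory CSV. This is a minimal sketch, not the course dataset: the first row is the record that appears as headings in the info() output, and the second row is invented for illustration. It shows that read_csv() consumes the first row as headings by default, while header=None keeps every row as data.

```python
import io

import pandas as pd

# Two-row CSV without a heading row. The first row is copied from the
# info() output above; the second row is invented for this example.
csv_text = (
    "2016-07-27 13:43:21,48.380,188.8.131.52,184.108.40.206,53,53,UDP,.A....,0,0,2,209,background\n"
    "2016-07-27 13:43:22,0.000,192.0.2.10,192.0.2.20,80,443,TCP,.AP...,0,0,1,40,background\n"
)

# Default behaviour: the first row is consumed as column headings
df_default = pd.read_csv(io.StringIO(csv_text))
print(len(df_default))                 # 1
print(list(df_default.columns)[4:6])   # ['53', '53.1']

# header=None keeps every row as data and numbers the columns 0 to 12
df_all = pd.read_csv(io.StringIO(csv_text), header=None)
print(len(df_all))                     # 2
```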
Visual Overview of Numerical Data
Let’s visualise the numerical data within the dataset using the dataframe’s hist() method, which draws a histogram for each numerical column with matplotlib.
import matplotlib.pyplot as plt

df.hist(bins=50, figsize=(30,15))
plt.show()
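hist() draws one histogram per numeric column and skips text columns such as IP addresses and labels. As a rough sketch, select_dtypes() can list the numeric columns before the figure is drawn. The small dataframe and its column names are invented for illustration.

```python
import pandas as pd

# Invented sample with a mix of numeric and text columns
sample = pd.DataFrame({
    "Duration": [48.38, 0.0, 1.25],
    "Source IP": ["192.0.2.10", "192.0.2.11", "192.0.2.12"],
    "Packets": [2, 1, 7],
    "Label": ["background", "background", "background"],
})

# Columns with a numeric dtype are the ones hist() will plot here
numeric_columns = sample.select_dtypes(include="number").columns.tolist()
print(numeric_columns)  # ['Duration', 'Packets']
```

On the course dataframe, the info() output lists one float64 and six int64 columns, so seven histograms would be expected.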
View Data from the Dataset
So far we have used two methods to get an overview of the data within the dataset. Let’s actually view some of the data.
View Data Snippets
The head() method returns the first n rows of a dataset. The default is 5, but you can pass a different value within the parentheses (e.g. head(50) for the first 50). The tail() method shows entries from the end of a dataset. Viewing snippets like this lets us see the actual values within our dataset without viewing the whole thing. With datasets in the order of gigabytes, opening such large files can be a task in itself.
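A minimal sketch of head() and tail() on a stand-in dataframe follows. The ten-row frame and its column name are invented so the example runs without the dataset.

```python
import pandas as pd

# Stand-in dataframe with ten rows, invented for illustration
sample = pd.DataFrame({"Packets": range(10)})

first_five = sample.head()    # default: first 5 rows
first_three = sample.head(3)  # first 3 rows
last_five = sample.tail()     # default: last 5 rows

print(first_three["Packets"].tolist())  # [0, 1, 2]
print(last_five["Packets"].tolist())    # [5, 6, 7, 8, 9]
```

On the course dataset the equivalent calls are df.head() and df.tail().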
You may need to scroll right to see all of the columns in the table below.
First five rows of the UGR16 Dataset.
The UGR’16 dataset is a NetFlow capture of real traffic from an ISP combined with synthetic attack data. When you view the dataset you’ll notice it has no column headings, just data. To figure out what we’re looking at, we turn to the research paper which first presented the UGR’16 dataset. In the section describing their creation methodology, the authors name the tool they used (nfdump) and describe the data represented in each column:
- Date and time of the end of a flow
- Duration of the flow
- Source IP Address
- Destination IP Address
- Source Port
- Destination Port
- Protocol
- Flags
- Forwarding status
- Type of Service (ToS) byte
- Packets exchanged
- Bytes exchanged
- Label (for example, background)
Add Column Headings
We can add column headings to the dataframe to aid our understanding as we examine the dataset. These need to be removed later when we feed our data into our machine learning algorithm. Below, we assign the names of the 13 columns to the dataframe as a list. Note that this replaces the first record, which pandas had read as headings.
df.columns = ['Date time', 'Duration', 'Source IP', 'Destination IP',
              'Source Port', 'Destination Port', 'Protocol', 'Flag',
              'Forwarding status', 'ToS', 'Packets', 'Bytes', 'Label']
df.head()
Column headings added to UGR16 Dataset dataframe.
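Because read_csv() takes the first row as headings by default, assigning df.columns afterwards replaces that first record rather than restoring it. One alternative, sketched here under the assumption that keeping all 5,000 rows matters, is to supply the names when the file is read. The two sample rows are stand-ins: the first is taken from the info() output earlier on this page and the second is invented.

```python
import io

import pandas as pd

COLUMN_NAMES = ['Date time', 'Duration', 'Source IP', 'Destination IP',
                'Source Port', 'Destination Port', 'Protocol', 'Flag',
                'Forwarding status', 'ToS', 'Packets', 'Bytes', 'Label']

# Stand-in for the CSV file: no heading row, 13 comma-separated fields
csv_text = (
    "2016-07-27 13:43:21,48.380,188.8.131.52,184.108.40.206,53,53,UDP,.A....,0,0,2,209,background\n"
    "2016-07-27 13:43:22,0.000,192.0.2.10,192.0.2.20,80,443,TCP,.AP...,0,0,1,40,background\n"
)

# header=None: the file has no heading row; names supplies the headings
df_named = pd.read_csv(io.StringIO(csv_text), header=None, names=COLUMN_NAMES)

print(len(df_named))                # 2, no row was used up as headings
print(df_named.loc[0, "Protocol"])  # UDP
```

With the real file, the first argument would be the filename "ugr16-july-week5-first5k.csv" instead of the StringIO object.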
This page has described practical skills for exploring datasets, allowing you to view and understand the data you wish to work with. Exploring datasets is a key stage in any project, and the skill is used to decide whether a dataset is appropriate for your needs. Once you have decided to move forward with a specific dataset, exploring the data shows you what needs to be altered and fixed. It’s highly unlikely any dataset you choose will be perfect for your chosen project.
The next page discusses the skills and techniques you can employ to make these transformations and fixes to your chosen dataset.
- Maciá-Fernández, G., Camacho, J., Magán-Carrión, R., García-Teodoro, P., and Therón, R. (2018) UGR'16: A new dataset for the evaluation of cyclostationarity-based network IDSs. Elsevier. https://www.sciencedirect.com/science/article/pii/S0167404817302353
- McKinney, W. (2017) Python for Data Analysis. O'Reilly Media. https://www.amazon.com/Python-Data-Analysis-Wes-Mckinney/dp/1491957662
- NFDUMP (2014) NFDUMP Overview. http://nfdump.sourceforge.net
- University of Granada (2016) UGR'16: A New Dataset for the Evaluation of Cyclostationarity-Based Network IDSs. https://nesg.ugr.es/nesg-ugr16/