
Introduction


Nowadays, machine learning applications are very popular in the cyber security domain. Since I started the Applied Machine Learning in Python course, I have been trying to apply machine learning algorithms to infosec cases. In this article, I focus on detecting DoS/DDoS attacks with the simple yet remarkably accurate K-Nearest-Neighbors algorithm.

I coded this project in Python with commonly used libraries such as pandas and scikit-learn.

I used the KDDCup99 dataset for training. Normally, this dataset shouldn't be used to train real systems [1], but since this study is for learning purposes, I used it anyway.

KDDCup99 Dataset


The K99 dataset was created by DARPA in 1998 and used at the KDD-CUP competition in 1999, which is why it's named KDDCup99. The dataset contains 41 features. Some of these features are extracted from network packets, but 10 of them are host-based information that can only be gained from the compromised hosts, for example: num_failed_logins, logged_in, root_shell, and num_file_creations.

I deleted these host-based features, because I only inspect network packets for detection.

Part of the dataset is shown below.

duration,protocol_type,service,flag,src_bytes,dst_bytes,land,wrong_fragment,urgent,hot,num_failed_logins,logged_in,num_compromised,root_shell,su_attempted,num_root,num_file_creations,num_shells,num_access_files,num_outbound_cmds,is_host_login,is_guest_login,count,srv_count,serror_rate,srv_serror_rate,rerror_rate,srv_rerror_rate,same_srv_rate,diff_srv_rate,srv_diff_host_rate,dst_host_count,dst_host_srv_count,dst_host_same_srv_rate,dst_host_diff_srv_rate,dst_host_same_src_port_rate,dst_host_srv_diff_host_rate,dst_host_serror_rate,dst_host_srv_serror_rate,dst_host_rerror_rate,dst_host_srv_rerror_rate,label
0,tcp,http,SF,181,5450,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,9,9,1.00,0.00,0.11,0.00,0.00,0.00,0.00,0.00,normal
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal
0,tcp,http,SF,235,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,29,29,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal
0,tcp,http,SF,219,1337,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,6,6,0.00,0.00,0.00,0.00,1.00,0.00,0.00,39,39,1.00,0.00,0.03,0.00,0.00,0.00,0.00,0.00,normal

Also, the dataset contains 22 different attack types as labels. The attack types are shown below.

- back
- buffer_overflow
- ftp_write
- guess_passwd
- imap
- ipsweep
- land
- loadmodule
- multihop
- neptune
- nmap
- perl
- phf
- pod
- portsweep
- rootkit
- satan
- smurf
- spy
- teardrop
- warezclient
- warezmaster
            

I aggregated and eliminated some of these labels, because we will only detect DoS/DDoS traffic. Nmap, buffer overflow, and other similar attack types are out of our scope.

As I said before, the K99 dataset shouldn't be used with a real network IDS, because it contains some obsolete attacks and host-based features. Those features can't be extracted from network packets by a network intrusion detection system.

Machine Learning Phases


The general machine learning process has four stages: preprocessing the data, building the model, evaluating it, and predicting on new data.

Preprocessing Data


In this phase, the dataset must be analyzed and visualized carefully. If there are errors or missing values in the dataset, there are two common options (sketched below):

  1. If one feature contains too many of these errors, that feature can be deleted.
  2. The erroneous values can be replaced with the mean of the attribute.
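A minimal pandas sketch of both options; the DataFrame and its columns are made up for illustration:

import pandas as pd

# Hypothetical data: feature "b" is mostly missing, feature "a" has one gap
df = pd.DataFrame({"a": [1.0, None, 3.0, 4.0], "b": [None, None, None, 2.0]})

# Option 1: delete a feature with too many missing values
df = df.drop(columns=["b"])

# Option 2: replace the missing values with the mean of the attribute
df["a"] = df["a"].fillna(df["a"].mean())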

After data cleaning, the values must be normalized. Many techniques are used for data normalization; they should be examined carefully and the appropriate method for the dataset should be selected. For example, min-max normalization:

# min-max normalization for a hypothetical "mass" attribute
mass = [50.0, 75.0, 100.0]  # example values
min_value = min(mass)
max_value = max(mass)
normalized = [(sample - min_value) / (max_value - min_value) for sample in mass]

The K99 dataset is already preprocessed, normalized, and well cleaned, so it doesn't contain any errors and I skipped these steps.

We may also need to change feature types for our algorithm. For instance, the K99 dataset contains categorical variables (such as protocol_type), but categorical values can't be used with the KNN algorithm directly, so we need to delete them or transform them into numerical values. The categorical features in the dataset are protocol_type, service, flag, and land.

Only the land feature is already expressed with numerical values, so I used it as-is. I transformed the other three features into numerical values. For this purpose, I found all the distinct values with the command below and numbered them starting from 0. I did the same transformation for the other two features, as in the sketch after the table below.

cat kddcup_data | cut -d"," -f2 | sort | uniq
icmp
tcp
udp
Protocol Type   Value
ICMP            0
TCP             1
UDP             2
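A minimal pandas sketch of this transformation; the file name kddcup_data.csv is an assumption, and building the service and flag mappings from sorted distinct values is my illustration, not necessarily the author's exact script:

import pandas as pd

data = pd.read_csv("kddcup_data.csv")  # assumed file name

# Fixed mapping for protocol_type, matching the table above
data["protocol_type"] = data["protocol_type"].map({"icmp": 0, "tcp": 1, "udp": 2})

# Same idea for the other two categorical features: number each
# distinct value starting from 0
for column in ("service", "flag"):
    values = sorted(data[column].unique())
    data[column] = data[column].map({v: i for i, v in enumerate(values)})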

I automated all of the work up to this stage with this script.

Afterwards, we may select the most valuable features of the dataset. The feature selection process should be done by a domain expert; a good selection yields a more accurate model and faster prediction.

First, we should analyze the features in terms of usability. The K99 dataset has host-based features, and these aren't usable for our case, so I omitted them (columns 10-22).

Second, we might use a feature selection algorithm or PCA (Principal Component Analysis) to combine and reduce features. I used the SelectKBest method in this study. SelectKBest scores every feature with a univariate statistical test and keeps the K highest-scoring features; the scoring metric can be chosen to fit the data (e.g. chi2, f_classif, mutual_info_classif [2]).

I tested the model with the full feature set and with the 5 selected features. The model accuracy is almost the same, but the evaluation time differs considerably. The test code is here.

frkn@frkn:~/Desktop/applied_ml$ python knn.py
Testing with full data
[+] Classifier trained in 3.27163791656
[+] Model Evaluated in 4.10666203499
[!] Test score is 0.999297772534
-------------------------------------------------
Testing with selected features
[+] Selected features
[-->] ['duration', 'src_bytes', 'dst_bytes', 'count', 'dst_host_srv_count']
[+] Classifier trained in 1.65853691101
[+] Model Evaluated in 0.355732917786
[!] Test score is 0.999044970647
-------------------------------------------------

Building Model


The second stage of the machine learning process is to train the model with the preprocessed data. With this model we will classify previously unseen data. As I said before, I created the model with the K-Nearest-Neighbors algorithm.

K-Nearest-Neighbors Algorithm


KNN is a supervised machine learning algorithm, so it needs labeled data to build a model. KNN classifies a sample according to the classes of the K points nearest to it; this is called majority voting. It's simple, but incredibly powerful.



As seen in the figure above, the class of the new sample is class 1 when k equals 1. If k equals 3, then the sample's class becomes class 2.

Normally, the time complexity for training the KNN algorithm is O(1), since it just copies all the data into a generic array. But this way, the prediction complexity is approximately O(kdn): k is the neighbor count, d the feature dimension, and n the training sample size. In the prediction phase, the algorithm compares the new sample with all the data and selects the k nearest points, which takes a lot of time with large datasets. In scikit-learn, however, KNN uses a kd-tree or ball-tree data structure by default instead of an array. These data structures decrease prediction time significantly while increasing training time a little: building a kd-tree takes O(n log n), and prediction takes O(k log n) on average. [3]
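A small sketch to observe the difference yourself; the data is synthetic and the timings are only illustrative:

import time
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Synthetic data: 50,000 samples with 5 numeric features and binary labels
X = np.random.rand(50000, 5)
y = np.random.randint(0, 2, 50000)

for algorithm in ("brute", "kd_tree"):
    knn = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm)
    knn.fit(X, y)
    start = time.time()
    knn.predict(X[:1000])  # classify 1,000 samples
    print(algorithm, "prediction took", time.time() - start, "seconds")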

I also implemented the KNN algorithm in the naive way, but it is very slow on large datasets in the testing phase. The code is here.
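For reference, a minimal sketch of the naive approach (my own illustration, not the linked script):

import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, sample, k=5):
    # X_train: numpy array of shape (n, d); y_train: numpy array of n labels
    # Euclidean distance from the sample to every training point: O(n*d)
    distances = np.sqrt(((X_train - sample) ** 2).sum(axis=1))
    # Indexes of the k nearest training points
    nearest = np.argsort(distances)[:k]
    # Majority vote among the labels of the k nearest points
    return Counter(y_train[nearest]).most_common(1)[0][0]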

Implementation


Implementation is also simple with scikit-learn. First, we need to install and import the libraries.

import pandas as pd
from sklearn.model_selection import train_test_split #for creating train and test dataset from all data
from sklearn.neighbors import KNeighborsClassifier #scikit-learn KNN class
from sklearn.feature_selection import SelectKBest #for selecting features
from sklearn.feature_selection import chi2 #success metric for select-k-best

Next, we need to read the data and split it into training features and class labels. The dataset file must contain the feature names as a header; pandas' read function reads these names and stores them as the keys of a dictionary-like structure.

def get_features(data):
  features = []
  for key in data.keys():
    features.append(key)
  features.remove("label")  # remove the class label from the feature list
  return features

data = pd.read_csv(filename)  # filename: path to the preprocessed dataset
features = get_features(data)
X = data[features]  # training data
y = data["label"]   # class labels

Now we need to select the best features.

selector = SelectKBest(score_func=chi2, k=5)  # selector instance with the chi2 metric
selector.fit(X, y)  # score every feature against the labels
indexes_selected = selector.get_support(indices=True)  # indexes of the selected features
selected_features = []  # will hold the selected feature names
for i in indexes_selected:
  selected_features.append(features[i])

X = data[selected_features]  # new training data with only the selected features

Since we train and test the model with the same dataset, we need to split it into two parts. After this split, we can create a classifier instance, train it, and evaluate its score on the test set.

X_train, X_test, y_train, y_test = train_test_split(X, y)  # split data into train and test parts
knn = KNeighborsClassifier(n_neighbors=5)  # classifier instance with 5 neighbors
knn.fit(X_train, y_train)  # train the model
score = knn.score(X_test, y_test)  # evaluate the model on the test set

Complete script is here.

Testing with Real Data


The model score is about 0.99, but this value seems too good to be true, and I don't trust the K99 dataset very much. So I tried to test the model with some real data. I couldn't extract the same features from pcaps well myself, but after a long search I found the KDDCUP99 Feature Extractor script. I compiled this C++ code with JetBrains CLion, which made building it pretty easy.

First, I started a SYN flood with hping3 and recorded the traffic with Wireshark.

hping3 -S IP --flood

Afterwards, I recorded some normal traffic: visiting Facebook, telnet, a file download, etc. I used the kddcup99 extractor to extract data from the pcaps and preprocessed the extracted data with this script. After these steps, I tested the model with this data. Surprisingly, it works like a charm.
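A hedged sketch of this test step, assuming knn is the trained classifier and packets holds the preprocessed feature rows extracted from the pcap; this is not the author's exact script:

from collections import Counter

predictions = knn.predict(packets)   # classify every extracted sample
result = dict(Counter(predictions))  # tally the predictions per class label
print(len(packets))
print(result)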

There were 65537 samples in the DoS attack data, some of them SYN packets and some of them RST-ACK answer packets. The classifier labeled 65089 packets as a DoS attack and 448 packets as normal.

> print len(packets)
65537
> print result
{'dos': 65089, 'normal': 448}

I also tested the model with a normal-traffic pcap. The extracted data has 127 samples, and the classifier labeled all of them as normal.

> print len(packets)
127
> print result
{'normal': 127}

These test results are very good, but they may be unreliable. Please create your own model and test it with different pcaps. If there is a mistake, please contact me and we'll fix it together.

Note: I used KNN again in this Kaggle competition because I didn't trust these results very much. But surprisingly, the success rate was about 0.98 again 😃

Future Work


I want to work on this project a little more. Maybe I will implement a kddcup99 feature extractor with Scapy and build a real-time intrusion detection system. I also want to try other machine learning algorithms in this field. If you are interested, please contact me.

References


[1] KDD Cup '99 Dataset Considered Harmful

[2] Feature Selection with Scikit-learn

[3] Time Complexity of KNN

[*] Data Preprocessing

Bye