### Malicious URL Detection with a Markov Model

Original work is here. This post is just a reproduction.

### Introduction

Executable files, registry keys, and C&C domains created by malicious software are usually named with randomly generated strings. The main purpose of this technique is to make malware analysis more difficult.

Most malware uses different, random URLs for C&C communication, generating them with a specific and complicated DGA (Domain Generation Algorithm). This way, the malware produces different URLs each time it runs, which makes it very difficult to detect and block all possible domain traffic.

For example, the payment URLs for the recent Petya malware are as follows. [1]

```
mischapuk6hyrn72.onion
petya3jxfp2f7g3i.onion
petya3sen7dyko2n.onion
mischa5xyix2mrhd.onion
mischapuk6hyrn72.onion
```

There are also other URLs collected from threat intelligence feeds. [2]

```
yyttyytyt.net
vkjrosfox.com
sljjjupfgagolpg.ru
uftfesnodnjflwta.info
vxagtvsyqxtrfcm.com
wxphewjnfhlyyjj.net
xckjffnjivafxen.biz
```

It is easy for humans to tell that these URLs are random, but it is difficult for machines. We will try to solve this problem with Markov models.

### Markov Model

The Markov model was introduced by the Russian mathematician Andrei Andreyevich Markov.

The Markov model will be defined here in a very simple way, avoiding heavy mathematical equations. A Markov model is a stochastic process in which the transition from one state to another does not depend on previous states (memorylessness) but only on the current state.

For example, according to the graph above:

- The probability of transitioning from rainy to cloudy weather is 10%.
- The probability of transitioning from rainy to sunny weather is 40%.
- Whether the first 10 hours of the day or only the first minute of the day were rainy has no effect on the next state.
- These probabilities have no relationship with their predecessors.
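The memorylessness described above can be sketched as a transition table. The rainy-to-cloudy and rainy-to-sunny probabilities come from the graph; the rainy-to-rainy value of 0.5 is an assumption so that the row sums to 1.

```python
import random

# Transition probabilities from the weather graph in the post;
# the rainy -> rainy value (0.5) is assumed so the row sums to 1.
transitions = {
    "rainy": {"cloudy": 0.1, "sunny": 0.4, "rainy": 0.5},
}

def next_state(current, rng=random):
    """Pick the next state using only the current state (memorylessness)."""
    states, probs = zip(*transitions[current].items())
    return rng.choices(states, weights=probs, k=1)[0]
```

Note that `next_state` never looks at how we arrived at the current state; that is exactly the Markov property.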

### Implementation

We will use the letter frequencies of English words to determine the likelihood of transitioning from one state to another. To find these frequencies, we first divide each word into n-grams and then record the frequency of the letter that follows each n-gram.

For example, the 2-grams of the word **computer** are as follows.

2-gram | Subsequent Letter
---|---
co | m
om | p
mp | u
pu | t
ut | e
te | r
We repeat this process for every word in the dataset. The results for a simple dataset are as follows.

```
#DATASET
computer
accomplished

#RESULT
# computer   accomplished   merged counts
co -> m      ac -> c        co -> {m:2}
om -> p      cc -> o        om -> {p:2}
mp -> u      co -> m        mp -> {u:1,l:1}
pu -> t      om -> p        pu -> {t:1}
ut -> e      mp -> l        ut -> {e:1}
te -> r      pl -> i        te -> {r:1}
             li -> s        ac -> {c:1}
             is -> h        cc -> {o:1}
             sh -> e        pl -> {i:1}
             he -> d        li -> {s:1}
                            is -> {h:1}
                            sh -> {e:1}
                            he -> {d:1}
```
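Aggregating the counts across the dataset can be sketched like this (a minimal illustration; `build_counts` is my name for it, not the original script's):

```python
from collections import defaultdict

def build_counts(words, n=2):
    """Count how often each letter follows each n-gram across the dataset."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        for i in range(len(word) - n):
            counts[word[i:i + n]][word[i + n]] += 1
    return counts

counts = build_counts(["computer", "accomplished"])
# "co" is followed by "m" in both words, so counts["co"] == {"m": 2},
# while "mp" splits between the two words: counts["mp"] == {"u": 1, "l": 1}
```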

After these steps, the proportion of each letter is calculated like this.

```
selected 2-gram -> mp {u:1,l:1}
prob(u) = freq(u) / (freq(u) + freq(l))
prob(u) = 1 / (1 + 1)
prob(u) = 0.5
after this calculation the 2-gram becomes --> mp {u:0.5,l:0.5}
```
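The normalization step above, applied to every n-gram, can be sketched as follows (again an assumed helper name, not code from the post):

```python
def normalize(counts):
    """Turn raw follow-letter counts into transition probabilities."""
    model = {}
    for gram, followers in counts.items():
        total = sum(followers.values())
        model[gram] = {letter: freq / total for letter, freq in followers.items()}
    return model

model = normalize({"mp": {"u": 1, "l": 1}, "co": {"m": 2}})
# model["mp"] == {"u": 0.5, "l": 0.5}, matching the worked example above
```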

A part of our model will be like this.

Once the model has been created, the following steps are repeated to decide whether a word is random.

- The word is divided into n-grams.
- The probability of the subsequent letter for each n-gram is looked up in the model.
- If the n-gram or the letter is not in the model, a predetermined value (0.000001) is used for that transition.
- The base-2 logarithms of these probabilities are summed.
- The word's evaluation value is calculated.
- If the evaluation value [3] is greater than the total log probability, the word is random; otherwise the word is legit.

```
unknown_transition_prob = 0.000001
evaluation_value = len(word) * log2(unknown_transition_prob) / 2.5
```
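The scoring steps above can be sketched as one function. This is a toy version: the post trains the model on a large English word list, whereas the test model here is built from a couple of words, so only clearly unseen strings score as random.

```python
from math import log2

UNKNOWN_PROB = 0.000001  # predetermined value for unseen transitions

def is_random(word, model, n=2):
    """Sum log2 transition probabilities and compare with the threshold."""
    total = 0.0
    for i in range(len(word) - n):
        gram, letter = word[i:i + n], word[i + n]
        total += log2(model.get(gram, {}).get(letter, UNKNOWN_PROB))
    # threshold formula from the post (see [3])
    evaluation = len(word) * log2(UNKNOWN_PROB) / 2.5
    return evaluation > total  # evaluation above the score -> random
```

For a 9-letter domain such as `vkjrosfox`, the evaluation value is `9 * log2(0.000001) / 2.5 ≈ -71.75`, matching the results table below.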

### Results

I generated the dataset from this document with a script I use for extracting words from a given document, and it works successfully for English. However, the algorithm does not work properly for other languages because of the dataset; in fact, mixing datasets in various languages hurts performance badly.

Domain | Probability | Evaluation Value | Status
---|---|---|---
yyttyytyt | -79.72 | -71.75 | Random
vkjrosfox | -105.20 | -71.75 | Random
sljjjupfgagolpg | -201.31 | -119.58 | Random
uftfesnodnjflwta | -279.04 | -127.56 | Random
vxagtvsyqxtrfcm | -259.11 | -119.58 | Random
wxphewjnfhlyyjj | -259.11 | -119.58 | Random
xckjffnjivafxen | -242.76 | -119.58 | Random
mischapuk6hyrn72 | -229.06 | -127.56 | Random
petya3sen7dyko2n | -245.15 | -127.56 | Random
mischa5xyix2mrhd | -229.06 | -127.56 | Random
mischapuk6hyrn72 | -229.06 | -127.56 | Random
petya3jxfp2f7g3i | -262.91 | -127.56 | Random
 | -13.69 | -63.78 | Legit
skype | -1.32 | -39.86 | Legit
cyberstruggle | -95.65 | -103.64 | Legit

### Demo

In the demo I used these domains (without TLDs) as the dataset, so results may vary from the table above.

The project repo is here.

### References and Notes

[1] Symantec Emerging Threat Report

[2] Alienvault Open Threat Exchange - Fortinet Blog Post about DGA

[3] I use @asselmant's formula for the evaluation value. This formula is suitable for 2-grams; for different values of n, the formula must be changed.

[*] Markov Chains Visually Explained

[*] Theory Behind Markov Chains