# synack.blog

### Malicious Url Detection with Markov Model

Original work is here. This post is just a reproduction.

### Introduction

The executable files, registery keys and c&c domains created by malicious software are usually named with random generated strings. Main purpose of this method is to make it difficult to analyzing malware.

Mostly, malwares use different and random urls for c&c communication. They generate these urls with spesific and complicated DGA(Domain Generation Algorithm). In this way, malware produces different urls each time it is run. So, this process makes it very difficult to detection and prevention all possible domain traffic.

For example, the payment urls for the latest Petya malware are as follows. 

``````mischapuk6hyrn72.onion
petya3jxfp2f7g3i.onion
petya3sen7dyko2n.onion
mischa5xyix2mrhd.onion
mischapuk6hyrn72.onion``````

Also there are other urls which collected from some threat intelligence feeds.

``````yyttyytyt.net
vkjrosfox.com
sljjjupfgagolpg.ru
uftfesnodnjflwta.info
vxagtvsyqxtrfcm.com
wxphewjnfhlyyjj.net
xckjffnjivafxen.biz``````

It is easy to understand whether these urls are random by humans , but it is difficult for the machines. We will try to solve this problem with markov models.

### Markov Model

Markov model was found by the Russian mathematician Andrei Andreyevich Markov.

The Markov model will be defined in a very simple way avoiding hard mathematical equations. Markov model is a stochastic process in which the transition from one state to another is not dependent on previous states(memorylessness) but only on the current state. For example, according the graph above.

• The probability of passing rainy to a cloudy weather is 10%.
• The probability of passing rainy to a sunny weather is 40%.
• If the first 10 hours of the day or the first 1 minute of the day are rainy, these states have no effect on the next situation.
• These possibilities have no relationship with their predecessors.

### Implementation

We will take advantage of the frequencies of the English words when determining the likelihood of transition from one situation to another. When we find these frequencies we will first divide the words into n-grams, and then we record the frequency of the subsequent letter.

For example, the 2-grams of the computer word are as follows.

2-gram    Subsequent Letter
com
omp
mpu
put
ute
ter

We will repeat this process for all words in dataset.Results for a simple dataset are as follows.

``````#DATASET
computer
accomplished

#RESULT

co -> m     ac -> c     co -> {m:2}
om -> p     cc -> o     om -> {p:2}
mp -> u     co -> m     mp -> {u:1,l:1}
pu -> t     om -> p     pu -> {t:1}
ut -> e     mp -> l     ut -> {e:1}
te -> r     pl -> i     te -> {r:1}
li -> s     ac -> {c:1}
is -> h     cc -> {o:1}
sh -> e     pl -> {i:1}
he -> d     li -> {s:1}
is -> {h:1}
sh -> {e:1}
he -> {d:1}``````

After these steps, proportion of each letter will be calculated like this.

``````selected 2-gram ->  mp {u:1,l:1}

prob(u) = freq(u) / (freq(u) + freq(l))
prob(u) = 1 / (1+1)
prob(u) = 0.5

after this calculations 2-ngram will be --> mp {u:0.5,l:0.5}``````

A part of our model will be like this. Once the model has been created, the following steps are repeated to calculate whether a word is random.

1. Word is divided into ngrams.
2. The probability of the subsequent letter for each ngram is obtained from the model.
3. If this ngram or letter is not in the model, predetermined value(0.000001) used for this transition.
4. Logarithms(with base 2) of these values are summed
5. Word's evaluation value is calculated.
6. If evaluation value greater than total probability, word is random otherwise word is legit.
``````unknown_transition_prob = 0.000001
evaluation value = len(word) * log(2,unknown_transition_prob) / 2.5``````

### Results

I generated dataset from this document with my script which i use for extracting words from given document. And it works successfully with English. But algorithm don't work properly for other languages because of dataset. Actually mixing dataset in various languages affects performance badly.

Domain Probability Evaluation Value Status
yyttyytyt-79.72-71.75Random
vkjrosfox-105.20-71.75Random
sljjjupfgagolpg-201.31-119.58Random
uftfesnodnjflwta-279.04-127.56Random
vxagtvsyqxtrfcm-259.11-119.58Random
wxphewjnfhlyyjj-259.11-119.58Random
xckjffnjivafxen-242.76-119.58Random
mischapuk6hyrn72-229.06-127.56Random
petya3sen7dyko2n-245.15-127.56Random
mischa5xyix2mrhd-229.06-127.56Random
mischapuk6hyrn72-229.06-127.56Random
petya3jxfp2f7g3i-262.91-127.56Random
skype-1.32-39.86Legit
cyberstruggle-95.65-103.64Legit

### Demo

I used these domains as dataset without tlds in demo. Results may vary according to the table above.

Project repo is here