Introduction
Malware is a persistent problem in the software world, and if it is detected early it can be mitigated. In this article, we will learn how to detect malware intrusions using machine learning. The basic approach is to extract various features from a training set and build a machine learning model on them. We will use the xgboost package for modelling and calculate the log loss error for different sample sizes.
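To make the idea concrete, here is a minimal sketch of fitting an xgboost classifier and scoring it with multi-class log loss. The feature matrix, labels, and their dimensions are placeholders, not the real extracted data.

```python
# Minimal sketch: fit an XGBoost classifier on placeholder malware features
# and score it with multi-class log loss. X and y are made-up stand-ins for
# the real extracted features and the 9 malware family labels.
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X = np.random.rand(500, 4000)           # hypothetical selected features
y = np.random.randint(0, 9, size=500)   # hypothetical family labels (0-8)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)

clf = xgb.XGBClassifier(objective="multi:softprob", n_estimators=200)
clf.fit(X_tr, y_tr)

print("log loss:", log_loss(y_te, clf.predict_proba(X_te), labels=list(range(9))))
```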
Shout-out to the Kaggle team that won the Microsoft competition for giving us a head-start for this project.
Architecture
Consider the architecture given above: we generate 4-gram byte, single-byte frequency, instruction count, function name, and derived assembly features. We also generate three golden features (opcode count, header count and asm pixel intensity). There are over 71k features in total and the feature matrix is very sparse, so we use a random forest to select useful features; around 4k features are finally retained. In this article, we also compare feature extraction time across different sample sizes.
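As a rough sketch of the selection step, a random forest can rank the features by importance and keep only the top ones. The dimensions below are scaled-down placeholders; the real pipeline goes from roughly 71k features down to about 4k.

```python
# Sketch of random-forest feature selection: rank the sparse features by
# importance and keep only the top ones. Dimensions are scaled-down toys;
# the real pipeline reduces ~71k features to ~4k.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

X_all = np.random.rand(500, 7100)       # placeholder for the full feature matrix
y = np.random.randint(0, 9, size=500)   # placeholder class labels

forest = RandomForestClassifier(n_estimators=100, n_jobs=-1, random_state=0)
selector = SelectFromModel(forest, max_features=400, threshold=-np.inf)
X_selected = selector.fit_transform(X_all, y)

print(X_selected.shape)   # (500, 400): only the highest-importance features remain
```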
We use a simple semi-supervised model to classify these malware samples into 9 classes based on these features. Cross-validation is used to curb overfitting, and log loss is the key metric to assess model efficiency. The architecture gives an overview of the approach: we start by extracting features, then move on to gradient boosting and further model development. The model learns to assign the training malware samples to different families based on the extracted features, so the model is trained efficiently.
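A hedged sketch of the cross-validation check, using stratified folds and log loss as the scoring metric; the data here is again a random placeholder.

```python
# Sketch of k-fold cross-validation with log loss as the metric, to keep an
# eye on overfitting. The data is a random placeholder for the selected
# feature matrix and family labels.
import numpy as np
import xgboost as xgb
from sklearn.model_selection import StratifiedKFold, cross_val_score

X = np.random.rand(500, 400)            # placeholder selected features
y = np.random.randint(0, 9, size=500)   # placeholder family labels

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
model = xgb.XGBClassifier(objective="multi:softprob", n_estimators=100)
scores = cross_val_score(model, X, y, cv=cv, scoring="neg_log_loss")

print("mean CV log loss:", -scores.mean())
```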
Building Models
The single model ties feature extraction to classification: once the features are extracted, the model classifies the malware accordingly. The training data is used to build the model, while the test data, made up of new malware files, is used to verify the model's efficiency and its log loss error. When the model comes across a new malware sample, the features are extracted and the sample is assigned to one of the existing classes.
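The scoring path for a new file might look roughly like this; `extract_features` is a hypothetical stand-in for the real feature-extraction scripts, and `clf` is a classifier trained as in the earlier sketch.

```python
# Sketch of classifying a previously unseen malware file: extract its
# features, then ask the trained model for per-family probabilities.
# extract_features() is a hypothetical placeholder for the real pipeline.
import numpy as np

def extract_features(sample_path):
    """Placeholder for the byte/asm feature extraction scripts."""
    return np.random.rand(1, 4000)

new_x = extract_features("unknown_sample.bytes")   # hypothetical file name
probs = clf.predict_proba(new_x)                   # clf: trained XGBClassifier
predicted_family = int(np.argmax(probs))

print("predicted family:", predicted_family)
```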
Results
Log Loss error
Figure 2 plots the log loss obtained against the number of samples used. The x-axis shows sample sizes of 50, 100, … up to 500, and the y-axis shows the log loss error from 0.5 to 2.5. The graph shows that the log loss error generally decreases as the sample size increases, apart from a rise between 150 and 250 samples, after which it decreases again from 250 to 500. This supports our analysis that the log loss error is minimised with more samples.
Train Processing time
Figure 3 shows the graph of processing time against the different sample sizes used; you can see that the increase is roughly exponential. We use sample sizes of 50, 100, 150, 250 and 300, measure the time taken to train on each, and plot the results to support our analysis. The x-axis shows the sample size and the y-axis shows the time taken to train on those samples.
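A rough way to reproduce a plot like Figure 3 is to time the training run at each sample size; the data below is a placeholder.

```python
# Sketch of the timing experiment behind Figure 3: train on increasing
# sample sizes and record the wall-clock time for each run (placeholder data).
import time
import numpy as np
import xgboost as xgb

sample_sizes = [50, 100, 150, 250, 300]
train_times = []

for n in sample_sizes:
    X = np.random.rand(n, 4000)
    y = np.random.randint(0, 9, size=n)
    start = time.perf_counter()
    xgb.XGBClassifier(n_estimators=100).fit(X, y)
    train_times.append(time.perf_counter() - start)

for n, t in zip(sample_sizes, train_times):
    print(f"{n} samples: {t:.2f} s")
```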
Feature Extraction time
Figure 4 plots feature extraction time (y-axis) against sample sizes from 50 to 500 (x-axis). You can see the gradual increase in extraction time as the sample size increases.
Optimisation
Initially, we implemented the basic approach: extracting the various features from the training set and then building a machine learning model. We noticed that executing the feature scripts this way consumed a lot of memory and time, so we came up with a solution that reduces both execution time and memory consumption.
Our solution is to execute the feature scripts in two separate instances. We chose two instances because some feature extraction scripts depend on each other, so we split the feature extraction by dependency. To validate the approach, we first ran the scripts separately, one by one, and recorded the processing time.
Then we ran them in parallel. We compared the processing time and RAM consumption of the feature extraction, and plotted graphs for the log-loss calculation and data flow. Execution time dropped by about 70% compared to the earlier method of feature extraction. We used Python RAM analysis to measure the RAM consumption for both serial and parallel execution of the feature scripts.
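A minimal sketch of the two-instance idea, assuming the feature scripts are standalone Python files (the script names below are hypothetical): scripts that depend on each other stay in the same group and run sequentially, while the two groups run in parallel processes.

```python
# Sketch of running the feature scripts as two independent groups in
# parallel. Scripts inside a group depend on each other, so they run in
# order; the groups themselves run in separate processes.
# The script names are hypothetical placeholders.
import subprocess
from concurrent.futures import ProcessPoolExecutor

def run_group(scripts):
    """Run one group of dependent feature scripts sequentially."""
    for script in scripts:
        subprocess.run(["python", script], check=True)

group_a = ["unique_gram.py", "byte_freq.py"]        # hypothetical group 1
group_b = ["opcount.py", "asm_pixel_intensity.py"]  # hypothetical group 2

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=2) as pool:
        # list() forces completion and surfaces any script failures
        list(pool.map(run_group, [group_a, group_b]))
```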
RAM Consumption
We calculated the RAM consumption for all the feature scripts, from unique-gram generation through to OP count generation, again using sample sizes ranging from 100 to 500. The unique-gram script consumes the most RAM, since generating n-gram features needs a lot of memory.
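One way to get per-script RAM figures (the exact tooling behind our "Python RAM analysis" isn't shown here, so treat this as an assumption) is to launch each script and poll its resident set size with psutil.

```python
# Sketch of measuring a feature script's peak RAM with psutil: launch the
# script, poll its resident set size until it exits, and report the maximum.
# This is an assumed approach; script names are placeholders.
import subprocess
import time
import psutil

def peak_rss_mb(script):
    """Return the peak resident set size (MB) observed while the script runs."""
    proc = subprocess.Popen(["python", script])
    ps = psutil.Process(proc.pid)
    peak = 0
    while proc.poll() is None:
        try:
            peak = max(peak, ps.memory_info().rss)
        except psutil.NoSuchProcess:
            break
        time.sleep(0.1)
    return peak / (1024 * 1024)

for script in ["unique_gram.py", "opcount.py"]:   # hypothetical script names
    print(script, f"{peak_rss_mb(script):.1f} MB")
```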
Processing Time
We calculated the processing time for all the feature scripts, from unique-gram generation through to OP count generation, again using sample sizes ranging from 100 to 500. The scripts run faster in parallel; the earlier, serial approach takes almost double the time. The longest-running script is the OP count, so when feature extraction runs in parallel the entire process takes about the same amount of time as executing the OP count script on its own.
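A toy illustration of that argument: serial execution time is roughly the sum of the per-script times, while parallel execution is bounded by the slowest script, which in our runs is the OP count. The per-script times below are made up.

```python
# Toy illustration: serial time is about the sum of the script times, while
# parallel time is about the slowest script (OP count here).
# The per-script times are made-up numbers, not measured values.
script_times_s = {"unique_gram": 120, "byte_freq": 45, "opcount": 180}

serial_total = sum(script_times_s.values())    # 345 s in this toy example
parallel_total = max(script_times_s.values())  # 180 s, dominated by opcount

print(f"serial: {serial_total} s, parallel: {parallel_total} s")
```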
Conclusion
This article covered the steps taken to detect malware intrusions using machine learning. We began with how to extract key features and build the machine learning model, then classified the training set into different malware families and evaluated the efficiency of the model. Finally, we proposed an optimisation in which the feature scripts run in parallel, reducing memory consumption and execution time considerably.