December 06, 2018 / by Students of the Center of Excellence (CoE) at VJTI in conjunction with DNIF / In ai-ml-cyber-security /

Modelling Behavioral Patterns Using Statistical Machine Learning Algorithms

It goes without saying that cyber threat and behavior analytics are an important part of cybersecurity. Today’s attackers are well aware of the static security measures employed against them, leaving security teams in a never-ending struggle to stay one step ahead. The demand of the time is a dynamic security system which can learn from existing attack patterns, predict an attacker’s next move and put preventive measures into place before an attacker has a chance to act maliciously. Moreover, by understanding how intruders interact with various honeypots, we can better understand their underlying intents. As Sun Tzu famously wrote in The Art of War, one must know one’s enemy before victory is certain.

As students at the Center of Excellence (CoE) lab at VJTI, we work on multiple projects. In this special project, we focused on modeling attackers’ behavioral patterns using statistical machine learning algorithms. Our team of 12 students includes B.Tech. and M.Tech. students alike, pursuing everything from electronics to telecommunications. The algorithms we developed can predict an attacker’s actions with an accuracy greater than 70% using a hidden Markov model.

We had the algorithm and we had the data, but we needed a versatile tool that could work with both static and streaming data and let us run our algorithm on it. At present we are working in a controlled environment, but we knew that real-world applications would need a platform able to ingest copious amounts of data in real time. We achieved this through our own ingenuity and with the help of the DNIF real-time data analytics platform. With DNIF’s flexibility, we were able to ingest the raw data, have it parsed and stored, apply our ML algorithm to it and create the visual representation we wanted, the results of which you are now viewing.

The results obtained can be used to generate attack propagation graphs on which network analytics can be performed. They can also be used to keep attackers trapped in a particular system, either to isolate them or to generate more logs. In other words, we trick the trickster by keeping attackers under the impression that they are succeeding.


The main source of data for the dashboard visualizations is a pair of honeypots set up in our lab. One is a spam mail filter, and the other is a fake file system which an attacker can interact with. Together they provide a good source of data for attacker profiling. Analyzing the steps that a user takes while interacting with a honeypot also helps to distinguish genuine users from malicious ones.

Figure A: Hits received globally by a honeypot

Figure A is a visual representation of hits received globally by a honeypot. As shown by the colored scale, the greatest number of hits came from China, closely followed by Russia. Mapping where hits originate is an important step in building attacker profiles.

Figure B: Interactions with a honeypot’s fake file system

Figure B shows the different actions that an attacker generally takes while interacting with the honeypot’s fake file system. These actions are taken to be the states of the Markov process underlying the hidden Markov model. They serve to build a profile of the attacker, and further analysis helps predict the most likely steps an attacker will take next. Once these steps are predicted, appropriate mitigation techniques can be put into place. The hidden Markov model predicts these steps with an accuracy greater than 70%.
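
To make the idea of treating actions as Markov states more concrete, here is a minimal sketch in Python. It is not our actual model: it uses a plain first-order Markov chain over observed actions rather than a full hidden Markov model, and the session data and action names (`login`, `ls`, `cd`, `cat`, `wget`) are hypothetical examples, not logs from our honeypots. It only illustrates how transition probabilities estimated from past sessions can suggest the most likely next step.

```python
from collections import Counter, defaultdict


def fit_transitions(sessions):
    """Estimate first-order transition probabilities from action sequences."""
    counts = defaultdict(Counter)
    for session in sessions:
        # Count every (current action -> next action) pair in the session.
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    # Normalise the counts for each state into probabilities.
    return {
        state: {nxt: n / sum(followers.values()) for nxt, n in followers.items()}
        for state, followers in counts.items()
    }


def predict_next(transitions, current):
    """Return the most probable next action, or None for an unseen state."""
    followers = transitions.get(current)
    if not followers:
        return None
    return max(followers, key=followers.get)


# Hypothetical attacker sessions, invented purely for illustration.
SESSIONS = [
    ["login", "ls", "cd", "ls", "cat"],
    ["login", "ls", "cat", "wget"],
    ["login", "cd", "ls", "cat"],
]

transitions = fit_transitions(SESSIONS)
print(predict_next(transitions, "ls"))
```

In this toy data, `ls` is followed by `cat` in three of four cases, so `cat` is the predicted next step. A hidden Markov model goes further by separating what is observed from the underlying state, but the principle of learning transition probabilities from past behavior is the same.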

This system is designed to keep an attacker occupied with fake file systems, enabling analysts to learn about the patterns, likely thought processes and final objectives of an attacker. This information helps analysts predict an attacker’s next move and take preemptive action to limit the damage an attacker can cause.

We intend to extend the capabilities of existing algorithms, making them more efficient and more accurate in their attack detection. We also intend to optimise the algorithms to run on resource-constrained devices and to develop deployable preventive measures for when attacks or malicious activities are detected in an environment. Our work to date has involved creating dashboards to depict the extent of various users’ interactions with a honeypot. These users are from around the world; through the dashboard, we can visually represent the country-wise distribution of hits on the honeypot. In addition, we highlight the distribution of the most likely steps taken by an attacker while interacting with the honeypot. This helps us to better profile attackers and implement suitable mitigation measures. These are some of the things we will cover in future blogs. Last but not least, we would like to thank the DNIF team for their constant support and guidance. Working together on this project has been a pleasure.