Use of Confusion Matrix in Cyber Security

Anupam Kumar Thakur
5 min read · Jun 6, 2021

What is Confusion Matrix?

It is one of the standard ways to evaluate the accuracy of a trained Machine Learning model on a binary classification task. It gets its name because it shows exactly where the model confuses one class with the other. It is basically a 2 * 2 matrix of counts, and each of its four cells has its own meaning and importance. So here I am going to discuss those 4 cells one by one.

  1. TN (True Negatives): The number of samples the model predicted as negative that were actually negative.
  2. FP (False Positives): Also called a Type I error. The number of samples the model predicted as positive that were actually negative.
  3. FN (False Negatives): Also called a Type II error. The number of samples the model predicted as negative that were actually positive.
  4. TP (True Positives): The number of samples the model predicted as positive that were actually positive.

NOTE: The meaning of positive and negative depends on the prediction task. Eg. if sensors, alarms, and lights are placed to detect attacks, you might treat a detected attack as positive, since that is the purpose the system was built for. What matters is picking one convention and applying it consistently.
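The four cells above can be counted directly from a list of actual and predicted labels. This is a minimal sketch with made-up illustrative data (1 = positive, 0 = negative):

```python
# Counting the four confusion-matrix cells by hand.
# The labels below are invented for illustration (1 = positive, 0 = negative).
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0, 1, 1]

tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)  # Type I error
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)  # Type II error

print(f"TP={tp} TN={tn} FP={fp} FN={fn}")
```

Note how the four counts always add up to the total number of samples, because every prediction falls into exactly one cell.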

First, we will understand it with an example; then we will see its use case in cybersecurity. Suppose there are 150 students in a class and a model has to predict who will pass and who will fail. The model predicted that 100 will pass and 50 will fail. But among those 100, only 80 actually passed and 20 failed. Here, passing is the positive outcome and failing is the negative one. So, by the definitions above:

TP = 80: Among the 100 predicted as passed, 80 actually passed.

FP = 20: Among the 100 predicted as passed, 20 did not pass (they failed). These were false predictions by the model.

Now, the model predicted that 50 will fail. Among those 50, 30 actually failed and 20 did not. Failing the exam is the negative outcome, so by the definitions above:

TN = 30: Among the 50 predicted as failed, 30 actually failed. These 30 students predicted as failed (negative) were true predictions by the model.

FN = 20: Among the 50 predicted as failed, 20 did not fail; the model predicted them falsely.

Among these 4, the FP (Type I error) is the most dangerous here. The model is passing students who did not actually pass. If this were a very high-level competitive exam for some specific post, that would be serious, because undeserving candidates would join the organization.

Formula of accuracy:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
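Plugging the student example into this formula gives a quick sanity check: the model got 80 + 30 = 110 of the 150 predictions right.

```python
# Accuracy for the 150-student example above.
tp, tn, fp, fn = 80, 30, 20, 20  # counts taken from the worked example

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"Accuracy = {accuracy:.2%}")  # 110 correct out of 150
```

Notice that accuracy alone hides which kind of error the model makes; the matrix itself is what tells you whether the mistakes are FPs or FNs.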

Use case of the confusion matrix in cybersecurity

We will understand this better through an example. Suppose a company has trained a machine learning model to detect cyberattacks on its server. Many people visit the server each day, and the model classifies every visit. If a visitor is an attacker, the model labels the visit 0 (negative); if the visitor is normal, it labels the visit 1 (positive). This is binary classification with two labels, 0 or 1. For security reasons, alarms, lights, and automatic buzzers are connected to this model, so that if it finds any suspicious user, all the connected equipment starts signalling. Suppose 1,000 people visited the server today and the model returned its predictions, from which a confusion matrix and the accuracy can be computed.

Detection records from the model:

The model predicted that 800 visitors were normal (positive). Among these 800, 650 actually were normal, i.e. true predictions (TP = 650). The remaining 150 were not normal, but the model predicted them as normal (FP = 150). As discussed above, false positives are the most dangerous here: these 150 attackers get into the server while the system remains unaware of them. They can damage the system, and the system will only learn about it after the damage is done.

Next, the model flagged 200 visitors as suspicious. The actual data shows that 150 of them really were attackers, detected truly by the model (TN = 150). The other 50 visitors were not actually attackers but were flagged as such (FN = 50). For these 50 visitors the alarms and buzzers go off, and the cybersecurity experts investigate, only to find normal users; still, they have to worry and check the system uselessly for some time.

Two of the four cases were fine: TP and TN. For TP there is no issue, because those visitors really were normal. For TN there is also no issue: those visitors were attackers, the alarms signalled, the systems were checked, and the attackers were blocked, keeping the system secure. The other two cases are the errors: FP (Type I error) and FN (Type II error). For FN, the cybersecurity engineers have the tedious job of checking the systems even though the visitors were normal people. But the most dangerous case is FP: visitors who are attackers are treated as normal people and the alarms stay silent. The model is letting the attackers do what they want, and no one will notice. In this case, the system might be compromised completely.
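The whole server-visit walkthrough can be condensed into a few lines of arithmetic. Besides accuracy, it is worth computing the share of real attackers the alarms missed, since that is exactly the FP danger described above:

```python
# The 1,000-visit server example, expressed with the confusion-matrix cells.
# Convention from the article: normal visitor = 1 (positive), attacker = 0 (negative).
tp = 650  # normal visitors correctly labelled normal
fp = 150  # attackers wrongly labelled normal -- the dangerous Type I errors
tn = 150  # attackers correctly flagged, alarms fired
fn = 50   # normal visitors wrongly flagged, causing useless checks

total = tp + fp + tn + fn
accuracy = (tp + tn) / total
missed_attack_rate = fp / (fp + tn)  # fraction of real attackers the alarms missed

print(f"Accuracy: {accuracy:.0%}")
print(f"Attacks missed: {missed_attack_rate:.0%}")
```

Even though the model is 80% accurate overall, it silently lets through half of the real attackers, which is why accuracy alone is a misleading metric for security systems.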

CONCLUSION

The paragraphs above show how a machine learning model can provide us with security, and also how it can be dangerous in some cases. Since no model is 100% accurate, it would be unwise to depend on model-based security completely. There must also be a manual process for rechecking the detections. I hope this has made the role of the confusion matrix in the cybersecurity world clear.

THANKS FOR READING IT!!!
