K-Means Clustering Use Cases in the Security Domain

Jul 20, 2021

What is clustering?

Clustering is the task of dividing data points into a number of groups, called clusters, such that data points in the same group are more similar to each other than to data points in other groups.

What is Unsupervised learning?

Unsupervised learning is a type of machine learning in which models are trained on an unlabeled dataset and are allowed to act on that data without any supervision.

What is K-Means Clustering?

K-means clustering is a popular and powerful unsupervised machine learning algorithm that tries to group similar items into clusters. The number of groups is represented by “K”.

How does K-Means Clustering work?

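K-means groups data points by how close they are to each cluster center, so it needs a way to measure distance. Common distance measures are: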

Euclidean distance measure
Manhattan distance measure
Squared Euclidean distance measure
Cosine distance measure
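
To make the four measures listed above concrete, here is a minimal sketch in plain Python. The function names and the sample points p and q are illustrative assumptions, not part of any particular library.

```python
import math

def euclidean(a, b):
    # Straight-line distance between two points.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def squared_euclidean(a, b):
    # Same as Euclidean but without the square root.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def manhattan(a, b):
    # Sum of absolute differences along each axis.
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_distance(a, b):
    # 1 minus the cosine of the angle between the two vectors.
    # Assumes neither vector is all zeros.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Made-up sample points for illustration only.
p, q = (1.0, 2.0), (4.0, 6.0)
print(euclidean(p, q))          # 5.0
print(squared_euclidean(p, q))  # 25.0
print(manhattan(p, q))          # 7.0
print(cosine_distance(p, q))    # small value: p and q point in similar directions
```

Euclidean and squared Euclidean distance always rank neighbors in the same order, so either can be used to find the closest centroid.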

K-Means Clustering Algorithm

The steps to form clusters are:

Step 1: Choose K random points as cluster centers, called centroids.

Step 2: Assign each data point to the closest cluster by calculating its Euclidean distance to each centroid.

Step 3: Identify new centroids by taking the average of the assigned points.

Step 4: Repeat steps 2 and 3 until convergence is achieved, that is, until the centroids no longer change.
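
The four steps above can be sketched in plain Python as follows. This is a minimal illustration under some assumptions, not a production implementation: the initial centroids are sampled from the data points themselves, an empty cluster keeps its previous centroid, and the names k_means, max_iters, and seed are hypothetical.

```python
import random

def squared_distance(a, b):
    # Squared Euclidean distance; gives the same "closest centroid" as Euclidean.
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_point(points):
    # Coordinate-wise average of a non-empty list of points.
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Step 1: choose K random data points as the initial centroids.
    centroids = rng.sample(points, k)
    labels = []
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Step 3: recompute each centroid as the average of its assigned points.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Assumption: an empty cluster keeps its old centroid.
            new_centroids.append(mean_point(members) if members else centroids[j])
        # Step 4: stop once the centroids no longer change.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, labels

# Made-up 2-D points forming two well-separated groups.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, labels = k_means(data, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can number the clusters differently, and on less cleanly separated data they can also lead to different final clusters.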

K-Means Clustering in the Security Domain

1. Identifying crime localities

When crime data is available for specific localities in a city, the category of crime, the area of crime, and the association between them give useful insight into crime-prone areas within the city or locality.

2. Cyber-profiling criminals

Cyber-profiling is the process of collecting data from individuals and groups to identify significant correlations. The idea of cyber-profiling is derived from criminal profiling, which provides information to the investigation division to classify the types of criminals who were at the crime scene.

3. Insurance fraud detection

Machine learning is used in fraud detection. Using historical data on fraudulent claims, it is possible to isolate new claims based on their proximity to clusters that indicate fraudulent patterns.
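
As a rough sketch of the proximity idea, the snippet below assigns a new claim to the nearest of two cluster centroids. Everything here is a hypothetical illustration: the centroid values, the two features (claim amount in thousands, days since the policy started), and the cluster names are made up, and in practice the centroids would come from running K-means on historical claims with suitably scaled features.

```python
import math

# Hypothetical centroids, as if learned earlier from historical claims.
# Features: (claim amount in thousands, days since policy start). Made-up values.
CENTROIDS = {
    "typical": (2.0, 400.0),
    "fraud_like": (15.0, 20.0),
}

def distance(a, b):
    # Euclidean distance between two feature vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_cluster(claim, centroids):
    # Return the name of the cluster whose centroid is closest to the claim.
    return min(centroids, key=lambda name: distance(claim, centroids[name]))

new_claim = (14.0, 30.0)  # made-up claim: large amount, very new policy
print(nearest_cluster(new_claim, CENTROIDS))  # fraud_like
```

A claim that lands near a fraud-like cluster is only a candidate for closer review, not proof of fraud.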

4. Automatic clustering of IT alerts

Components of large enterprise IT infrastructure, such as network, storage, or database systems, generate large volumes of alert messages. Because alert messages potentially point to operational issues, they must be manually screened and prioritized for downstream processes. Clustering the data can provide insight into categories of alerts and mean time to repair, and help in failure prediction.