DBSCAN algorithm in Python

In this tutorial, we will learn how we can implement and use the DBSCAN algorithm in Python.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that was first proposed in 1996. In 2014, it received the 'Test of Time' award at KDD, the data mining conference. This tutorial does not explain the algorithm itself and covers only its implementation in Python. Following the implementation still requires a basic idea of how DBSCAN works, so if you are not familiar with the algorithm, it is advisable to learn about it first.

Implementation of DBSCAN algorithm in Python

In this section, we implement the DBSCAN algorithm step by step so that it is easy to follow. The implementation uses a dataset on which we perform several operations, including the DBSCAN clustering itself. Before starting, we should fulfil the prerequisites for implementing the DBSCAN algorithm in a Python program.

Prerequisites for implementation of DBSCAN algorithm:

We have to fulfil the following prerequisites before we proceed with the implementation part of the DBSCAN algorithm in this section:

1. NumPy library: We should make sure that the numpy library is installed, preferably in a recent version, because we will apply its functions to the dataset used in the implementation. If numpy is not installed, we can install it with the following command in the command prompt or terminal:

pip install numpy

When we press the Enter key, pip starts installing the numpy library.

After some time, we will see that the numpy library has been installed successfully (if numpy is already present in the system, pip reports that the requirement is already satisfied).

2. pandas library: Like numpy, the pandas library is required. If it is not installed, we can install it with pip using the following command in the command prompt or terminal:

pip install pandas

3. matplotlib library: This library is also important for the implementation, because its functions help us display the results obtained from the dataset. If matplotlib is not installed, we can install it with pip using the following command in the command prompt or terminal:

pip install matplotlib

4. Sklearn library: The Sklearn (scikit-learn) library is one of the major requirements for the implementation, because we import several of its modules in the program, such as cluster, preprocessing and decomposition. We should therefore check whether it is installed, and if it is not, install it with pip using the following command in the command prompt or terminal (the package is named scikit-learn, although it is imported as sklearn):

pip install scikit-learn

5. Last but not least, we should also be aware of the DBSCAN algorithm (what it is and how it works), as we have discussed already, so that we can easily understand the implementation of it in Python.

Before moving forward, we should make sure that all the prerequisites listed above are fulfilled, so that we do not run into problems while following the implementation steps.
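
One way to confirm the prerequisites is a short script that tries to import each library and prints its version. This is only a sketch: the list of names and the "NOT INSTALLED" message are choices made here for illustration, not part of the original tutorial.

```python
# Check that each prerequisite can be imported and report its version.
import importlib

# Import names of the prerequisites (scikit-learn is imported as "sklearn")
required = ["numpy", "pandas", "matplotlib", "sklearn"]

versions = {}
for name in required:
    try:
        module = importlib.import_module(name)
        versions[name] = module.__version__
    except ImportError:
        # The library is missing and still has to be installed with pip
        versions[name] = None

for name, version in versions.items():
    status = version if version is not None else "NOT INSTALLED"
    print(f"{name}: {status}")
```

Any library reported as not installed can then be installed with the corresponding pip command from the list above.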

Implementation steps for the DBSCAN algorithm:

Now, we will implement the DBSCAN algorithm in Python. As mentioned earlier, we do this in steps so that the implementation does not become complex and remains easy to understand. We follow the steps below to implement the DBSCAN algorithm and its logic inside a Python program:

Step 1: Importing all the required libraries:

First, we import all the libraries installed in the prerequisites part, along with the specific modules we need from them, so that we can use their functions while implementing the DBSCAN algorithm:

# Importing numpy library as nmp
import numpy as nmp
# Importing pandas library as pds
import pandas as pds
# Importing matplotlib library as pplt
import matplotlib.pyplot as pplt
# Importing DBSCAN from cluster module of Sklearn library
from sklearn.cluster import DBSCAN
# Importing StandardScaler and normalize from preprocessing module of Sklearn library
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import normalize
# Importing PCA from decomposition module of Sklearn
from sklearn.decomposition import PCA

Step 2: Loading the Data:

In this step, we load the dataset that the DBSCAN algorithm will work on. To do this, we use the read_csv() function of the pandas library, drop the identifier column, fill in the missing values, and print the first rows of the dataset:

# Loading the data inside an initialized variable
M = pds.read_csv('sampleDataset.csv') # Path of dataset file
# Dropping the CUST_ID column from the dataset with drop() function
M = M.drop('CUST_ID', axis = 1)
# Forward-filling missing values with ffill(); fillna(method='ffill') is deprecated in recent pandas versions
M = M.ffill()
# Printing dataset head in output
print(M.head())

Output:

       BALANCE  BALANCE_FREQUENCY  ...  PRC_FULL_PAYMENT  TENURE
0    40.900749           0.818182  ...          0.000000      12
1  3202.467416           0.909091  ...          0.222222      12
2  2495.148862           1.000000  ...          0.000000      12
3  1666.670542           0.636364  ...          0.000000      12
4   817.714335           1.000000  ...          0.000000      12

[5 rows x 17 columns]

Running the program prints the first five rows of the dataset, as shown in the output above. The following steps work on this data from the dataset file we loaded.
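
To see what the forward fill in this step does, it can help to try it on a very small table first. The sketch below uses a hypothetical three-column DataFrame (the values are made up and only the column names echo the dataset) and counts the missing values before and after filling.

```python
import numpy as nmp
import pandas as pds

# Hypothetical toy data standing in for sampleDataset.csv
toy = pds.DataFrame({
    "CUST_ID": ["C1", "C2", "C3", "C4"],
    "BALANCE": [40.9, nmp.nan, 2495.1, 1666.7],
    "TENURE": [12, 12, nmp.nan, 12],
})

# Dropping the identifier column, as in the step above
toy = toy.drop("CUST_ID", axis=1)

# Counting missing values per column before filling
missing_before = toy.isna().sum()

# Forward fill: each missing value takes the last valid value above it
filled = toy.ffill()
missing_after = filled.isna().sum()

print(missing_before)
print(missing_after)
print(filled)
```

Forward fill copies values downwards only, so a missing value in the very first row of a column would remain missing and would need separate handling.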

Step 3: Preprocessing the data:

In this step, we preprocess the data using the functions of the preprocessing module of the Sklearn library. We first standardize the features, then normalize the samples, and finally convert the result back into a dataframe:

# Initializing a variable with the StandardScaler() function
scalerFD = StandardScaler()
# Standardizing each column of the dataset to zero mean and unit variance
M_scaled = scalerFD.fit_transform(M)
# Normalizing the scaled data with the normalize() function,
# which rescales each sample (row) to unit norm
M_normalized = normalize(M_scaled)
# Converting the resulting numpy array into a pandas dataframe
M_normalized = pds.DataFrame(M_normalized)
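
The effect of the two preprocessing calls can be checked on a small example. The following sketch uses a hypothetical 4 x 2 matrix (not the tutorial's dataset) and verifies that StandardScaler works column by column, while normalize() works row by row.

```python
import numpy as nmp
from sklearn.preprocessing import StandardScaler, normalize

# Hypothetical toy matrix: 4 samples, 2 features on very different scales
X = nmp.array([[1.0, 100.0],
               [2.0, 300.0],
               [3.0, 200.0],
               [4.0, 400.0]])

# StandardScaler: every column gets mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
col_means = X_scaled.mean(axis=0)
col_stds = X_scaled.std(axis=0)

# normalize: every row is divided by its Euclidean (L2) norm
X_normalized = normalize(X_scaled)
row_norms = nmp.linalg.norm(X_normalized, axis=1)

print("Column means after scaling:", col_means)
print("Column standard deviations after scaling:", col_stds)
print("Row norms after normalizing:", row_norms)
```

After normalization every sample lies on the unit circle (or unit hypersphere in higher dimensions), which is worth keeping in mind when choosing the eps distance for DBSCAN later.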

Step 4: Reduce the dimensionality of the data:

In this step, we reduce the dimensionality of the scaled and normalized data so that it can be visualized easily in a two-dimensional plot. We use the PCA class in the following way to transform the data and reduce it to two components:

# Initializing a variable with the PCA() function
pcaFD = PCA(n_components = 2) # Number of components to keep
# Transforming the normalized data with PCA
M_principal = pcaFD.fit_transform(M_normalized)
# Making a dataframe from the transformed data
M_principal = pds.DataFrame(M_principal)
# Naming the two columns of the transformed data
M_principal.columns = ['C1', 'C2']
# Printing the head of the transformed data
print(M_principal.head())

Output:

         C1        C2
0 -0.489949 -0.679976
1 -0.519099  0.544828
2  0.330633  0.268877
3 -0.481656 -0.097610
4 -0.563512 -0.482506

As we can see in the output, PCA has transformed the normalized data into two components, which appear as the two columns C1 and C2. After the transformation, we converted the result into a dataframe using the DataFrame() constructor of the pandas library.
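
Reducing the data to two components discards some information, and the explained_variance_ratio_ attribute of a fitted PCA object gives a rough idea of how much is kept. The sketch below illustrates this on hypothetical random data (generated here only for the example), not on the tutorial's dataset.

```python
import numpy as nmp
from sklearn.decomposition import PCA

rng = nmp.random.default_rng(0)

# Hypothetical data: 200 samples with 5 features, built so that most of
# the variance lies in 2 underlying directions plus a little noise
latent = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 5))
data = latent @ mixing + 0.05 * rng.normal(size=(200, 5))

# Reducing to two components, as in the step above
pca_demo = PCA(n_components=2)
reduced = pca_demo.fit_transform(data)

# Fraction of the total variance kept by the two components
retained = pca_demo.explained_variance_ratio_.sum()

print(reduced.shape)
print(f"Variance retained by 2 components: {retained:.3f}")
```

Applied to the tutorial's data, the same attribute on pcaFD would show how faithfully the two-dimensional plot represents the original 17 columns.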

Step 5: Build a clustering model:

This is the most important step of the implementation: we build a clustering model of the data by using the DBSCAN class of the Sklearn library, as shown below:

# Creating the clustering model with the DBSCAN class and its parameters:
# eps is the neighbourhood radius, min_samples the number of points needed for a dense region
db_default = DBSCAN(eps = 0.0375, min_samples = 3).fit(M_principal)
# Retrieving the cluster label of each data point (-1 marks noise points)
labeling = db_default.labels_
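
The labels_ array is easier to interpret with a few summary numbers. The sketch below fits DBSCAN on nine hypothetical points (two tight groups and one isolated point, chosen for illustration) and counts the clusters, the noise points and the core points.

```python
import numpy as nmp
from sklearn.cluster import DBSCAN

# Hypothetical toy points: two tight groups and one isolated point
points = nmp.array([
    [0.00, 0.00], [0.01, 0.00], [0.00, 0.01], [0.01, 0.01],
    [1.00, 1.00], [1.01, 1.00], [1.00, 1.01], [1.01, 1.01],
    [5.00, 5.00],
])

model = DBSCAN(eps=0.05, min_samples=3).fit(points)
labels = model.labels_

# Clusters are numbered 0, 1, 2, ...; the label -1 is reserved for noise
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(nmp.sum(labels == -1))
# Core points have at least min_samples points (including themselves) within eps
n_core = len(model.core_sample_indices_)

print("Labels:", labels)
print("Clusters:", n_clusters)
print("Noise points:", n_noise)
print("Core points:", n_core)
```

The same three lines of counting can be applied to the labeling variable of the tutorial to see how many clusters the chosen eps and min_samples produce.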

Step 6: Visualize the clustering model:

# Visualization of clustering model by giving different colours
# (the mapping below assumes that DBSCAN found at most three clusters)
colours = {}
# First colour in visualization is green
colours[0] = 'g'
# Second colour in visualization is black
colours[1] = 'k'
# Third colour in visualization is red
colours[2] = 'r'
# Last colour in visualization is blue (label -1, the noise points)
colours[-1] = 'b'
# Creating a colour vector for each data point in the dataset cluster
cvec = [colours[label] for label in labeling]
# Fitting the size of the figure with figure function
pplt.figure(figsize =(9, 9))
# Construction of the legend: empty scatter plots serve as colour handles
# Handle for green colour
g = pplt.scatter([], [], color ='g')
# Handle for black colour
k = pplt.scatter([], [], color ='k')
# Handle for red colour
r = pplt.scatter([], [], color ='r')
# Handle for blue colour
b = pplt.scatter([], [], color ='b')
# Plotting C1 column on the X-Axis and C2 on the Y-Axis
# Scattering the data points in the Visualization graph
pplt.scatter(M_principal['C1'], M_principal['C2'], c = cvec)
# Building the legend with the coloured handles and their labels
pplt.legend((g, k, r, b), ('Label M.0', 'Label M.1', 'Label M.2', 'Label M.-1'))
# Showing Visualization in the output
pplt.show()

Output:

[Figure: scatter plot of the data points on the C1 and C2 axes, coloured by cluster label]

As we can see in the output, the graph plots the data points of the dataset and visualizes the clustering by giving the points of each cluster a different colour.
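
The colour dictionary in this step covers a fixed set of labels, so a run that produces more clusters would raise a KeyError. As an alternative sketch (not the approach used in this tutorial), the hypothetical helper plot_clusters() below draws one scatter per label it finds and builds the legend from those scatters; the demo coordinates and labels are made up.

```python
import numpy as nmp
import matplotlib.pyplot as pplt

def plot_clusters(x, y, labels):
    """Scatter-plot the points with one colour per label; noise (-1) in black."""
    x = nmp.asarray(x)
    y = nmp.asarray(y)
    labels = nmp.asarray(labels)
    fig, ax = pplt.subplots(figsize=(9, 9))
    # Sorting puts the noise label -1 first, followed by clusters 0, 1, 2, ...
    for lab in sorted(set(labels.tolist())):
        mask = labels == lab
        # "C0" to "C9" are the colours of matplotlib's default colour cycle
        colour = "k" if lab == -1 else f"C{lab % 10}"
        name = "Noise" if lab == -1 else f"Cluster {lab}"
        ax.scatter(x[mask], y[mask], color=colour, label=name)
    ax.set_xlabel("C1")
    ax.set_ylabel("C2")
    ax.legend(loc="upper left")
    return fig, ax

# Hypothetical demo data: two small clusters and one noise point
demo_x = [0.0, 0.1, 1.0, 1.1, 5.0]
demo_y = [0.0, 0.1, 1.0, 1.1, 5.0]
demo_labels = [0, 0, 1, 1, -1]

fig, ax = plot_clusters(demo_x, demo_y, demo_labels)
legend_names = [text.get_text() for text in ax.get_legend().get_texts()]
print(legend_names)
```

With the tutorial's variables, the call would be plot_clusters(M_principal['C1'], M_principal['C2'], labeling), followed by pplt.show() to display the figure.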

Step 7: Tuning the parameters:

In this step, we tune the model by changing the parameters that we previously passed to the DBSCAN class. Here we keep eps at 0.0375 and raise min_samples from 3 to 50:

# Tuning the parameters of the model inside the DBSCAN function
dts = DBSCAN(eps = 0.0375, min_samples = 50).fit(M_principal)
# Retrieving the cluster label of each data point
labeling = dts.labels_
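
The effect of min_samples can also be summarized in numbers before plotting anything. The sketch below runs DBSCAN with several min_samples values on hypothetical synthetic data (two dense blobs plus scattered background points, generated here for illustration) and records the number of clusters and noise points for each setting.

```python
import numpy as nmp
from sklearn.cluster import DBSCAN

rng = nmp.random.default_rng(42)

# Hypothetical 2-D data: two dense blobs plus scattered background points
blob_a = rng.normal(loc=(0.0, 0.0), scale=0.05, size=(100, 2))
blob_b = rng.normal(loc=(1.0, 1.0), scale=0.05, size=(100, 2))
background = rng.uniform(low=-0.5, high=1.5, size=(30, 2))
sample = nmp.vstack([blob_a, blob_b, background])

results = []
for min_samples in (3, 10, 50):
    # eps stays fixed; only min_samples changes between runs
    labels = DBSCAN(eps=0.05, min_samples=min_samples).fit(sample).labels_
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise = int((labels == -1).sum())
    results.append((min_samples, n_clusters, n_noise))
    print(f"min_samples={min_samples}: {n_clusters} clusters, {n_noise} noise points")
```

With eps fixed, raising min_samples can only turn core points into non-core points, so the number of noise points never decreases from one setting to the next.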

Step 8: Visualization of the changes:

After tuning the parameters of the cluster model, we visualize the resulting changes in the clustering by colouring the data points according to their labels, as we did before.

# Labelling with different colours
colours1 = {}
# labelling with Red colour
colours1[0] = 'r'
# labelling with Green colour
colours1[1] = 'g'
# labelling with Blue colour
colours1[2] = 'b'
# labelling with Cyan colour
colours1[3] = 'c'
# labelling with Yellow colour
colours1[4] = 'y'
# labelling with Magenta colour
colours1[5] = 'm'
# labelling with Black colour (label -1, the noise points)
colours1[-1] = 'k'
# Labelling the data points with the colour variable we have defined
cvec = [colours1[label] for label in labeling]
# Defining all colours that we will use
colors = ['r', 'g', 'b', 'c', 'y', 'm', 'k']
# Fitting the size of the figure with figure function
pplt.figure(figsize =(9, 9))
# Creating empty scatter plots that serve as legend handles, one per colour
r = pplt.scatter([], [], marker ='o', color = colors[0])
g = pplt.scatter([], [], marker ='o', color = colors[1])
b = pplt.scatter([], [], marker ='o', color = colors[2])
c = pplt.scatter([], [], marker ='o', color = colors[3])
y = pplt.scatter([], [], marker ='o', color = colors[4])
m = pplt.scatter([], [], marker ='o', color = colors[5])
k = pplt.scatter([], [], marker ='o', color = colors[6])
# Scattering column 1 into X-axis and column 2 into y-axis
pplt.scatter(M_principal['C1'], M_principal['C2'], c = cvec)
# Constructing a legend with the colours we have defined
pplt.legend((r, g, b, c, y, m, k),
            ('Label M.0', 'Label M.1', 'Label M.2', 'Label M.3', 'Label M.4', 'Label M.5', 'Label M.-1'), # Labels for the clusters and the noise points
            scatterpoints = 1, # Number of marker points shown per legend entry
            loc ='upper left', # Location of the legend
            ncol = 3, # Number of legend columns
            fontsize = 10) # Size of the font
# Displaying the visualisation of changes in cluster scattering
pplt.show()

Output:

[Figure: scatter plot of the data points on the C1 and C2 axes, coloured by cluster label after tuning the parameters]

By looking at the output, we can clearly observe how tuning the parameters of the DBSCAN function has changed the clustering of the data points. Observing these changes also helps us understand how the DBSCAN algorithm works and how it is useful for visualizing the clusters of data points present in a dataset.

Original article by ItWorker. If you repost it, please cite the source: https://blog.ytso.com/263739.html
