Principal Component Analysis (PCA) with Python

Principal Component Analysis (PCA) is a statistical technique that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables, called principal components.

Each principal component is chosen to capture as much of the remaining variance in the data as possible, and all principal components are orthogonal to each other. Among all the principal components, the first one always has the largest variance.
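The two properties above (uncorrelated components, ordered by variance) can be checked numerically. The following is a minimal sketch using NumPy only; the toy data and all variable names are assumptions made for illustration and are not part of the wine dataset used later.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 200 samples of 3 variables, two of them strongly correlated.
base = rng.normal(size=(200, 1))
data = np.hstack([
    base + 0.1 * rng.normal(size=(200, 1)),
    2.0 * base + 0.1 * rng.normal(size=(200, 1)),
    rng.normal(size=(200, 1)),
])

# Center the data and eigendecompose its covariance matrix.
centered = data - data.mean(axis=0)
cov = np.cov(centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# eigh returns eigenvalues in ascending order, so reorder to descending.
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
eigenvectors = eigenvectors[:, order]

# Project the centered data onto the principal directions.
scores = centered @ eigenvectors

component_variances = scores.var(axis=0, ddof=1)
score_cov = np.cov(scores, rowvar=False)

print("variance per component:", component_variances)
print("covariance between components 1 and 2:", score_cov[0, 1])
```

The variance of each projected column matches the corresponding eigenvalue, and the covariances between different components are zero up to floating-point error.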

Different Uses of Principal Component Analysis:

PCA can be used for finding interrelations between various variables in the data.
PCA can be used for interpreting and visualizing the data sets.
PCA can also be used for visualizing genetic distance and connection between populations.
PCA also simplifies analysis by reducing the number of variables.

Principal component analysis is usually performed on a square symmetric matrix, which can be a pure sum of squares and cross products (SSCP) matrix, a covariance matrix, or a correlation matrix. The correlation matrix is used when the variances of the individual variables differ greatly.
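To see why the choice of matrix matters, here is a small sketch on hypothetical data whose columns have very different scales. It assumes independent random columns, so no real structure should dominate; the names `cov_share` and `corr_share` are introduced here only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: three independent columns with very different variances.
raw = rng.normal(size=(100, 3)) * np.array([1.0, 50.0, 1000.0])

cov_matrix = np.cov(raw, rowvar=False)
corr_matrix = np.corrcoef(raw, rowvar=False)

# The correlation matrix is the covariance matrix of the standardized data.
standardized = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
cov_of_standardized = np.cov(standardized, rowvar=False)

# Share of the total variance taken by the largest eigenvalue in each case.
cov_share = np.linalg.eigvalsh(cov_matrix)[-1] / np.trace(cov_matrix)
corr_share = np.linalg.eigvalsh(corr_matrix)[-1] / np.trace(corr_matrix)

print(f"covariance matrix:  first component share = {cov_share:.3f}")
print(f"correlation matrix: first component share = {corr_share:.3f}")
```

With the covariance matrix, the large-scale column dominates the first component almost entirely; with the correlation matrix, every variable enters on an equal footing.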

What are the Objectives of Principal Component Analysis?

The basic objectives of PCA are as follows:

PCA is a non-dependent method (it does not distinguish between dependent and independent variables) that can be used to reduce the attribute space from a larger number of variables to a smaller number of factors.
It is a dimension-reducing technique, but there is no assurance that the resulting dimensions will be interpretable.
PCA can also be used to select a subset of variables from a larger set, based on which original variables have the highest correlation with the principal components.

Principal Axis Method: PCA searches for the linear combination of the variables that extracts the maximum variance from them. Once this first combination is found, PCA moves on to another linear combination that explains the maximum proportion of the remaining variance, and so on, which leads to orthogonal factors. This method is used for analysing the total variance in the variables of the set.
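The "extract the most variance, then repeat on what remains" idea can be sketched with power iteration and deflation. This is one possible way to illustrate the procedure, not necessarily how any particular library computes PCA; the helper `leading_direction`, the toy data and the mixing matrix are all hypothetical.

```python
import numpy as np

def leading_direction(cov, n_iter=500):
    """Power iteration: approximate the unit vector along which variance is largest."""
    vector = np.ones(cov.shape[0]) / np.sqrt(cov.shape[0])
    for _ in range(n_iter):
        vector = cov @ vector
        vector = vector / np.linalg.norm(vector)
    return vector

rng = np.random.default_rng(2)

# Hypothetical correlated data: independent columns mixed by a fixed matrix.
mixing = np.array([[1.0, 0.5, 0.0],
                   [0.0, 1.0, 0.5],
                   [0.5, 0.0, 1.0]])
data = (rng.normal(size=(500, 3)) * np.array([5.0, 2.0, 0.5])) @ mixing
cov = np.cov(data, rowvar=False)

directions, variances = [], []
remaining = cov.copy()
for _ in range(2):
    direction = leading_direction(remaining)
    variance = direction @ remaining @ direction
    directions.append(direction)
    variances.append(variance)
    # Deflation: remove the variance already explained by this direction.
    remaining = remaining - variance * np.outer(direction, direction)

overlap = directions[0] @ directions[1]
print("variance explained by each direction:", variances)
print("dot product of the two directions:", overlap)
```

The second direction is found only after the first one has been removed, which is why the two directions come out orthogonal.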

Eigenvector: A nonzero vector whose direction is unchanged when it is multiplied by the matrix. Suppose V is an eigenvector of dimension R of a matrix K with dimension R * R. Then KV and V are parallel, so to find the eigenvector and the eigenvalue we solve KV = PV, where both the vector V and the scalar P are unknown.
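As a small worked example of the relation KV = PV, the sketch below uses a hypothetical 2 x 2 symmetric matrix and NumPy's `numpy.linalg.eigh`, keeping the names K, V and P from the text.

```python
import numpy as np

# Hypothetical 2 x 2 symmetric matrix, chosen only for illustration.
K = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is meant for symmetric matrices: it returns the eigenvalues P in
# ascending order and the matching unit eigenvectors as the columns of V.
P, V = np.linalg.eigh(K)

for index in range(K.shape[0]):
    v = V[:, index]
    # K @ v and P[index] * v should be the same vector.
    print("eigenvalue:", P[index])
    print("  K @ v =", K @ v)
    print("  P * v =", P[index] * v)
```

For this matrix the eigenvalues are 1 and 3, and multiplying K by either eigenvector only rescales it.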

Eigenvalue: Also known as a "characteristic root" in PCA. It measures the amount of variance across all the variables of the set that is accounted for by a given factor. The ratio of an eigenvalue to the sum of all eigenvalues indicates the explanatory importance of that factor with respect to the variables. If the eigenvalue of a factor is low, that factor contributes little to explaining the variables.
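The proportion described above can be computed directly from the eigenvalues of the covariance matrix. The following is a minimal sketch on hypothetical data in which the last variable carries very little variance; the variable names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

# Hypothetical data: four independent variables with decreasing spread.
data = rng.normal(size=(300, 4)) * np.array([4.0, 2.0, 1.0, 0.1])

# Eigenvalues of the covariance matrix, largest first.
eigenvalues = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]

# Each factor's share of the total variance.
proportions = eigenvalues / eigenvalues.sum()

for index, (value, share) in enumerate(zip(eigenvalues, proportions), start=1):
    print(f"factor {index}: eigenvalue = {value:.3f}, share of variance = {share:.1%}")
```

The shares sum to 1, and the factor with the smallest eigenvalue explains only a tiny fraction of the variance, so it is the natural one to drop.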

Now, we will discuss Principal Component Analysis with Python.

Following are the Steps for Using PCA with Python:

In this tutorial, we will use the wine.csv dataset.

Step 1: We will import the libraries.

import numpy as nmp
import matplotlib.pyplot as mpltl
import pandas as pnd

Step 2: We will import the dataset (wine.csv)

First, we will import the dataset and split it into the feature columns (X) and the target column (Y) for data analysis.

DS = pnd.read_csv('Wine.csv')

# Now, we will split the dataset into two components, "X" (features) and "Y" (target)
X = DS.iloc[:, 0:13].values
Y = DS.iloc[:, 13].values

Step 3: In this step, we will split the dataset into the training set and testing set.

from sklearn.model_selection import train_test_split as tts

# Hold out 20% of the rows for testing; random_state makes the split reproducible
X_train, X_test, Y_train, Y_test = tts(X, Y, test_size = 0.2, random_state = 0)

Step 4: Now, we will apply feature scaling.

In this step, we will pre-process the training and testing sets by fitting a standard scaler.

from sklearn.preprocessing import StandardScaler as SS
SC = SS()

# Fit the scaler on the training set only, then reuse it for the testing set
X_train = SC.fit_transform(X_train)
X_test = SC.transform(X_test)
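To make clear what the scaler does, the sketch below compares `StandardScaler` with the same computation written by hand. The random arrays stand in for the real training and testing sets and are purely hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)

# Hypothetical stand-ins for the training and testing feature matrices.
train = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(80, 2))
test = rng.normal(loc=[10.0, 200.0], scale=[2.0, 50.0], size=(20, 2))

scaler = StandardScaler().fit(train)
scaled_train = scaler.transform(train)
scaled_test = scaler.transform(test)

# The same thing by hand: subtract the training mean and divide by the
# training standard deviation (population form, ddof=0).
train_mean = train.mean(axis=0)
train_std = train.std(axis=0)
manual_train = (train - train_mean) / train_std
manual_test = (test - train_mean) / train_std

print("training columns after scaling: mean", scaled_train.mean(axis=0))
print("training columns after scaling: std ", scaled_train.std(axis=0))
```

Note that the testing set is scaled with the statistics of the training set, so its columns will generally not have exactly zero mean and unit variance.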

Step 5: Then, Apply the PCA function

We will fit PCA on the training set and apply the same transformation to the testing set.

from sklearn.decomposition import PCA

# Keep two components, because the plots in Steps 9 and 10 use PC_1 and PC_2
PCa = PCA(n_components = 2)

X_train = PCa.fit_transform(X_train)
X_test = PCa.transform(X_test)

# Share of the total variance explained by each retained component
explained_variance = PCa.explained_variance_ratio_
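One way to decide how many components to keep is to look at the cumulative explained variance. The sketch below assumes scikit-learn's bundled wine data (`sklearn.datasets.load_wine`) as a stand-in, since the contents of Wine.csv are not shown here; the two files may differ, so the numbers it prints should not be taken as the tutorial's results.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Assumption: scikit-learn's bundled wine data replaces Wine.csv.
features, target = load_wine(return_X_y=True)
train_x, test_x, train_y, test_y = train_test_split(
    features, target, test_size=0.2, random_state=0
)

train_scaled = StandardScaler().fit_transform(train_x)

# Fit PCA with all components to inspect how the variance is distributed.
full_pca = PCA().fit(train_scaled)
ratios = full_pca.explained_variance_ratio_
cumulative = np.cumsum(ratios)

# Smallest number of components whose cumulative share reaches 90%.
n_for_90 = int(np.searchsorted(cumulative, 0.90) + 1)

print("explained variance ratio:", np.round(ratios, 3))
print("components needed for 90% of the variance:", n_for_90)
```

Two components are convenient for plotting, but a higher count may be preferable when the goal is accuracy rather than visualization.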

Step 6: Now, we will fit Logistic Regression for the training set

from sklearn.linear_model import LogisticRegression as LR

classifier_1 = LR(random_state = 0)
classifier_1.fit(X_train, Y_train)

Output:

LogisticRegression(random_state=0)

Step 7: Here, we will predict the testing set result:

Y_pred = classifier_1.predict(X_test)

Step 8: We will create the confusion matrix.

from sklearn.metrics import confusion_matrix as CM

# Rows are the true classes, columns are the predicted classes
c_m = CM(Y_test, Y_pred)
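To read a confusion matrix, it helps to try it on a tiny hand-made example. The labels below are hypothetical and unrelated to the wine data; they only show how the matrix is laid out and how accuracy follows from its diagonal.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels for six samples and three classes.
true_labels = [0, 0, 1, 1, 2, 2]
predicted_labels = [0, 1, 1, 1, 2, 0]

# Entry [i, j] counts the samples of true class i that were predicted as class j.
matrix = confusion_matrix(true_labels, predicted_labels)

# Correct predictions sit on the diagonal, so accuracy is trace / total.
accuracy = matrix.trace() / matrix.sum()

print(matrix)
print("accuracy:", accuracy)
```

Here four of the six samples lie on the diagonal, so the accuracy is 4/6; the same value is returned by `accuracy_score`.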

Step 9: Then, we will visualize the result of the training set.

from matplotlib.colors import ListedColormap as LCM

X_set, Y_set = X_train, Y_train

# Build a fine grid covering the range of the two principal components
X_1, X_2 = nmp.meshgrid(nmp.arange(start = X_set[:, 0].min() - 1,
                                   stop = X_set[:, 0].max() + 1, step = 0.01),
                        nmp.arange(start = X_set[:, 1].min() - 1,
                                   stop = X_set[:, 1].max() + 1, step = 0.01))

# Colour each grid point by the class the classifier predicts for it
mpltl.contourf(X_1, X_2, classifier_1.predict(nmp.array([X_1.ravel(),
               X_2.ravel()]).T).reshape(X_1.shape), alpha = 0.75,
               cmap = LCM(('yellow', 'grey', 'green')))

mpltl.xlim(X_1.min(), X_1.max())
mpltl.ylim(X_2.min(), X_2.max())

# Overlay the actual training points, one colour per class
for s, t in enumerate(nmp.unique(Y_set)):
    mpltl.scatter(X_set[Y_set == t, 0], X_set[Y_set == t, 1],
                  color = LCM(('red', 'green', 'blue'))(s), label = t)

mpltl.title('Logistic Regression for Training set')
mpltl.xlabel('PC_1') # for X_label
mpltl.ylabel('PC_2') # for Y_label
mpltl.legend() # for showing legend

# show scatter plot
mpltl.show()

Output:

[Figure: Logistic Regression for Training set, with PC_1 on the x-axis and PC_2 on the y-axis]

Step 10: At last, we will visualize the result of the testing set.

from matplotlib.colors import ListedColormap as LCM

X_set, Y_set = X_test, Y_test

# Build a fine grid covering the range of the two principal components
X_1, X_2 = nmp.meshgrid(nmp.arange(start = X_set[:, 0].min() - 1,
                                   stop = X_set[:, 0].max() + 1, step = 0.01),
                        nmp.arange(start = X_set[:, 1].min() - 1,
                                   stop = X_set[:, 1].max() + 1, step = 0.01))

# Colour each grid point by the class the classifier predicts for it
mpltl.contourf(X_1, X_2, classifier_1.predict(nmp.array([X_1.ravel(),
               X_2.ravel()]).T).reshape(X_1.shape), alpha = 0.75,
               cmap = LCM(('pink', 'grey', 'aquamarine')))

mpltl.xlim(X_1.min(), X_1.max())
mpltl.ylim(X_2.min(), X_2.max())

# Overlay the actual testing points, one colour per class
for s, t in enumerate(nmp.unique(Y_set)):
    mpltl.scatter(X_set[Y_set == t, 0], X_set[Y_set == t, 1],
                  color = LCM(('red', 'green', 'blue'))(s), label = t)

# title for scatter plot
mpltl.title('Logistic Regression for Testing set')
mpltl.xlabel('PC_1') # for X_label
mpltl.ylabel('PC_2') # for Y_label
mpltl.legend()

# show scatter plot
mpltl.show()

Output:

[Figure: Logistic Regression for Testing set, with PC_1 on the x-axis and PC_2 on the y-axis]

Conclusion

In this tutorial, we have learned about principal component analysis with Python, its uses and objectives, and how to apply it to a dataset and evaluate the result on the training and testing sets.

Original article by ItWorker. If you republish it, please credit the source: https://blog.ytso.com/tech/courses/263401.html
