Implementation of Linear Regression using Python

Linear regression is a statistical technique for modeling the relationship between a dependent variable and one or more independent variables. This tutorial covers the basic concepts of linear regression as well as its implementation in Python. To build up the concept from its foundations, we begin with the most basic form of linear regression, i.e., simple linear regression.

Simple Linear Regression

Simple linear regression (SLR) is a method for predicting a response using a single feature. The two variables are assumed to be linearly related, so we look for a linear equation that predicts the response value (y) as accurately as possible as a function of the feature, or independent variable (x).

Consider a dataset that contains one response value y for each value of a feature x. For simplicity, we define x as the feature vector, i.e., x = [x_1, x_2, x_3, …, x_n], and y as the response vector, i.e., y = [y_1, y_2, y_3, …, y_n], for n observations (here, n = 10). A scatter plot of such a dataset shows each response value y_i plotted against its feature value x_i.

The next step is to find the line that best fits this scatter plot, so that we can predict the response for any new feature value (i.e., a value of x that is not in the dataset). This line is called the regression line, and its equation is

h(x_i) = β_0 + β_1·x_i

Here, h(x_i) is the predicted response value for the i-th observation, β_0 is the y-intercept, and β_1 is the slope of the regression line.

To build our model, we must "learn" or estimate the values of the regression coefficients β_0 and β_1. Once we have estimated these coefficients, we can use the model to predict responses. In this tutorial, we use the principle of least squares. Consider

y_i = β_0 + β_1·x_i + ε_i = h(x_i) + ε_i, so that ε_i = y_i − h(x_i)

Here, ε_i is the residual error of the i-th observation, and our goal is to minimize the total residual error. We define the cost function (squared error) J as

J(β_0, β_1) = (1/2n)·Σ_{i=1..n} ε_i²

and our task is to find the values of β_0 and β_1 for which J(β_0, β_1) is minimal. Without going into the mathematical details, the result is

β_1 = SS_xy / SS_xx,  β_0 = ȳ − β_1·x̄

where x̄ and ȳ are the means of x and y, SS_xy is the sum of cross-deviations of y and x,

SS_xy = Σ_{i=1..n} (x_i − x̄)(y_i − ȳ) = Σ_{i=1..n} x_i·y_i − n·x̄·ȳ

and SS_xx is the sum of squared deviations of x,

SS_xx = Σ_{i=1..n} (x_i − x̄)² = Σ_{i=1..n} x_i² − n·x̄²

Code:
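The formulas above translate almost line for line into NumPy. The following is a minimal sketch of the estimator together with a plot of the regression line; the ten-point dataset in it is an illustrative stand-in (not the exact data behind the output quoted below), so the coefficients printed depend entirely on the data supplied.

```python
import numpy as np
import matplotlib.pyplot as plt

def estimate_coef(x, y):
    # number of observations
    n = np.size(x)
    # means of the x and y vectors
    m_x, m_y = np.mean(x), np.mean(y)
    # sum of cross-deviations SS_xy and squared deviations SS_xx
    SS_xy = np.sum(y * x) - n * m_y * m_x
    SS_xx = np.sum(x * x) - n * m_x * m_x
    # regression coefficients: slope b_1 and intercept b_0
    b_1 = SS_xy / SS_xx
    b_0 = m_y - b_1 * m_x
    return (b_0, b_1)

def plot_regression_line(x, y, b):
    # scatter plot of the observations
    plt.scatter(x, y, color="m", marker="o", s=30)
    # predicted response vector and regression line
    y_pred = b[0] + b[1] * x
    plt.plot(x, y_pred, color="g")
    plt.xlabel("x")
    plt.ylabel("y")
    plt.show()

if __name__ == "__main__":
    # stand-in dataset with n = 10 observations; replace with your own data
    x = np.array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
    y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12])
    b = estimate_coef(x, y)
    print("Estimated coefficients are :\nb_0 = {}\nb_1 = {}".format(b[0], b[1]))
    plot_regression_line(x, y, b)
```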
Output:

Estimated coefficients are :
b_0 = -0.4606060606060609
b_1 = 1.1696969696969697

Multiple Linear Regression

Multiple linear regression attempts to model the relationship between several features and a response by fitting a linear equation to the observed data; it is simply an extension of simple linear regression. Consider a dataset with p features (or independent variables) and one response (or dependent variable), consisting of n rows/observations. We define:

X (feature matrix) = a matrix of size n × p in which x_ij is the value of the j-th feature for the i-th observation.

y (response vector) = a vector of size n in which y_i is the value of the response for the i-th observation.

The regression line for p features is represented as

h(x_i) = β_0 + β_1·x_i1 + β_2·x_i2 + … + β_p·x_ip

where h(x_i) is the predicted response value for the i-th observation and β_0, β_1, β_2, …, β_p are the regression coefficients. We can also write

y_i = h(x_i) + ε_i

where ε_i is the residual error of the i-th observation. We can generalize the linear model a little further by prepending a column of ones to the feature matrix X (setting x_i0 = 1 for every i, so that the intercept β_0 is handled like any other coefficient). The linear model can then be expressed in matrix form as

y = Xβ + ε

where y is the response vector, X is the (augmented) feature matrix, β is the vector of regression coefficients, and ε is the vector of residual errors. We now determine an estimate of β, denoted b', using the least squares method. As before, least squares chooses the b' that minimizes the total residual error, which gives

b' = (X'X)^(-1)·X'·y

where X' is the transpose of the matrix X and (X'X)^(-1) is the inverse of X'X. Given the least squares estimate b', the fitted values of the multiple linear regression model are

y' = X·b'

where y' is the estimated response vector. Code:
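The pipeline above can be sketched with scikit-learn as follows. The quoted output came from a dataset with 13 features that is not included in the text, so this sketch loads scikit-learn's bundled diabetes dataset as a stand-in; the numbers it prints will differ, but the steps are identical for any numeric (X, y) pair. The final lines cross-check scikit-learn's answer against the normal-equations formula derived above.

```python
import numpy as np
from sklearn import linear_model
from sklearn.datasets import load_diabetes
from sklearn.metrics import explained_variance_score
from sklearn.model_selection import train_test_split

# load a feature matrix X (n x p) and a response vector y of length n;
# load_diabetes is only a stand-in dataset for this sketch
X, y = load_diabetes(return_X_y=True)

# hold out 40% of the observations for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=1
)

# fit an ordinary least squares model
reg = linear_model.LinearRegression()
reg.fit(X_train, y_train)
print("Regression Coefficients are:", reg.coef_)

# evaluate with the explained variance score on the held-out data
y_pred = reg.predict(X_test)
print("Variance score is:", explained_variance_score(y_test, y_pred))

# cross-check against the normal equations b' = (X'X)^(-1) X'y,
# with a column of ones prepended so the intercept is estimated too
X1 = np.column_stack([np.ones(len(X_train)), X_train])
b = np.linalg.inv(X1.T @ X1) @ X1.T @ y_train
# b[0] is the intercept beta_0; b[1:] agrees with reg.coef_
# up to floating-point error
```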
Output:

Regression Coefficients are: [-8.95714048e-02 6.73132853e-02 5.04649248e-02 2.18579583e+00 -1.72053975e+01 3.63606995e+00 2.05579939e-03 -1.36602886e+00 2.89576718e-01 -1.22700072e-02 -8.34881849e-01 9.40360790e-03 -5.04008320e-01]
Variance score is: 0.7209056672661751

In the above example, we measure accuracy using the explained variance score, defined as

explained_variance_score = 1 − Var{y − y'} / Var{y}

where y' is the estimated target output, y is the corresponding (correct) target output, and Var is the variance, i.e., the square of the standard deviation. The best possible score is 1.0; lower values are worse.
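As a quick sanity check, this formula can be computed by hand and compared with scikit-learn's implementation; the two four-point vectors below are made-up illustration values.

```python
import numpy as np
from sklearn.metrics import explained_variance_score

y_true = np.array([3.0, -0.5, 2.0, 7.0])   # correct target outputs y
y_pred = np.array([2.5, 0.0, 2.0, 8.0])    # estimated target outputs y'

# direct translation of the formula: 1 - Var{y - y'} / Var{y}
manual = 1 - np.var(y_true - y_pred) / np.var(y_true)
print(manual)                                    # ~0.9572
print(explained_variance_score(y_true, y_pred))  # same value
```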
Assumptions:

Linear regression rests on the following main assumptions about the dataset to which it is applied:

- Linear relationship: the response and the features are (approximately) linearly related.
- Little or no multicollinearity: the features are not highly correlated with one another.
- Little or no autocorrelation: the residual errors are independent of one another.
- Homoscedasticity: the residual errors have constant variance across all values of the features.

To wrap up the tutorial, here are some of the fields in which linear regression is applied.

Applications:

- Trend lines: describing how a quantity of interest changes over time.
- Economics: for example, predicting consumption spending, labor demand, or labor supply.
- Finance: for example, the capital asset pricing model, which relates an asset's return to the return of the market.
- Biology: modeling relationships between parameters in biological systems.