Linear Regression

Linear regression is used to determine how an outcome variable, called the dependent variable, depends linearly on a set of known variables, called the independent variables. The dependent variable is typically denoted by $y$ and the independent variables by $x_1, x_2, \dots, x_k$, where $k$ is the number of independent variables. We are interested in finding the best possible coefficients $\beta_0, \beta_1, \beta_2, \dots, \beta_k$ such that our predicted values:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k$$

are as close as possible to the actual $y$ values. This is achieved by minimizing the sum of the squared differences between the actual values, $y$, and the predictions, $\hat{y}$. These differences, $(y - \hat{y})$, are often called error terms or residuals.
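
To make the least-squares criterion concrete, here is a minimal sketch in R on made-up data. The data frame toy, its columns, and the helper sse() are hypothetical names chosen for illustration. The point is only that the coefficients returned by lm() give a smaller sum of squared residuals than a nearby alternative.

```r
# Hypothetical toy data: y depends roughly linearly on x1 and x2
set.seed(1)
toy <- data.frame(x1 = runif(50), x2 = runif(50))
toy$y <- 2 + 3 * toy$x1 - 1.5 * toy$x2 + rnorm(50, sd = 0.2)

# Sum of squared residuals for a given coefficient vector (b0, b1, b2)
sse <- function(beta, data) {
  y_hat <- beta[1] + beta[2] * data$x1 + beta[3] * data$x2
  sum((data$y - y_hat)^2)
}

# Least-squares coefficients as estimated by lm()
fit <- lm(y ~ x1 + x2, data = toy)
beta_ls <- unname(coef(fit))

# Shifting the intercept away from the least-squares value increases the SSE
sse_ls <- sse(beta_ls, toy)
sse_shifted <- sse(beta_ls + c(0.1, 0, 0), toy)
print(c(least_squares = sse_ls, shifted = sse_shifted))
```

The same comparison can be repeated with other shifts of the coefficients; the least-squares fit should always have the smallest value.
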
Once you have constructed a linear regression model, it is important to evaluate the model by going through the following steps:

  • Check the significance of the coefficients, and remove insignificant independent variables.
  • Check the $R^2$ value of the model.
  • Check the predictive ability of the model on out-of-sample data.
  • Check for multicollinearity.

Linear Regression in R: Suppose our training data frame is called “TrainData”, the dependent variable is called “DepVar”, and we have two independent variables, called “IndepVar1” and “IndepVar2”. Then you can build a linear regression model in R called “RegModel” as follows:

RegModel = lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)

To see the $R^2$ of the model, the coefficients, and the significance of the coefficients, use the summary() function:

summary(RegModel)
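
The values that summary() prints can also be pulled out programmatically. The sketch below builds a made-up TrainData (the simulated values and the 5% significance level are assumptions for illustration only) and reads the coefficients matrix and the r.squared element from the object returned by summary().

```r
# Made-up training data; IndepVar2 has no real effect on DepVar
set.seed(42)
TrainData <- data.frame(IndepVar1 = rnorm(100), IndepVar2 = rnorm(100))
TrainData$DepVar <- 1 + 2 * TrainData$IndepVar1 + rnorm(100)

RegModel <- lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)
ModelSummary <- summary(RegModel)

# Coefficient table: estimate, standard error, t value, p-value
CoefTable <- ModelSummary$coefficients
print(CoefTable)

# p-values are in the column named "Pr(>|t|)"
PValues <- CoefTable[, "Pr(>|t|)"]
print(PValues[PValues < 0.05])   # coefficients significant at the 5% level

# R-squared of the model on the training data
print(ModelSummary$r.squared)
```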

To check for multicollinearity, correlations can be computed with the cor() function:

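# Correlation between two specific variables
cor(TrainData$IndepVar1, TrainData$IndepVar2)
# Correlation matrix of all variables (every column must be numeric)
cor(TrainData)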
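
As a rough illustration, the following sketch simulates a TrainData in which the two independent variables are deliberately related, then lists the pairs of independent variables whose absolute correlation exceeds a cutoff. The simulated data and the 0.7 cutoff are assumptions for the example, not fixed rules.

```r
set.seed(7)
IndepVar1 <- rnorm(200)
# IndepVar2 is built from IndepVar1, so the two are highly correlated
IndepVar2 <- IndepVar1 + rnorm(200, sd = 0.3)
DepVar <- 1 + IndepVar1 + IndepVar2 + rnorm(200)
TrainData <- data.frame(DepVar, IndepVar1, IndepVar2)

# Correlation matrix of the independent variables only
CorMatrix <- cor(TrainData[, c("IndepVar1", "IndepVar2")])
print(round(CorMatrix, 2))

# Row/column indices of variable pairs above the (assumed) cutoff,
# looking at the upper triangle so each pair is reported once
Cutoff <- 0.7
HighPairs <- which(abs(CorMatrix) > Cutoff & upper.tri(CorMatrix), arr.ind = TRUE)
print(HighPairs)
```

When a pair is flagged like this, a common response is to drop one of the two variables and refit the model.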

If our out-of-sample data, or test set, is called “TestData”, we can compute the test set predictions and the test set $R^2$ as follows:

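# Predictions of the model on the test set
TestPredictions = predict(RegModel, newdata=TestData)
# Sum of squared errors of the model on the test set
SSE = sum((TestData$DepVar - TestPredictions)^2)
# Sum of squared errors of the baseline (the training set mean) on the test set
SST = sum((TestData$DepVar - mean(TrainData$DepVar))^2)
Rsquared = 1 - SSE/SST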

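Thus Rsquared involves a three-way comparison between the test data, the model's predictions, and the training data. SSE measures how far the test set values are from the model's predictions. SST measures how far the test set values are from the baseline prediction, which is the mean of the dependent variable in the training data.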
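
Putting the pieces together, here is a small end-to-end sketch on simulated data. The simulated data frame, the 70/30 split, and the helper name OutOfSampleR2 are assumptions made for illustration; the calculation itself follows the SSE and SST formulas above.

```r
# Simulated data, split into training and test sets (70/30, an arbitrary choice)
set.seed(123)
AllData <- data.frame(IndepVar1 = rnorm(300), IndepVar2 = rnorm(300))
AllData$DepVar <- 5 + 2 * AllData$IndepVar1 - AllData$IndepVar2 + rnorm(300)
TrainRows <- sample(nrow(AllData), size = 210)
TrainData <- AllData[TrainRows, ]
TestData <- AllData[-TrainRows, ]

# Hypothetical helper: out-of-sample R-squared with the training mean as baseline
OutOfSampleR2 <- function(Model, TrainData, TestData, DepVarName) {
  Predictions <- predict(Model, newdata = TestData)
  SSE <- sum((TestData[[DepVarName]] - Predictions)^2)
  SST <- sum((TestData[[DepVarName]] - mean(TrainData[[DepVarName]]))^2)
  1 - SSE / SST
}

RegModel <- lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)
TestR2 <- OutOfSampleR2(RegModel, TrainData, TestData, "DepVar")
print(TestR2)
```

A value near 1 means the model's test set errors are much smaller than those of the baseline. Unlike the training set value, the test set value can be negative, which happens when the model predicts worse than the baseline.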
