Linear regression – A method used to determine how an outcome variable, called the dependent variable, depends linearly on a set of known variables, called the independent variables. The dependent variable is typically denoted by y and the independent variables by x1, x2, …, xk, where k is the number of independent variables. We are interested in finding the coefficients β0, β1, β2, …, βk such that our predicted values:
y^ = β0 + β1x1 + β2x2 + … + βkxk
are as close as possible to the actual y values. This is achieved by minimizing the sum of the squared differences between the actual values, y, and the predictions, y^. These differences, (y − y^), are often called error terms or residuals.
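The least-squares idea can be illustrated with a minimal sketch. The data here are simulated, and the true coefficients (2, 3, -1.5) and the names fit, beta, and sse are assumptions chosen for illustration only. The sketch fits a model with lm(), rebuilds the predictions and residuals by hand, and checks that nudging a coefficient away from the fitted value increases the sum of squared errors.

```r
# Simulated data (hypothetical): y depends linearly on x1 and x2 plus noise.
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- 2 + 3 * x1 - 1.5 * x2 + rnorm(n, sd = 0.5)

# lm() picks the coefficients that minimize the sum of squared residuals.
fit  <- lm(y ~ x1 + x2)
beta <- coef(fit)  # beta[1] = intercept, beta[2] = x1, beta[3] = x2

# Predictions and residuals rebuilt by hand from the coefficients.
y_hat            <- beta[1] + beta[2] * x1 + beta[3] * x2
residuals_manual <- y - y_hat
sse              <- sum(residuals_manual^2)

# Shifting the intercept away from its fitted value makes the SSE larger.
y_hat_perturbed <- (beta[1] + 0.1) + beta[2] * x1 + beta[3] * x2
sse_perturbed   <- sum((y - y_hat_perturbed)^2)

print(c(sse = sse, sse_perturbed = sse_perturbed))
```

The hand-computed residuals should match resid(fit), which is a quick way to confirm that the fitted model is exactly the linear formula above.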
Once you have constructed a linear regression model, it is important to evaluate the model by going through the following steps:
- Check the significance of the coefficients, and remove insignificant independent variables.
- Check the R² value of the model.
- Check the predictive ability of the model on out-of-sample data.
- Check for multicollinearity.
Linear Regression in R – Suppose our training data frame is called “TrainData”, the dependent variable is called “DepVar”, and we have two independent variables, called “IndepVar1” and “IndepVar2”. Then you can build a linear regression model in R called “RegModel” as follows:
RegModel = lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)
To see the R² of the model, the coefficients, and the significance of the coefficients, use the summary() function:
summary(RegModel)
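The values printed by summary() can also be pulled out programmatically, which helps when checking coefficient significance. The following is a sketch on simulated data: the data-generating process, the 0.05 cutoff, and the names ModelSummary, CoefTable, PValues, and Significant are assumptions for illustration, not part of the original example.

```r
# Simulated training data (hypothetical values).
set.seed(7)
n <- 150
TrainData <- data.frame(IndepVar1 = rnorm(n), IndepVar2 = rnorm(n))
# DepVar depends on IndepVar1 only; IndepVar2 has no true effect by construction.
TrainData$DepVar <- 5 + 2 * TrainData$IndepVar1 + rnorm(n)

RegModel     <- lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)
ModelSummary <- summary(RegModel)

# R-squared of the model on the training data.
RSquared <- ModelSummary$r.squared

# Coefficient table with columns: Estimate, Std. Error, t value, Pr(>|t|).
CoefTable <- ModelSummary$coefficients
PValues   <- CoefTable[, "Pr(>|t|)"]

# Terms whose p-value falls below an assumed 0.05 significance cutoff.
Significant <- names(PValues)[PValues < 0.05]

print(RSquared)
print(Significant)
```

A variable that does not appear in Significant is a candidate for removal, after which the model would be refit and checked again.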
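To check for multicollinearity, correlations can be computed with the cor() function, either for a single pair of variables or for every pair of columns in the data frame (the second form requires all columns to be numeric):
cor(TrainData$IndepVar1, TrainData$IndepVar2)
cor(TrainData)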
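With more than a few variables, scanning the correlation matrix by eye becomes tedious. The sketch below flags highly correlated pairs automatically. It runs on simulated data; the third variable IndepVar3, the 0.7 cutoff (a common rule of thumb, not a fixed standard), and the names cor_matrix, high, and flagged are assumptions for illustration.

```r
# Simulated predictors (hypothetical): IndepVar2 is nearly a copy of IndepVar1.
set.seed(42)
n <- 200
TrainData <- data.frame(IndepVar1 = rnorm(n))
TrainData$IndepVar2 <- TrainData$IndepVar1 + rnorm(n, sd = 0.1)
TrainData$IndepVar3 <- rnorm(n)  # unrelated to the other two

# Pairwise correlations between all columns.
cor_matrix <- cor(TrainData)

# Assumed cutoff for "high" correlation.
threshold <- 0.7

# Look only above the diagonal so each pair is reported once.
high <- which(abs(cor_matrix) > threshold & upper.tri(cor_matrix), arr.ind = TRUE)

flagged <- data.frame(
  var1        = rownames(cor_matrix)[high[, "row"]],
  var2        = colnames(cor_matrix)[high[, "col"]],
  correlation = cor_matrix[high]
)
print(flagged)
```

When a pair is flagged, one option is to drop one of the two variables and refit the model.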
If our out-of-sample data, or test set, is called “TestData”, we can compute test set predictions and the test set R² as below:
TestPredictions = predict(RegModel, newdata=TestData)
SSE = sum((TestData$DepVar - TestPredictions)^2)
SST = sum((TestData$DepVar - mean(TrainData$DepVar))^2)
Rsquared = 1 - SSE/SST
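Thus Rsquared brings together three quantities: the test data, the model's predictions, and the training data. SSE measures the errors of the model's predictions on the test data, while SST measures the errors of a baseline that predicts the training-set mean of DepVar for every test observation. A value close to 1 means the model does much better than this baseline; unlike the training-set R², the test-set R² can be negative if the model does worse than the baseline.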
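The test-set calculation can be wrapped in a small helper and tried end to end. This is a sketch under stated assumptions: the data are simulated, the 200/100 train/test split is arbitrary, and the function name OutOfSampleR2 and the data frame AllData are hypothetical names introduced here.

```r
# Hypothetical helper: test-set R-squared with the training mean as baseline.
OutOfSampleR2 <- function(model, train, test, dep_var) {
  predictions <- predict(model, newdata = test)
  sse <- sum((test[[dep_var]] - predictions)^2)             # model errors on test data
  sst <- sum((test[[dep_var]] - mean(train[[dep_var]]))^2)  # baseline errors on test data
  1 - sse / sst
}

# Simulated data (hypothetical values), split into training and test sets.
set.seed(123)
n <- 300
AllData <- data.frame(IndepVar1 = rnorm(n), IndepVar2 = rnorm(n))
AllData$DepVar <- 1 + 2 * AllData$IndepVar1 - AllData$IndepVar2 + rnorm(n)

train_idx <- sample(seq_len(n), size = 200)
TrainData <- AllData[train_idx, ]
TestData  <- AllData[-train_idx, ]

RegModel <- lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)
Rsquared <- OutOfSampleR2(RegModel, TrainData, TestData, "DepVar")
print(Rsquared)
```

Passing the training data explicitly keeps the baseline honest: the mean used in SST comes only from data the model was allowed to see.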