**Linear regression** is used to determine how an outcome variable, called the **dependent** variable, linearly depends on a set of known variables, called the **independent** variables. The dependent variable is typically denoted by $y$ and the independent variables are denoted by $x_1, x_2, \ldots, x_k$, where $k$ is the number of independent variables. We are interested in finding the best possible coefficients $\beta_0, \beta_1, \beta_2, \ldots, \beta_k$ such that our predicted values:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_k x_k$$

are as close as possible to the actual $y$ values. This is achieved by minimizing the sum of the squared differences between the actual values, $y$, and the predictions, $\hat{y}$. These differences, $(y - \hat{y})$, are often called *error terms* or **residuals**.

Once you have constructed a linear regression model, it is important to evaluate the model by going through the following steps:

- Check the significance of the coefficients, and remove insignificant independent variables.
- Check the **R²** value of the model.
- Check the predictive ability of the model on out-of-sample data.
- Check for multicollinearity.

**Linear Regression in R –** Suppose our training data frame is called “TrainData”, the dependent variable is called “DepVar”, and we have two independent variables, called “IndepVar1” and “IndepVar2”. Then you can build a linear regression model in R called “RegModel” as follows:

    RegModel <- lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)
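To connect the `lm()` call with the least-squares criterion described earlier, here is a minimal sketch on synthetic data. The data values, the seed, and the names `X`, `y`, `BetaHat`, and `MaxCoefDiff` are assumptions made for illustration only. It compares the coefficients from `lm()` with the closed-form least-squares solution of the normal equations.

```r
# Synthetic training data (hypothetical values, for illustration only)
set.seed(1)
TrainData <- data.frame(IndepVar1 = rnorm(50), IndepVar2 = rnorm(50))
TrainData$DepVar <- 2 + 3 * TrainData$IndepVar1 - TrainData$IndepVar2 +
  rnorm(50, sd = 0.5)

# Fit the model with lm(), as in the text
RegModel <- lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)

# Closed-form least-squares solution: solve (X'X) beta = X'y,
# where the first column of X is the intercept term
X <- cbind(1, TrainData$IndepVar1, TrainData$IndepVar2)
y <- TrainData$DepVar
BetaHat <- solve(t(X) %*% X, t(X) %*% y)

# Largest absolute difference between the two sets of coefficients
MaxCoefDiff <- max(abs(as.vector(BetaHat) - unname(coef(RegModel))))

# Sum of squared residuals, the quantity that the fit minimizes
SSE <- sum(residuals(RegModel)^2)
```

Under these assumptions the two coefficient vectors should agree up to floating-point rounding, which suggests that `lm()` is solving the same minimization problem.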

To see the **R²** of the model, the coefficients, and the significance of the coefficients, use the *summary()* function:

    summary(RegModel)
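The printed summary can also be inspected programmatically. The sketch below uses synthetic data in which “IndepVar2” has no effect on “DepVar” by construction; the data, the seed, and the 5% significance cutoff are assumptions chosen for illustration, not a recommendation.

```r
# Synthetic data: DepVar depends on IndepVar1 only; IndepVar2 is pure noise
set.seed(2)
TrainData <- data.frame(IndepVar1 = rnorm(100), IndepVar2 = rnorm(100))
TrainData$DepVar <- 1 + 2 * TrainData$IndepVar1 + rnorm(100)

RegModel <- lm(DepVar ~ IndepVar1 + IndepVar2, data = TrainData)
ModelSummary <- summary(RegModel)

# Coefficient table: estimate, standard error, t value, and p-value per term
CoefTable <- ModelSummary$coefficients
PValues <- CoefTable[, "Pr(>|t|)"]

# R-squared and adjusted R-squared of the fitted model
RSquared <- ModelSummary$r.squared
AdjRSquared <- ModelSummary$adj.r.squared

# Independent variables that are not significant at the assumed 5% level
Insignificant <- setdiff(names(PValues)[PValues > 0.05], "(Intercept)")
```

Any variable listed in `Insignificant` is a candidate for removal, after which the model would be refit and checked again.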

To check for multicollinearity, correlations can be computed with the *cor()* function:

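    # Correlation between two specific independent variables
    cor(TrainData$IndepVar1, TrainData$IndepVar2)
    # Correlation matrix for all columns of the data frame (columns must be numeric)
    cor(TrainData)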
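When there are many independent variables, scanning the full correlation matrix by eye becomes tedious. The following is one possible sketch that lists the highly correlated pairs. The synthetic data, the third variable `IndepVar3`, and the cutoff of 0.7 are all assumptions made for illustration; the appropriate cutoff depends on the problem.

```r
# Synthetic data: IndepVar2 is built to track IndepVar1 closely,
# while IndepVar3 is generated independently
set.seed(3)
IndepVar1 <- rnorm(200)
IndepVar2 <- IndepVar1 + rnorm(200, sd = 0.1)
IndepVar3 <- rnorm(200)
TrainData <- data.frame(IndepVar1, IndepVar2, IndepVar3)

CorMatrix <- cor(TrainData)

# Flag pairs whose absolute correlation exceeds the assumed cutoff.
# upper.tri() keeps each pair once and skips the diagonal.
Cutoff <- 0.7
HighCor <- which(abs(CorMatrix) > Cutoff & upper.tri(CorMatrix), arr.ind = TRUE)
HighPairs <- data.frame(
  Var1 = rownames(CorMatrix)[HighCor[, "row"]],
  Var2 = colnames(CorMatrix)[HighCor[, "col"]],
  Correlation = CorMatrix[HighCor]
)
```

With this data, `HighPairs` should contain only the “IndepVar1”/“IndepVar2” pair, since that is the only strong relationship built into the example.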

If our out-of-sample data, or test set, is called “TestData”, we can compute test set predictions and the test set R² as below:

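    # Predictions of the trained model on the test set
    TestPredictions <- predict(RegModel, newdata = TestData)
    # Sum of squared errors of the model on the test set
    SSE <- sum((TestData$DepVar - TestPredictions)^2)
    # Total sum of squares, using the training-set mean as the baseline prediction
    SST <- sum((TestData$DepVar - mean(TrainData$DepVar))^2)
    Rsquared <- 1 - SSE/SST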

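Thus the test set R² involves three quantities: the actual test values, the model's predictions, and the mean of the training data. SSE measures the test data against the model's predictions, while SST measures the test data against the training-set mean, which serves as the baseline prediction.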
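To see how this comparison behaves, the calculation can be wrapped in a small function and tried on a few hand-made cases. The helper name `TestRSquared` and the numbers below are hypothetical and exist only for illustration.

```r
# Hypothetical helper: out-of-sample R-squared with the training mean as baseline
TestRSquared <- function(Actual, Predicted, TrainMean) {
  SSE <- sum((Actual - Predicted)^2)   # model error on the test set
  SST <- sum((Actual - TrainMean)^2)   # baseline error on the test set
  1 - SSE / SST
}

# Made-up test values and training mean
Actual <- c(1, 2, 3, 4)
TrainMean <- 2

# Perfect predictions: SSE is 0
PerfectFit <- TestRSquared(Actual, Actual, TrainMean)

# Predicting the training mean everywhere: SSE equals SST
BaselineFit <- TestRSquared(Actual, rep(TrainMean, 4), TrainMean)

# Predictions worse than the baseline: SSE exceeds SST
PoorFit <- TestRSquared(Actual, c(4, 3, 2, 1), TrainMean)
```

Here `PerfectFit` is 1, `BaselineFit` is 0, and `PoorFit` is negative. Unlike the training-set R², the test set R² can fall below zero when the model predicts worse than the training-set mean.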