Description

In this question, you will conduct variable selection on the AutoLoss data set from the previous assignment.

To begin your work on this question, as in the previous homework, run the following two lines of code. The first replaces question marks with NA while reading the data from the .csv file, and the second removes all observations that contain any NA.

AutoLoss <- read.csv("AutoLoss.csv", na.strings = "?")

AutoLoss <- na.omit(AutoLoss)

Recall that our goal is to predict the losses paid by an insurance company as a function of the predictors, which are several features of a vehicle. You will use the LASSO method to conduct variable selection.

(a) Convert the string variables in the original data set to dummy variables. Create a new data set that includes the dummy variables.
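
One possible sketch for this step, assuming the glmnet package is installed and the response column is named Losses; the object names x, y, and AutoLossDummies are illustrative, not required:

library(glmnet)

# model.matrix() expands character/factor columns into 0/1 dummy variables;
# the first column (the intercept) is dropped.
x <- model.matrix(Losses ~ ., data = AutoLoss)[, -1]
y <- AutoLoss$Losses

# Optional data-frame version that keeps the response next to the dummies
AutoLossDummies <- data.frame(Losses = y, x)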

(b) Fit a LASSO model to the data set, using Losses as the response (output variable) and all other variables as predictors (input variables), over a grid of λ values generated with nlambda = 10.
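
A minimal sketch, assuming the x and y objects built in (a); lasso.fit is an illustrative name:

# alpha = 1 gives the LASSO penalty; nlambda = 10 requests a path of 10 lambda values
lasso.fit <- glmnet(x, y, alpha = 1, nlambda = 10)
print(lasso.fit)   # number of nonzero coefficients, %deviance, and lambda at each step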

(c) For λ = 1: which predictors are included in the resulting model? For λ = 1.96: which predictors are included in the resulting model?
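
One way to read off the selected predictors, assuming the lasso.fit object from the sketch in (b); note that coef() interpolates along the fitted λ path by default when s is not exactly on the path:

# Coefficients at specific lambda values
b1 <- as.matrix(coef(lasso.fit, s = 1))
b2 <- as.matrix(coef(lasso.fit, s = 1.96))

# Predictors with nonzero coefficients at each lambda (intercept excluded)
rownames(b1)[-1][b1[-1, ] != 0]
rownames(b2)[-1][b2[-1, ] != 0]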

(d) What do you observe about the coefficient estimates you obtain as λ increases?
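
One way to visualize this, again assuming the lasso.fit object from (b):

# Coefficient paths: each curve is one predictor's estimate as lambda varies
plot(lasso.fit, xvar = "lambda", label = TRUE)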

(e) In this question, you will continue to use the LASSO method on the AutoLoss data set. This time, you will use 6-fold cross-validation to find the best value of λ. Remember to include set.seed(566) before calling cv.glmnet(), so we all end up making the same split. Paste the cross-validated MSE (mean squared error) plot. State the best value of the tuning parameter λ, that is, the value that minimizes the cross-validated MSE.
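
A sketch of the cross-validation step, assuming the x and y objects from (a); cv.fit is an illustrative name:

set.seed(566)                              # same folds for everyone
cv.fit <- cv.glmnet(x, y, alpha = 1, nfolds = 6)

plot(cv.fit)                               # cross-validated MSE vs. log(lambda)
cv.fit$lambda.min                          # lambda that minimizes the cross-validated MSE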

(f) Now find the largest value of λ for which the cross-validated error is within one standard error of the minimum error (the one-standard-error rule). Describe how sparse the model is (that is, how many variables are selected at this value of λ).
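
A sketch, assuming the cv.fit object from (e); b1se is an illustrative name:

cv.fit$lambda.1se                              # largest lambda within 1 SE of the minimum CV error
b1se <- as.matrix(coef(cv.fit, s = "lambda.1se"))
sum(b1se[-1, ] != 0)                           # number of selected predictors (intercept excluded)
rownames(b1se)[-1][b1se[-1, ] != 0]            # names of the selected predictors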

(g) Fit an OLS linear regression model using the variables selected in (f). Summarize your linear regression output. Report which variables are statistically significant at the 0.05 level. What is the adjusted R-squared of the linear model?
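
A sketch of the refit, assuming the cv.fit object from (e) and the x and y objects from (a); selected, ols.data, and ols.fit are illustrative names:

# Keep only the dummy columns selected at lambda.1se, then refit by OLS
b1se     <- as.matrix(coef(cv.fit, s = "lambda.1se"))
selected <- rownames(b1se)[-1][b1se[-1, ] != 0]

ols.data <- data.frame(Losses = y, x[, selected, drop = FALSE])
ols.fit  <- lm(Losses ~ ., data = ols.data)

summary(ols.fit)                 # t-tests for each coefficient and the adjusted R-squared
summary(ols.fit)$adj.r.squared   # adjusted R-squared alone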