library(dplyr)

Ames <- read.table('https://raw.githubusercontent.com/ajkirkpatrick/FS20/postS21_rev/classdata/ames.csv',
                   header = TRUE,
                   sep = ',') %>%
  dplyr::select(-Id)
11: LASSO
This assignment is due on Monday, April 7th
All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .rmd file - just the properly-knitted PDF. All assignments must be properly rendered to PDF using LaTeX. Make sure you start your assignment sufficiently early that you have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an RStudio instance on posit.cloud is always a feasible alternative.
Oh no. Really? Ames again?
Yes, Ames again. Let's predict some SalePrices!
Data cleaning
Repeat the data cleaning exercise from last week's lab. The point is to make sure that every observation is non-NA and that all predictor variables have more than one value. Use skimr::skim on Ames to find predictors that have only one value or that are missing many values. Take them out, and use na.omit to ensure there are no NA values left. Check to make sure you still have at least 800 or so observations!
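A minimal sketch of this step follows. The specific columns dropped here (Alley, PoolQC) are hypothetical examples; drop whichever columns skim flags in your data.

library(dplyr)
library(skimr)

skim(Ames)   # look for columns with a low complete_rate or a single unique value

Ames <- Ames %>%
  select(-Alley, -PoolQC) %>%   # hypothetical examples; use the columns skim flagged
  na.omit()                     # then drop any rows with remaining NA values

nrow(Ames)   # check: should still be roughly 800+ observations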
Predictive model
For the assignment below, we'll use glmnet::cv.glmnet to estimate a LASSO model. Note that you're asked to state 16 predictors and 5 interactions; you can go beyond this. Unlike our linear model building, complexity in LASSO is not controlled by writing out a bunch of formulas with more terms. It's controlled by the lambda parameter, so we write one formula and let lambda vary.
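For example, a formula with several predictors and 5 interactions might look like the sketch below. The column names are real Ames columns, but the selection is purely illustrative; substitute your own choices.

# Illustrative only: swap in your own predictors and interactions.
my_formula <- SalePrice ~ LotArea + OverallQual + OverallCond + YearBuilt +
  GrLivArea + TotalBsmtSF + FullBath + BedroomAbvGr +
  LotArea:OverallQual + YearBuilt:OverallCond + GrLivArea:FullBath +
  TotalBsmtSF:GrLivArea + OverallQual:YearBuilt

print(my_formula)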
1. Clean your data as described above. Choose up to 16 predictor variables and clean your data so that no NA values are left.
2. Choose at least 5 interactions between your predictor variables and print out the formula you'll use to predict SalePrice.
3. In your code, use set.seed(24224) so that your results will always be the same. Why do we need to set the seed? When we (well, glmnet::cv.glmnet) make the Train and Test sample(s), they are selected randomly. If you don't set the seed, you'll get slightly different answers every time you run it!
4. Use glmnet::cv.glmnet to estimate a LASSO model (see lecture notes this week) that predicts SalePrice given the observed data and your formula. Slide 33 shows cross-validation using both alpha and lambda; a LASSO model holds alpha fixed at alpha = 1, so we'll search using lambda as our tuning parameter. Call the resulting object net_cv.
- To do this, you'll have to make a matrix to give to cv.glmnet in the x argument, because cv.glmnet doesn't take a formula. You can use model.matrix() to create the matrix, and use that matrix as your x. model.matrix() will not add the SalePrice variable to the x matrix; you just have to give it SalePrice as the y variable. (A sketch of this step appears after this list.)
5. The resulting object will be a glmnet object. You can see the optimal lambda just by printing the object with print(net_cv) and looking at the min value for Lambda. Make sure the optimal (RMSE-minimizing) value of lambda is not the largest or smallest value of lambda you gave it; if it is, extend the range of lambdas until you get an interior solution. Following the instructions from our lecture notes' TRY IT, extract the lambdas and their respective RMSE values into a data.frame and make a plot similar to the RMSE plot from lecture (a plotting sketch appears after this list).
6. Answer the following question: What is the optimal lambda based on the plot/data? Do you see a minimum point in the plot?
7. Extracting the non-zero coefficients is a little tricky, but let's do it. We'll use the coef function to extract the coefficients. The coef function, when used on a glmnet object, takes the argument s, which is the lambda value for which you'd like to extract coefficients. Our s value should be the best value of lambda, which we can extract from net_cv$lambda.min. Put those together: coef(net_cv, s = net_cv$lambda.min). The output may be kinda long; that's OK.
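Here is a minimal sketch of steps 3 and 4, including the model.matrix step. It assumes Ames is your cleaned data and my_formula is a formula like the illustrative one above; dropping the intercept column is a common convention, since glmnet fits its own intercept.

library(glmnet)

set.seed(24224)

# model.matrix() expands factors and interactions into a numeric matrix.
# SalePrice, as the formula's response, is not included in the matrix.
# Drop the intercept column that model.matrix() adds.
x <- model.matrix(my_formula, data = Ames)[, -1]
y <- Ames$SalePrice

# alpha = 1 fixes the penalty to LASSO; cv.glmnet cross-validates over lambda.
net_cv <- cv.glmnet(x = x, y = y, alpha = 1)
print(net_cv)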
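For the plot in step 5, one approach is sketched below, under the assumption that RMSE can be taken as the square root of net_cv$cvm (the mean cross-validated MSE for the default gaussian family). Using ggplot2 is a styling choice here, not a requirement; match your plot to the one from lecture.

library(ggplot2)

lasso_df <- data.frame(
  lambda = net_cv$lambda,
  rmse   = sqrt(net_cv$cvm)   # cvm holds mean cross-validated MSE
)

ggplot(lasso_df, aes(x = lambda, y = rmse)) +
  geom_point() +
  geom_line() +
  scale_x_log10() +           # lambda values span several orders of magnitude
  labs(x = 'lambda (log scale)', y = 'Cross-validated RMSE')

net_cv$lambda.min   # the RMSE-minimizing lambda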
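Finally, a short sketch for step 7. The non-zero filtering at the end is an optional convenience, not something the assignment requires.

best_coefs <- coef(net_cv, s = net_cv$lambda.min)

# coef() returns a sparse matrix; convert it so we can keep only the
# coefficients that LASSO did not shrink exactly to zero.
cm <- as.matrix(best_coefs)
cm[cm[, 1] != 0, , drop = FALSE]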