Welcome to Spring Semester 2025.

Shrinkage with LASSO, Ridge, and Random Forests

Required Reading

  • This page.

Guiding Questions

  • What is shrinkage?
  • What do we do with too many right-hand-side variables?
  • What is LASSO?

This week is a little different

We will use the lecture slides from my friend and ace economist, Ed Rubin (U of Oregon). They are available right here. A couple things first, though:

The data and packages

Dr. Rubin uses the Credit dataset from the ISLR package (which you may need to install). You’ll also want to install the wooldridge package for our later work:

library(ISLR)
credit = ISLR::Credit
head(credit)
  ID  Income Limit Rating Cards Age Education Gender Student Married Ethnicity
1  1  14.891  3606    283     2  34        11   Male      No     Yes Caucasian
2  2 106.025  6645    483     3  82        15 Female     Yes     Yes     Asian
3  3 104.593  7075    514     4  71        11   Male      No      No     Asian
4  4 148.924  9504    681     3  36        11 Female      No      No     Asian
5  5  55.882  4897    357     2  68        16   Male      No     Yes Caucasian
6  6  80.180  8047    569     4  77        10   Male      No      No Caucasian
  Balance
1     333
2     903
3     580
4     964
5     331
6    1151

We will also need to load the caret package (which you’ve used before), as well as the glmnet package, which is new for us.

Terminology

I used to have this lecture at the end of our semester, but I think the intuition behind LASSO and ridge regression helps us understand our “overfitting” problem. There are two terms I want to cover before we dive into the slides:

  1. Bias vs. Variance: We saw that super-overfit polynomial last week (where we took 20 observations and fit a 16th-degree polynomial). The model we fit was very flexible and bendy, and it did get most of the data points right. It had low bias as it was generally right, but had huge variance – it was all over the place, even within a small range of spending on an advertising mode. Bias vs. variance refers to the innate tradeoff between these two things. When we used the train and test samples to get the best out-of-sample fit, we were balancing bias and variance.

  2. Cross validation: This is the term for using two (or more) different subsets of sample data (e.g. train and test) to fit and evaluate a model.

Now, back to the lecture notes for today

Note

Try it! Use the wooldridge::wage2 dataset and LASSO to predict wage. Since these shrinkage methods (LASSO, ridge) work well with many right-hand-side variables, we can create some additional variables. A good candidate would be squared versions of all of the numeric variables. To do this, we’ll use mutate_if along with is.numeric. The function mutate_if, when used in a pipe %>%, will check each column to see if the given function returns TRUE for that column, and will add a new mutated version of that column when it does. We need this because we don’t want to try to add a squared term for things like factor variables. It will look something like this:

# for every numeric column, add a new column equal to its square
data %>%
  mutate_if(is.numeric, list(squared = function(x) x^2))

We pass a list with potentially many functions, but only one here in the example. For each passed function, mutate_if will create a copy of each existing column, square the values, and name it by appending the list name (here, ‘squared’) to the original column name.

This will result in some columns we don’t want to keep – wage_squared should be dropped since wage is the outcome we want to predict. We should also drop lwage and lwage_squared since those are transformations of our outcome. We want our data to only have the outcome and all the possible predictors so we can simply pass everything but wage for our X’s and wage for our y. A sketch of the full workflow appears after the questions below.

  1. Drop all of the copies of the outcome variable and transformations of the outcome variable. Also, drop any of the squared terms of binary variables – if a variable is {0,1}, then the squared term is exactly the same. For instance, married and married_squared are exactly the same.

  2. How many predictors (right-hand-side variables) do we have?

  3. Use glmnet from the glmnet package to run a LASSO (alpha=1) using all of the data you’ve assembled in the wage2 dataset. Select a wide range of values for lambda. You can use cv.glmnet to do the cross-validation for you (see the Rubin lecture notes), or you can do it manually.

  4. Find the lambda that minimizes RMSE in the test data. You can extract the sequence of lambdas tried from a cv.glmnet object by referring to object$lambda, and the corresponding RMSEs using sqrt(object$cvm). These can be used to make the plots in the Rubin lecture notes, and to find the lambda that minimizes RMSE. object$lambda.min will also tell you the optimal lambda.

  5. Using the optimal lambda, run a final LASSO model.

  6. Use coef to extract the coefficients from the optimal model. Coefficients that are zeroed out by the LASSO are shown as ‘.’

  • Which variables were “kept” in the model?
  • Which variables were eliminated from the model at the optimal lambda?
  • Is the train RMSE lower than it would be if we just ran OLS with all of the variables?
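
If you get stuck, here is one possible sketch of steps 1–6, assuming the dplyr, glmnet, and wooldridge packages are installed. The object names (wage2_prepped, lasso_cv, X, y) are just illustrative, the squared column names come from the mutate_if naming pattern described above, and dropping rows with missing values first is one simple (though not the only) way to keep glmnet happy.

library(dplyr)
library(glmnet)
library(wooldridge)

# 1. square the numeric columns, drop rows with missing values, then drop
#    the outcome transformations and the squared binary variables
wage2_prepped = wage2 %>%
  na.omit() %>%
  mutate_if(is.numeric, list(squared = function(x) x^2)) %>%
  select(-wage_squared, -lwage, -lwage_squared,
         -married_squared, -black_squared, -south_squared, -urban_squared)

# 2. glmnet wants a numeric matrix of predictors and a vector outcome
X = model.matrix(wage ~ . - 1, data = wage2_prepped)
y = wage2_prepped$wage

# 3.-4. cross-validated LASSO (alpha = 1) over glmnet's default lambda grid
lasso_cv = cv.glmnet(X, y, alpha = 1)
lasso_cv$lambda.min              # lambda that minimizes cross-validated error

# 5.-6. coefficients at the optimal lambda; '.' means zeroed out by the LASSO
coef(lasso_cv, s = "lambda.min")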

Random Forests

The decision trees (or “regression trees”) we previously saw were hopefully an intuitive way of partitioning out the predictor variable space into prediction areas. By “partitioning”, I mean chopping up the predictor space in a way that explains the data best, then using the resulting regions as the prediction for any value of the predictors. From ISLR Chapter 8:

Source: ISLR

We can change the two tuning parameters, cp and minsplit, to make the partition more or less detailed (have more or fewer terminal nodes), and there are actually other parameters we can play with in regression trees, though we won’t address those here. If we tune those parameters by cross-validation, then we can generate a pretty good predictor with our chosen tree. But…we can still do better.

Regression trees still have quite a bit of variance, and we’d love to reduce that variance. One way of doing this is by constructing a random forest. As the name indicates, a random forest is a whole bunch of trees put together. But how do we combine multiple trees? And how do we make multiple trees if we have one dataset that leads to one optimal tree?

Multiple trees by bootstrap

The “bootstrap” is a statistical term that means, essentially, randomly resampling. We can “bootstrap” our existing (training) data by randomly drawing observations from the data. Usually, if we have n observations, we’ll draw n observations with replacement, which gives us a slightly different dataset: some observations will not be in the bootstrapped sample, and some will be represented 2+ times. All of the bootstrapped samples follow the same distribution of the predictors, but are different realizations. It’s like getting more random samples for free!
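
For concreteness, here is a minimal sketch of a single bootstrap draw, using the credit data we loaded above as the example sample:

set.seed(1)
n = nrow(credit)
boot_idx = sample(n, size = n, replace = TRUE)   # draw n row indices with replacement
credit_boot = credit[boot_idx, ]

# some observations show up more than once, others not at all
sum(duplicated(boot_idx))        # number of repeated draws
mean(!(1:n %in% boot_idx))       # share of original rows left out (about 1/e, or 0.37)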

So let’s say you draw sample b=1 from the n observations. It is also of size n, but is different from the original sample. Then you draw sample b=2, and so on, all the way up to, say, b = B = 250. Now you have 250 different samples of data, and each one will generate a different regression tree, even using the same tuning parameter values.

Combining multiple trees

We need a way of generating an RMSE for any given tuning parameter value, but now we have B different trees. We can take our test data and put it into each of the B trees and get a predicted value, right? And when the tree is different, the prediction will be different. If we take the average over all B predictions, we’ll get a prediction that has less variance: even when each tree’s prediction has variance σ², the variance of an average of B (roughly independent) predictions scales like σ²/B.

So the RMSE is:

$$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(\frac{1}{B}\sum_{b=1}^{B}\hat{f}^{b}(x_i) - y_i\right)^2}$$

While it takes a bit longer to estimate this (especially when B is large), you get much more stable results that often predict better on an evaluation hold-out sample.
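
To make the idea concrete, here is a rough sketch of this build-many-trees-and-average approach “by hand” on the credit data, assuming the rpart package is installed; it fits B = 50 trees on bootstrap samples, averages their predictions, and computes an RMSE. For brevity the RMSE here is in-sample; in practice you would use a held-out test set and tune cp and minsplit.

library(rpart)

set.seed(2)
B = 50
n = nrow(credit)

# fit one regression tree per bootstrap sample
bagged_trees = lapply(1:B, function(b) {
  boot_idx = sample(n, size = n, replace = TRUE)
  rpart(Balance ~ Income + Limit + Rating + Age, data = credit[boot_idx, ])
})

# average the B predictions for each observation, then compute the RMSE
pred_matrix = sapply(bagged_trees, predict, newdata = credit)   # n x B matrix
bagged_pred = rowMeans(pred_matrix)
sqrt(mean((bagged_pred - credit$Balance)^2))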

OK, technically that’s “bagging”

Random Forest has one other wrinkle – when choosing amongst the candidate predictors for each split, the Random Forest method will choose from a randomly selected subset of the predictors. The reasoning given in ISLR is intuitive: if there is one predictor that predicts quite well for the sample, it will always be selected, and little emphasis will be placed on predictors aside from this one. By randomly leaving it out, some trees are forced to fit without that predictor, so they won’t all look the same. When you don’t take the subset, you are bagging. When you do take a subset, you are estimating a random forest.
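
If you want to see the distinction in code, here is a small sketch using the randomForest package (which you would need to install); the mtry argument is the number of predictors considered at each split, so setting mtry to the full number of predictors gives bagging, while a smaller mtry gives a random forest.

library(randomForest)

set.seed(4)
p = ncol(credit) - 2     # number of predictors, excluding ID and the outcome Balance

# bagging: consider all p predictors at every split
bag_mod = randomForest(Balance ~ . - ID, data = credit, ntree = 250, mtry = p)

# random forest: consider a random subset (here roughly p/3) at each split
rf_mod = randomForest(Balance ~ . - ID, data = credit, ntree = 250, mtry = floor(p / 3))

bag_mod
rf_mod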

You won’t need to use this on your Project 2; it’s purely for your own info. Random forests are common out in the world, and are useful to know, especially since you already know their fundamental element – the tree.

More examples of bias, variance, and model flexibility

If we have time, we can dive back into a little more about model flexibility, bias, and variance….

Model Flexibility

Let’s return to the simulated dataset we used occasionally in the linear regression content. Recall there was a single feature x with the following properties:

# define regression function
cubic_mean = function(x) {
  1 - 2 * x - 3 * x ^ 2 + 5 * x ^ 3
}

We then generated some data around this function with some added noise:

# define full data generating process
library(tibble)   # tibble() comes from the tibble package

gen_slr_data = function(sample_size = 100, mu) {
  x = runif(n = sample_size, min = -1, max = 1)
  y = mu(x) + rnorm(n = sample_size)
  tibble(x, y)
}

After defining the data generating process, we generate and split the data.

# simulate entire dataset
set.seed(3)
sim_slr_data = gen_slr_data(sample_size = 100, mu = cubic_mean)

# test-train split
slr_trn_idx = sample(nrow(sim_slr_data), size = 0.8 * nrow(sim_slr_data))
slr_trn = sim_slr_data[slr_trn_idx, ]
slr_tst = sim_slr_data[-slr_trn_idx, ]

# estimation-validation split
slr_est_idx = sample(nrow(slr_trn), size = 0.8 * nrow(slr_trn))
slr_est = slr_trn[slr_est_idx, ]
slr_val = slr_trn[-slr_est_idx, ]

# check data
head(slr_trn, n = 10)
# A tibble: 10 × 2
        x      y
    <dbl>  <dbl>
 1  0.573 -1.18 
 2  0.807  0.576
 3  0.272 -0.973
 4 -0.813 -1.78 
 5 -0.161  0.833
 6  0.736  1.07 
 7 -0.242  2.97 
 8  0.520 -1.64 
 9 -0.664  0.269
10 -0.777 -2.02 

For validating models, we will use RMSE.

# helper function for calculating RMSE
calc_rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}

Let’s check how linear, k-nearest neighbors, and decision tree models fit to this data make errors, while paying attention to their flexibility.

This picture is an idealized version of what we expect to see, but we’ll illustrate the sorts of validation “curves” that we might see in practice.

Note that in the following three sub-sections, a significant portion of the code is suppressed for visual clarity. See the source document for full details.

Linear Models

First up, linear models. We will fit polynomial models with degree from one to nine, and then validate.

# fit polynomial models
poly_mod_est_list = list(
  poly_mod_1_est = lm(y ~ poly(x, degree = 1), data = slr_est),
  poly_mod_2_est = lm(y ~ poly(x, degree = 2), data = slr_est),
  poly_mod_3_est = lm(y ~ poly(x, degree = 3), data = slr_est),
  poly_mod_4_est = lm(y ~ poly(x, degree = 4), data = slr_est),
  poly_mod_5_est = lm(y ~ poly(x, degree = 5), data = slr_est),
  poly_mod_6_est = lm(y ~ poly(x, degree = 6), data = slr_est),
  poly_mod_7_est = lm(y ~ poly(x, degree = 7), data = slr_est),
  poly_mod_8_est = lm(y ~ poly(x, degree = 8), data = slr_est),
  poly_mod_9_est = lm(y ~ poly(x, degree = 9), data = slr_est)
)
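
The plotting code itself is suppressed, but the numbers behind the plot come from calculations along these lines, using the calc_rmse helper defined earlier (the object names here are just illustrative):

# estimation (training) and validation RMSE for each polynomial degree
poly_rmse_est = sapply(poly_mod_est_list, function(mod) {
  calc_rmse(actual = slr_est$y, predicted = predict(mod, slr_est))
})
poly_rmse_val = sapply(poly_mod_est_list, function(mod) {
  calc_rmse(actual = slr_val$y, predicted = predict(mod, slr_val))
})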

The plot below visualizes the results.

What do we see here? As the polynomial degree increases:

  • The training error decreases.
  • The validation error decreases, then increases.

This more or less matches the idealized version above, but the validation “curve” is much more jagged. This is something that we can expect in practice.

We have previously noted that training error isn’t particularly useful for validating models. That is still true. However, it can be useful for checking that everything is working as planned. In this case, since we know that training error decreases as model flexibility increases, we can verify our intuition that a higher degree polynomial is indeed more flexible.

k-Nearest Neighbors

Next up, k-nearest neighbors. We will consider values for k that are odd and between 1 and 45 inclusive.

library(caret)
# helper function for fitting knn models
fit_knn_mod = function(neighbors) {
  knnreg(y ~ x, data = slr_est, k = neighbors)
}
# define values of tuning parameter k to evaluate
k_to_try = seq(from = 1, to = 45, by = 2)

# fit knn models
knn_mod_est_list = lapply(k_to_try, fit_knn_mod)
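
As before, the suppressed plotting code rests on RMSE calculations of roughly this form:

# estimation and validation RMSE for each value of k
knn_rmse_est = sapply(knn_mod_est_list, function(mod) {
  calc_rmse(actual = slr_est$y, predicted = predict(mod, slr_est))
})
knn_rmse_val = sapply(knn_mod_est_list, function(mod) {
  calc_rmse(actual = slr_val$y, predicted = predict(mod, slr_val))
})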

The plot below visualizes the results.

Here we see the “opposite” of the usual plot. Why? Because with k-nearest neighbors, a small value of k generates a flexible model compared to larger values of k. So visually, this plot is flipped. That is, we see that as k increases:

  • The training error increases.
  • The validation error decreases, then increases.

Important to note here: the pattern above only holds “in general,” that is, there can be minor deviations in the validation pattern along the way. This is due to the random nature of selecting the data for the validation set.

Decision Trees

Lastly, we evaluate some decision tree models. We choose some arbitrary values of cp to evaluate, while holding minsplit constant at 5. These are arbitrary choices that produce a plot that is useful for discussion.

# helper function for fitting decision tree models
library(rpart)
library(rpart.plot)
fit_tree_mod = function(flex) {
  rpart(y ~ x, data = slr_est, cp = flex, minsplit = 5)
}
# define values of tuning parameter cp to evaluate
cp_to_try = c(0.5, 0.3, 0.1, 0.05, 0.01, 0.001, 0.0001)

# fit decision tree models
tree_mod_est_list = lapply(cp_to_try, fit_tree_mod)

The plot below visualizes the results.

Based on this plot, how is cp related to model flexibility?

Footnotes

  1. In practice, if you already know how your model’s flexibility works, then checking that the training error goes down as you increase flexibility is a way to verify that you have done your coding and model training correctly.↩︎

  2. As cp increases, model flexibility decreases.↩︎