Welcome to Spring Semester 2025.

Nonparametric Regression

Our data and goal

We want to use the wooldridge::wage2 data on wages to generate and tune a non-parametric model of wages using a regression tree (or decision tree, same thing). Use ?wage2 (after you’ve loaded the wooldridge package) to see the data dictionary.

We’ve learned that our RMSE calculations have a hard time with NAs in the data. So let’s use the skim output to tell us which variables have too many NA values (see n_missing) and should be dropped. Let’s set a high bar here and drop anything that isn’t 100% complete. Of course, there are other things we could do (impute the NAs, or make dummies for them), but for now, it’s easiest to drop them.

Remember, skim() is for your own exploration, it’s not something that you should include in your output.

Once cleaned, we should be able to use rpart(wage ~ ., data = wage_clean) and not have any NAs in our prediction. That’s what we’ll do in our first Breakout.
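As a sketch of that cleaning step (assuming the usual wooldridge and skimr loading; wage_clean is the name we’ll use going forward):

```r
library(wooldridge)
library(skimr)

data("wage2")

# skim() reports n_missing for each variable -- explore this interactively,
# but don't include it in your output
skim(wage2)

# Keep only the columns with zero NAs (100% complete)
wage_clean = wage2[, colSums(is.na(wage2)) == 0]
```

The colSums(is.na(...)) trick is one way to automate the “drop anything that isn’t 100% complete” rule; reading the n_missing column of skim() and dropping columns by name works just as well.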

TRY IT

Predicting wage

I’m going to have you form groups of 2-3 to work on live-coding. One person in the group will need to have RStudio up, be able to share their screen, and have the correct packages loaded (caret, rpart, and rpart.plot, plus skimr). Copy the rmse and get_rmse code into a blank R script (you don’t have to use RMarkdown; just use a blank R script and run from there). You’ll also want to have these functions from last week loaded:

rmse = function(actual, predicted) {
  sqrt(mean((actual - predicted) ^ 2))
}
get_rmse = function(model, data, response) {
  rmse(actual = subset(data, select = response, drop = TRUE),
       predicted = predict(model, data))
}

For the first breakout, all I want you to do is the following. We are trying to predict wage for this exercise:

  • Estimate a default regression tree on wage_clean using the default parameters (use the whole dataset for now, we’ll train-test on the problem set).
  • Use rpart.plot to visualize your regression tree, and talk through the interpretation with each other.
  • Calculate the RMSE for your regression tree.
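A minimal sketch of those three steps (assuming wage_clean from the cleaning step and the rmse/get_rmse helpers above are loaded):

```r
library(rpart)
library(rpart.plot)

# 1. Default regression tree on the full (cleaned) dataset
tree_default = rpart(wage ~ ., data = wage_clean)

# 2. Visualize the splits
rpart.plot(tree_default)

# 3. RMSE on the same data (no train-test split yet)
get_rmse(tree_default, data = wage_clean, response = "wage")
```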

We’ll spend about 5 minutes on this. Remember, you can use ?wage2 to see the variable names. Make sure you know what variables are showing up in the plot and explaining wage in your model. You may find something odd at first and may need to drop more variables…

5 minutes

Breakout #1 discussion

Let’s choose a group to share their plot and discuss the results. Use our sharing zoom link: bit.ly/EC242

Side note: using lapply to repeat something

As we did in the last two labs, we’ll use lapply to iterate over a wide range of “tuning parameters” to optimize our model (find the lowest test RMSE).

Once we have a list of things we want to iterate on, lapply lets us do something to each element of the list, and then returns a list of results (of the same length as the input). The arguments of lapply are (1) the list we want to iterate over, (2) the function we want to use, and, when needed, (3) other arguments to pass to the function (that are the same for every element in the list). The list we want to iterate over might be a list of models (as we do when we want to use get_rmse on each model), or it might be a list (or a vector that can be coerced to a list) of tuning parameters.
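For example, with a built-in function and no extra arguments, lapply just applies the function to each element and returns a list of the same length:

```r
lapply(c(1, 4, 9), sqrt)
# returns a list: 1, 2, 3

# Extra arguments (the same for every element) go after the function name:
vals = list(c(1, 2, NA), c(3, 4))
lapply(vals, mean, na.rm = TRUE)
```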

Let’s say we wanted to repeat our TRY IT with the cp values of c(0, .01, .02, .03). We could make a list of those values and use lapply to run rpart() once with each of the 4 values of cp. One way we might try doing this would be something like:

myList = lapply(c(0, .01, .02, .03), rpart, formula = wage ~ ., data = wage_clean, cp = ???)

The problem is that lapply passes each element of the vector in as the first (unnamed) argument of the function, and the first argument of rpart is not the cp parameter, so we can’t direct the values into cp this way.

Instead, we previously introduced the anonymous function in lapply as follows:

myList = lapply(c(0, .01, .02, .03), function(x){
  rpart(wage ~ ., data = wage_clean, cp = x)
})

Well, it gets us the right answer, but whaaaaaat is going on? Curly brackets? x?

This is an “anonymous function”, or a function created on the fly. Here’s how it works in lapply:

  • The first argument is the list you want to do something to (same as before)
  • The second argument would usually be the function you want to apply, like get_rmse
  • Here, we’re going to ask R to temporarily create a function that takes one argument, x.
  • x is going to be each element of the input vector (here, c(0, .01, .02, .03)).
  • Think of it as a loop:
    • Take the first element of the input and refer to it as x. Run the function’s code.
    • Then take the second element of the input, refer to it as x, and run the function’s code again.
    • Repeat until every element of the input has been used.
  • Each time the anonymous function is applied to an x, the result is passed back and saved as an element of the output list (myList), always in the same position the x was taken from.

The curly brackets enclose the body of the temporary function you want to create.

So you can think of it like this:

x = 0
myList[[1]] = rpart(wage ~ ., data = wage_clean, cp = x)

x = 0.01
myList[[2]] = rpart(wage ~ ., data = wage_clean, cp = x)

x = 0.02
myList[[3]] = rpart(wage ~ ., data = wage_clean, cp = x)

As long as every entry in the input vector is a valid value of cp, you’ll get the fitted trees back in a list. Then, unlist just coerces the list to a vector. This is also the same way we get the RMSEs, but instead of an anonymous function, we use one we’ve already written out (get_rmse).
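For instance, since get_rmse already takes the model as its first argument, we can pass it to lapply directly, with data and response as the shared extra arguments (a sketch, assuming myList and wage_clean from above):

```r
rmseList = lapply(myList, get_rmse, data = wage_clean, response = "wage")
unlist(rmseList)  # numeric vector, one RMSE per model
```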

If you had a list of rpart objects and wanted to, say, extract the cp tuning parameter from each, you could also use an anonymous function. Each rpart object has a list called control that includes all of the control (tuning) parameters used in its estimation:

cpList = lapply(myList, function(y){
  y$control$cp
})

unlist(cpList)
TRY IT

Breakout #2

Using the loop method or the lapply method, generate 10 regression trees to explain wage in wage_clean. You can iterate through values of cp, the complexity parameter, or minsplit, the minimum number of observations that must exist in a node for a split to be attempted.
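One possible lapply sketch (the ten cp values here are just an illustrative grid; your group can pick its own, or iterate over minsplit instead):

```r
library(rpart)

cp_grid = seq(0, 0.045, by = 0.005)  # 10 values of the complexity parameter

treeList = lapply(cp_grid, function(x){
  rpart(wage ~ ., data = wage_clean, cp = x)
})
```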

10 minutes

Breakout #2 discussion

We’ll ask someone to share (a few of) their group’s 10 regression trees using rpart.plot. Use bit.ly/KIRKPATRICK to volunteer to share.

TRY IT

Breakout #3 (if time)

Use lapply to get a list of your RMSEs (one for each of your models). The anonymous function may come in handy here (though it’s not necessary). Note that we are not yet splitting into test and train (which you will need to do on your lab assignment).

Once you have your list, create the plot of RMSEs similar to the one we looked at in Content this week. Note: you can use unlist(myRMSE) to get a numeric vector of the RMSEs (as long as all of the elements in myRMSE are numeric). Then, it’s a matter of plotting with either base plot or ggplot (if you use ggplot, you’ll have to tidy the RMSEs by adding an index column to serve as the x-axis).
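A base-plot sketch, assuming myRMSE is your list of RMSEs and cp_grid is the vector of tuning values you iterated over:

```r
rmse_vec = unlist(myRMSE)

plot(cp_grid, rmse_vec, type = "b",
     xlab = "cp (complexity parameter)", ylab = "RMSE")
```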