10: Nonparametric Models

Due Date

This assignment is due on Monday, March 31st.

All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .Rmd file, just the properly knitted PDF. All assignments must be properly rendered to PDF using LaTeX. Start your assignment early enough that you have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an RStudio instance on posit.cloud is always a feasible alternative.

Backstory and Set Up

You work for a bank. This bank is trying to predict defaults on loans. Defaults are costly to the bank and, although they are rare, avoiding them is how banks make money. The bank has given you a dataset on defaults (encoded as the variable default). You're going to try to predict this (that is, default is your target variable).

This is some new data. The snippet below loads it.

bank <- read.csv("https://ec242.netlify.app/data/bank23.csv",
                 stringsAsFactors = FALSE)

There’s not going to be a whole lot of wind-up here. You should be well-versed in doing these sorts of things by now (if not, look back at the previous lab for sample code).

EXERCISE 1

Check the data using skim and str to see what sort of data you have. kNN, as we’ve covered it so far, takes an average of the target variable for the $k$ nearest neighbors. Do any data processing necessary to use kNN to predict default.
kNN needs to make a numeric prediction. Since default is not numeric, make a new column for it that is numeric. Of course, you’ll need to encode the numbers in a meaningful way (as.numeric('no') will do you no good).
Split the data into an 80/20 train vs. test split. Make sure you explicitly set the seed for replicability.
Run a series of kNN models with $k$ ranging from 2 to 200. Use whatever variables you think will help predict defaults. Remember, $k$ is our complexity parameter: we do not add or subtract any of the explanatory variables, we vary only $k$. You must have at least 50 different values of $k$. You can easily write a short function to do this using this week's lessons and should avoid hand-coding 50 different models.
Create a chart plotting the model complexity as the $x$-axis variable and RMSE as the $y$-axis variable for both the training and test data. Pay attention to which values of $k$ are "higher" in complexity and which are "lower" in complexity, and make sure the $x$-axis is increasing in complexity as you go to the right.
Answer the following questions:

What do you think is the optimal $k$?
What are you using to decide the optimal $k$?
If we were to allow the model a little more complexity than the optimal, how will our training RMSE change? How will our test RMSE change?
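The workflow in this exercise can be sketched end to end on made-up data. The block below is only an illustration under stated assumptions: the data frame toy and its columns age and balance are hypothetical stand-ins for the real bank data, and knn_predict is a small hand-rolled base-R helper, not the kNN function from this week's lessons. It encodes default as 0/1, makes a seeded 80/20 split, sweeps 50 values of $k$, and computes RMSE, where $\text{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$.

```r
# Synthetic stand-in for the bank data (ASSUMPTION: `age` and `balance`
# are hypothetical predictors; only `default` comes from the assignment).
set.seed(242)
n <- 500
toy <- data.frame(
  age     = round(runif(n, 18, 80)),
  balance = round(rnorm(n, 1500, 800)),
  default = sample(c("no", "yes"), n, replace = TRUE, prob = c(0.9, 0.1)),
  stringsAsFactors = FALSE
)

# Encode the target as 0/1 so the neighbor average is a default rate.
toy$default_num <- as.numeric(toy$default == "yes")

# 80/20 train/test split with an explicit seed for replicability.
set.seed(2025)
train_idx <- sample(seq_len(n), size = floor(0.8 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Standardize predictors with the training means and SDs so that no
# single variable dominates the Euclidean distance.
x_vars  <- c("age", "balance")
mu      <- colMeans(train[, x_vars])
sdv     <- apply(train[, x_vars], 2, sd)
x_train <- scale(train[, x_vars], center = mu, scale = sdv)
x_test  <- scale(test[, x_vars],  center = mu, scale = sdv)

# Hand-rolled kNN regression (hypothetical helper): for each new row,
# average the target over the k closest training rows.
knn_predict <- function(x_tr, y_tr, x_new, k) {
  apply(x_new, 1, function(row) {
    d <- sqrt(colSums((t(x_tr) - row)^2))
    mean(y_tr[order(d)[seq_len(k)]])
  })
}

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# 50 values of k between 2 and 200, fit in a loop rather than by hand.
k_grid <- seq(2, 200, by = 4)
results <- data.frame(
  k = k_grid,
  train_rmse = sapply(k_grid, function(k) {
    rmse(train$default_num,
         knn_predict(x_train, train$default_num, x_train, k))
  }),
  test_rmse = sapply(k_grid, function(k) {
    rmse(test$default_num,
         knn_predict(x_train, train$default_num, x_test, k))
  })
)

# Smaller k means a more flexible fit, so the x-axis is reversed to make
# complexity increase to the right.
plot(results$k, results$test_rmse, type = "l", col = "red",
     xlim = rev(range(results$k)),
     ylim = range(results$train_rmse, results$test_rmse),
     xlab = "k (decreasing to the right)", ylab = "RMSE")
lines(results$k, results$train_rmse, col = "blue")
legend("bottomleft", legend = c("Test", "Train"),
       col = c("red", "blue"), lty = 1)
```

Because the toy target is generated at random, the curves here carry no information about the real data; the sketch only shows the mechanics. With the actual bank data you would swap in your own predictors and whichever kNN function the course uses.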