Illustrating Classification
library(tidyverse)
Default = read_csv('https://ec242.netlify.app/data/UCI_credit.csv')
Today’s example will build on material from the previous lecture (earlier this week).
Predicting Defaults
Today, we will work with UC Irvine’s Credit Default Dataset. I have made a condensed version of it to work with.
About the data
I’ve imported the data from the UCI source. Columns PAY_0 through PAY_6 show the last 6 months of payment status, where -2 means “no payment required”, -1 and 0 mean paid on time, and 1 and above mean 1 month behind, 2 months behind, etc. Bill and payment amounts are labeled similarly, with 1 being the most recent month and 6 being six months ago.
In our first breakout:
1. Check to make sure the default column is a binary indicator for default. Check to make sure there are no surprise NA values (there are!). Then, construct any additional variables you might think useful (hint: the PAY_X variables don’t really have an intuitive numeric interpretation. Should we categorize some of them together?)
2. Explore the data as we did on Tuesday to get an idea of useful predictors.
3. Build a logistic model to predict default using any combination of variables and interactions in the data. For now, just use your best judgement for choosing the variables and interactions.
4. Use a Bayes Classifier cutoff of .50 to generate your classifier output.
Let’s look at how we did. What variables were most useful in explaining default?
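The steps above might be sketched as follows. The names `default` and `PAY_0` come from the data description; `BILL_AMT1` and `PAY_AMT1` are assumed names for the most recent bill and payment columns, and the three-level recoding of `PAY_0` is just one possible choice. Check everything against `names(Default)` before running.

```r
library(tidyverse)

Default <- read_csv("https://ec242.netlify.app/data/UCI_credit.csv")

# Check the outcome is binary and look for surprise NAs, then drop them.
table(Default$default)
colSums(is.na(Default))
Default <- Default %>% drop_na()

# One possible recoding of PAY_0: collapse the numeric codes into a factor,
# since the raw codes don't have an intuitive numeric interpretation.
Default <- Default %>%
  mutate(pay0_status = case_when(
    PAY_0 <= -2         ~ "no_payment_due",
    PAY_0 %in% c(-1, 0) ~ "on_time",
    TRUE                ~ "behind"
  ))

# Logistic model; the right-hand side here is a judgement call, not "the" model.
logit_mod <- glm(default ~ pay0_status + BILL_AMT1 + PAY_AMT1,
                 data = Default, family = binomial)
summary(logit_mod)

# Bayes classifier with a .50 cutoff, then a confusion matrix.
Default <- Default %>%
  mutate(prob = predict(logit_mod, type = "response"),
         pred = if_else(prob > 0.5, 1, 0))
table(predicted = Default$pred, actual = Default$default)
```

The `summary()` output is where to look for which variables were most useful: large, statistically significant coefficients on the recoded payment-status levels are a good first clue.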
In our second breakout, we will create a ROC curve manually. To do this:
1. Take your model from the first breakout, and using a loop (or sapply), step through a large number of possible cutoffs for classification ranging from 0 to 1.
2. For each cutoff, generate a confusion matrix with accuracy, sensitivity, and specificity.
3. Combine the cutoff with the sensitivity and specificity results and make a ROC plot. Use ggplot for your plot and map the color aesthetic to the cutoff value.
4. Calculate the AUC (the area under the curve). This is a little tricky but can be done with your data.
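The steps above can be sketched like this. To keep the sketch standalone it uses simulated probabilities; in the breakout, replace `prob` with `predict(your_model, type = "response")` and `actual` with your data's default column. It also uses purrr's `map_dfr` in place of the suggested sapply loop, which is one tidyverse way to stack the per-cutoff results into a data frame.

```r
library(tidyverse)

# Simulated stand-ins so the sketch runs on its own.
set.seed(242)
n      <- 1000
actual <- rbinom(n, 1, 0.3)
prob   <- plogis(rnorm(n, mean = ifelse(actual == 1, 1, -1)))

# Step through many cutoffs; for each, compute accuracy, sensitivity,
# and specificity from the implied confusion matrix.
cutoffs <- seq(0, 1, by = 0.01)
roc_tbl <- map_dfr(cutoffs, function(k) {
  pred <- as.numeric(prob > k)
  tibble(
    cutoff      = k,
    accuracy    = mean(pred == actual),
    sensitivity = sum(pred == 1 & actual == 1) / sum(actual == 1),
    specificity = sum(pred == 0 & actual == 0) / sum(actual == 0)
  )
})

# ROC curve: false positive rate (1 - specificity) against sensitivity,
# with the color aesthetic mapped to the cutoff.
ggplot(roc_tbl, aes(x = 1 - specificity, y = sensitivity, color = cutoff)) +
  geom_line() +
  geom_abline(slope = 1, intercept = 0, linetype = "dashed")

# AUC by the trapezoid rule over the ROC points, ordered by the
# false positive rate.
roc_ord <- roc_tbl %>% arrange(1 - specificity)
fpr <- 1 - roc_ord$specificity
tpr <- roc_ord$sensitivity
auc <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
auc
```

At a cutoff of 0 every observation is classified as a default (sensitivity 1, specificity 0), and at a cutoff of 1 none are, which is why the curve runs from one corner of the plot to the other.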