8: Model Building

This assignment is due on Monday, March 16th

All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .rmd file - just the properly-knitted PDF. All assignments must be properly rendered to PDF using LaTeX. Make sure you start your assignment sufficiently early such that you have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an RStudio instance on posit.cloud is always a feasible alternative. Remember, if you use any AI for coding, you must comment each line with your own interpretation of what that line of code does.

This week’s lab will extend last week’s lab. The introduction is a direct repeat.

Backstory and Set Up

You have been recently hired to Zillow’s Zestimate product team as a junior analyst. As a part of their regular hazing, they have given you access to a small subset of their historic sales data. Your job is to present some basic predictions for housing values in a small geographic area (Ames, IA) using this historical pricing.

First, let’s load the data.

Ames <- read.table('https://ec242.netlify.app/data/ames.csv',
                   header = TRUE,
                   sep = ',')
Building a Model
We’re now ready to start playing with a model. We will start by using the lm() function to fit a simple linear regression model, with SalePrice as the response and GrLivArea as the predictor.
Recall that the basic lm() syntax is lm(y∼x,data), where y is the response, x is the predictor, and data is the data set in which these two variables are kept. Let’s quickly run this with two variables:
lm.fit = lm(SalePrice ~ GrLivArea, data = Ames)

If we type lm.fit, some basic information about the model is output. For more detailed information, we use summary(lm.fit). This gives us p-values and standard errors for the coefficients, as well as the \(R^2\) statistic and \(F\)-statistic for the entire model.1
Utilizing these functions helps us see some interesting results. Note that we built (nearly) the simplest possible model:
\[\text{SalePrice} = \beta_0 + \beta_1*(\text{GrLivArea}) + \epsilon.\]
But even on its own, this model is instructive. It suggests that an increase in overall living area of 1 ft\(^2\) is correlated with an expected increase in sales price of $107. (Note that we cannot make causal claims!)
Saving the model as we did above is useful because we can explore other pieces of information it stores. Specifically, we can use the names() function in order to find out what else is stored in lm.fit. Although we can extract these quantities by name—e.g. lm.fit$coefficients—it is safer to use the extractor functions like coef(lm.fit) to access them. We can also use a handy tool like plot() applied directly to lm.fit to see some interesting data that is automatically stored by the model.
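As a quick illustration of those extractor functions, here is a minimal sketch. It uses a tiny made-up data frame standing in for the real Ames data, so it runs on its own:

```r
# Hypothetical toy data standing in for Ames, so this sketch is self-contained
toy <- data.frame(SalePrice = c(200, 250, 310, 405),
                  GrLivArea = c(1400, 1600, 2000, 2600))
fit <- lm(SalePrice ~ GrLivArea, data = toy)

names(fit)               # everything stored inside the lm object
coef(fit)                # preferred over fit$coefficients
summary(fit)$r.squared   # summary() stores even more, e.g. the R^2
```

With the real data you would run the same extractors on lm.fit.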
Try it: Use plot() to explore the model above (it will make a sequence of plots; don’t put it in a code chunk, just use it for your own exploration). Do you suspect that some outliers have a large influence on the data? We will explore this point specifically in the future.
We can now go crazy adding variables to our model. It’s as simple as appending them to the previous code—though you should be careful executing this, as it will overwrite your previous output:
lm.fit = lm(SalePrice ~ GrLivArea + LotArea, data = Ames)

Try it: Does controlling for LotArea change the qualitative conclusions from the previous regression? What about the quantitative results? Does the direction of the change in the quantitative results make sense to you?
- Use the lm() function in a simple linear regression (e.g., with only one predictor) with SalePrice as the response to determine the value of a garage. Interpret the coefficient in 1-2 sentences.
Data Cleaning
Do not skip this section. This isn’t your kitchen junk drawer – you can’t get away with not cleaning your data.
Oh, the Ames data yet again. It’s given us lots of trouble. Many of you have found a few variables (columns) that should be avoided. The main problem is that some columns are mostly NA, or have only one value in them, or they have only NA and one value, so once lm(...) drops the NA rows, they are left with only one value. Linear regression by OLS does not like variables that don’t vary, and it doesn’t know how to handle NAs (remember, regression coefficients are functions of sums and averages, and NA infects all those things)! So, let’s be systematic about figuring out which columns in our data are to be avoided.
The function na.omit takes a data.frame and drops all rows in which there is one or more NA. So if you have a lot of variables and they have lots of NAs, you may lose a lot of rows. The first thing lm() will do is run na.omit, so let’s try it for ourselves:
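To see how aggressive na.omit can be, here is a minimal sketch on an invented data frame (with the real data you would substitute Ames):

```r
# Toy data frame with scattered NAs (hypothetical, stands in for Ames)
df <- data.frame(a = c(1, NA, 3, 4),
                 b = c(10, 20, NA, 40))
nrow(df)            # 4 rows before dropping
nrow(na.omit(df))   # 2 rows survive: any row containing an NA is dropped
```

Note that the NAs sit in different rows, so two missing values cost us two rows.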
- Calculate the number of rows left in Ames after you use na.omit() to drop any row that has one or more NAs in it. Don’t overwrite your original Ames data – just get the number of rows of the output. Are NAs a problem here? Take a look at the data; does it look like there’s one variable with a lot of NAs, or are the NAs scattered throughout the data?

- Just for fun, use the lm() function to perform a multiple linear regression with SalePrice as the response and all other variables from your original Ames data as the predictors. You can do this easily with the formula SalePrice ~ ., which tells lm to use all of the data’s columns (except SalePrice) on the right-hand side. What was the output?
skimr
The skimr package is very helpful for seeing what our data contains. Install it, and then use skim(Ames) directly in the console (we’re just looking at data – do not put skim output into your RMarkdown output - it will give you an error). Take a look at the “complete rate” column - this tells us the fraction of observations in that column that are not NA. If it’s very small, then that variable is mostly missing and will be problematic. The “n_unique” column tells us if there are few or many different values - a “1” in “n_unique” is definitely going to be a problem as well, and you must drop that variable.
You can make a note of those columns that have extremely low “complete rates” or only 1 unique value and drop them to start off. Of course, we could keep them, and drop all observations where any of those columns are NA, but we did that in exercise 2 and it decimated our data!
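If you want to double-check skimr’s complete rate by hand, one way is to count the missing values per column yourself. A sketch on toy data (the column names and values are invented):

```r
# Hypothetical toy data; colMeans(is.na(.)) is the fraction missing per column,
# so 1 minus that is the "complete rate" skimr reports
df <- data.frame(PoolQC  = c(NA, NA, NA, 'Gd'),
                 LotArea = c(9000, 8500, 11000, 7600))
complete_rate <- 1 - colMeans(is.na(df))
complete_rate   # PoolQC is only 25% complete -> a candidate for dropCols
```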
Using skimr, make a character vector of column names that should be dropped. You can get started with this, but you’ll need to add more:
dropCols = c('PoolQC','MiscFeature')

Then, we can use dplyr::select() with the tidyselect function contains(dropCols), which finds all the column names that contain any of the strings in dropCols (and the - drops them):

AmesClean = Ames %>%
  dplyr::select(-contains(dropCols))

Do this, then check the number of rows that will be left for the regression using na.omit. Get to a reasonable number of rows, then save AmesClean and use that from here on out.
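Putting those two steps together, here is a self-contained sketch on a toy data frame (the columns and values are invented; with the real data you would use Ames and your full dropCols vector):

```r
library(dplyr)  # for select() and the %>% pipe

# Hypothetical mini data set: one mostly-NA column, two usable ones
df <- data.frame(SalePrice = c(100, 200, 300),
                 PoolQC    = c(NA, NA, 'Gd'),
                 LotArea   = c(9000, 8500, 11000))
dropCols <- c('PoolQC')
clean <- df %>% dplyr::select(-contains(dropCols))
nrow(na.omit(clean))  # all 3 rows survive once the mostly-NA column is gone
```

Dropping the bad column first is what saves the rows: na.omit on the original df would keep only 1 row.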
- Now that your data is clean, use lm() and the formula SalePrice ~ . to regress SalePrice on all of the explanatory variables in the data. Remember, ~ . tells R to use every variable that isn’t the left-hand-side variable. Then, use the summary() or tidy() function on your lm object to print the results. Comment on the output. For instance:
  - Is there a relationship between the predictors and the response?
  - Which predictors appear to have a statistically significant relationship to the response? (Hint: look for stars)
  - What does the coefficient for the year variable suggest?
- There are a few NAs in the output from the regression in Question 2. These aren’t because of NAs in your data. You can use tidy to save the output in a familiar data.frame-style “tibble”, and then explore it to see what variables are coming up NA. Remember what R did when we tried to give it dummy variables representing all three possible values of a factor variable (see parameterization in this week’s lecture). Keeping that in mind, scroll to the first NA in your regression output and see if you can explain why it might be NA. Please remember that we can use functions like View() to explore data, but we never put View() in a code chunk.

It’s rarely a good idea to throw all the variables into a regression. We want to be smarter about building our model. We’ll use fewer variables, but include interactions. As we saw this week, the : symbol allows you to create an interaction term between two variables.

- Use the : symbol to fit a linear regression model with one well-chosen interaction effect plus 3-4 of the other variables of your choice. Why did you select the variables you did, and what was the result?

- Try a few (e.g., two) different transformations of the variables, such as \(\ln(x)\), \(x^2\), \(\sqrt x\). Do any of these make sense to include in a model of SalePrice? Comment on your findings.
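For reference, here is a hedged sketch of the formula syntax for interactions and transformations. The data is simulated and the variable choices are purely illustrative, not the “right” answer for your model:

```r
# Hypothetical simulated data standing in for Ames
set.seed(1)
toy <- data.frame(SalePrice   = exp(rnorm(50, 12, 0.3)),
                  GrLivArea   = runif(50, 800, 3000),
                  LotArea     = runif(50, 5000, 20000),
                  OverallQual = sample(1:10, 50, replace = TRUE))

# ':' builds an interaction term between two variables
fit1 <- lm(SalePrice ~ GrLivArea + OverallQual + GrLivArea:OverallQual, data = toy)

# I() protects arithmetic like x^2 inside a formula; log() and sqrt() work directly
fit2 <- lm(log(SalePrice) ~ GrLivArea + I(GrLivArea^2) + sqrt(LotArea), data = toy)

coef(fit1)
coef(fit2)
```

A common shortcut: x1*x2 in a formula expands to x1 + x2 + x1:x2, so fit1 could also be written with GrLivArea*OverallQual.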
Footnotes
When we use the simple regression model with a single input, the \(F\)-stat includes the intercept term. Otherwise, it does not. See Lecture 5 for more detail.↩︎