6: Correlations and Simple Linear Models

Due Date

This assignment is due on Monday, February 24th

All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .rmd file - just the properly-knitted PDF. All assignments must be properly rendered to PDF using Latex. Make sure you start your assignment sufficiently early such that you have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an Rstudio instance on posit.cloud is always a feasible alternative.

Backstory and Set Up

You have been recently hired to Zillow’s Zestimate product team as a junior analyst. As a part of their regular hazing, they have given you access to a small subset of their historic sales data. Your job is to present some basic predictions for housing values in a small geographic area (Ames, IA) using this historical pricing.

First, let’s load the data.

ameslist  <- read.csv('https://ec242.netlify.app/data/ames.csv',
                      stringsAsFactors = FALSE)

Before we proceed, let’s note one thing about the (simple) code above. We specify an argument to read.csv called stringsAsFactors. By default, read.csv (the base CSV reading function, which is different from read_csv, the tidyverse CSV function) turns anything that is a character vector into a factor variable. That’s great if you’re importing things like state abbreviations. That’s not helpful if you’re importing character strings that don’t have any repitition (e.g. names), or character strings that really should be numeric. We’ll handle our character strings our selves, thankyouverymuch.

Data Exploration and Processing

We are not going to tell you anything about this data. This is intended to replicate a real-world experience that you will all encounter in the (possibly near) future: someone hands you data and you’re expected to make sense of it. Fortunately for us, this data is (somewhat) self-contained. We’ll first check the variable names to try to divine some information. Recall, we have a handy little function for that:

names(ameslist)

Note that, when doing data exploration, we will sometimes choose to not save our output. This is a judgement call; here we’ve chosen to merely inspect the variables rather than diving in.

Inspection yields some obvious truths. For example:

Variable	Explanation	Type
`ID`	Unique identifier for each row	`int`
`LotArea`	Size of lot (units unknown)	`int`
`SalePrice`	Sale price of house ($)	`int`

…but we face some not-so-obvious things as well. For example:

Variable	Explanation	Type
`LotShape`	? Something about the lot	`chr`
`MSSubClass`	? No clue at all	`int`
`Condition1`	? Seems like street info	`chr`

It will be difficult to learn anything about the data that is of type int without outside documentation unless it refers to a count of something (e.g. bedrooms). However, we can learn something more about the chr-type variables. In order to understand these a little better, we need to review some of the values that each take on. We can use unique() to see the unique values it takes. Sometimes, it helps to see how often some value comes up if we’re trying to understand a variable’s meaning. One handy way of learning this is to use table() (which we’ve seen before) to get a count of the different values a variable can take. This dataset will have some pernicious NAs in it, so when you use table, add useNA = 'always' as an argument to ensure that we see counts of all values.

Try it: Go through the variables in the dataset and make a note about your interpretation for each. Many will be obvious, but some require additional thought.

Although there are some variables that would be difficult to clean, there are a few that we can address with relative ease. Consider, for instance, the variable GarageType. This might not be that important, but, remember, the weather in Ames, IA is pretty crummy—a detached garage might be a dealbreaker for some would-be homebuyers. Let’s inspect the values:

> unique(ameslist$GarageType)
[1] Attchd  Detchd  BuiltIn CarPort <NA> Basment 2Types

With this, we could make an informed decision and create a new variable. Let’s create OutdoorGarage to indicate, say, homes that have any type of garage that requires the homeowner to walk outdoors after parking their car. (For those who aren’t familiar with different garage types, a car port is not insulated and is therefore considered outdoors. A detached garage presumably requires that the person walks outside after parking. The three other types are inside the main structure, and 2Types we can assume includes at least one attached garage of some sort).

EXERCISE 1 of 5

Use case_when to add a OutdoorGarage column to ameslist that takes the value of 1 when the house has an outdoor garage, and 0 otherwise. Make sure this is a numeric variable (type int). It’s up to you to take a stand on what to do with NA values. Are those outdoor? Indoors? Do we drop all NA values? You often have to decide what to do in cases like this using your knowledge of the context. Document your reasoning for how you handle NA values.

Generally speaking, this is a persistent issue, and you will spend an extraordinary amount of time dealing with missing data or data that does not encode a variable exactly as you want it. This is expecially true if you deal with real-world data: you will need to learn how to handle NAs. There are a number of fixes (as always, Google is your friend) and anything that works is good. But you should spend some time thinking about this and learning at least one approach.

Our goal now is to learn something about correlates between home sale price and the rest of the data. Along the way, you may want to create more variables like OutdoorGarage – for now, make sure those variables are represented as 0 and 1. We’ll work with factor variables more later.

EXERCISES 2-5

Prune the data to 6-8 of the variables that are type = int about which you have some reasonable intuition for what they mean. Choose those that you believe are likely to be correlated with SalePrice. This must include the variable SalePrice and GrLivArea. Call this new dataset Ames. Produce documentation for this object in the form of a Markdown table or see further documentation here. This must describe each of the 6-8 preserved variables, the values it can take (e.g., can it be negative?) and your definition of the variable. Counting the variable name, this means your table should have three columns. Markdown tables are entered in the text body, not code chunks, of your .rmd, so your code creating Ames will be in a code chunk, and your table will be right after it.
Produce a scatterplot matrix of the chosen variables¹
Compute a matrix of correlations between these variables using the function cor(). Do the correlations match your prior beliefs? Briefly discuss the correlation between the chosen variables and SalePrice and any correlations between these variables.
Produce a scatterplot between SalePrice and GrLivArea. Run a linear model using lm() to explore the relationship. Finally, use the geom_abline() function to plot the relationship that you’ve found in the simple linear regression. You’ll need to extract the intercept and slope from your lm object. See coef(...) for information on this.²
- What is the largest outlier that is above the regression line? Produce the other information about this house.

(Bonus) Create a visualization that shows the rise of air conditioning over time in homes in Ames.

Footnotes

If you are not familiar with this type of visualization, consult the book (Introduction to Statistical Learning), Chapters 2 and 3.↩︎
We could also use geom_smooth(method = 'lm') to add the regression line, but it’s good practice to work with lm objects.↩︎