6: Correlations and Simple Linear Models
This assignment is due on Monday, February 24th
All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .Rmd file - just the properly-knitted PDF. All assignments must be properly rendered to PDF using LaTeX. Make sure you start your assignment sufficiently early such that you have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an RStudio instance on posit.cloud is always a feasible alternative.
Backstory and Set Up
You have been recently hired to Zillow’s Zestimate product team as a junior analyst. As a part of their regular hazing, they have given you access to a small subset of their historic sales data. Your job is to present some basic predictions for housing values in a small geographic area (Ames, IA) using this historical pricing.
First, let’s load the data.
ameslist <- read.csv('https://ec242.netlify.app/data/ames.csv',
                     stringsAsFactors = FALSE)
Before we proceed, let’s note one thing about the (simple) code above. We specify an argument to `read.csv` called `stringsAsFactors`. In versions of R before 4.0.0, `read.csv` (the base CSV reading function, which is different from `read_csv`, the `tidyverse` CSV function) by default turns anything that is a character vector into a factor variable; since R 4.0.0 the default is `FALSE`, and setting it explicitly keeps the behavior the same on any version. Converting to factors is great if you’re importing things like state abbreviations. It’s not helpful if you’re importing character strings that don’t have any repetition (e.g. names), or character strings that really should be numeric. We’ll handle our character strings ourselves, thankyouverymuch.
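To make the effect of `stringsAsFactors` concrete, here is a minimal sketch on a tiny, made-up CSV (the column names and values are hypothetical and are not from the Ames file). It reads the same text twice and compares the resulting column types.

```r
# Toy CSV text (hypothetical data, not the Ames file)
csv_text <- "state,name,price
IA,Alice,100
IA,Bob,200
MI,Carol,300"

# Character columns become factors when stringsAsFactors = TRUE
as_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)

# Character columns stay character when stringsAsFactors = FALSE
as_strings <- read.csv(text = csv_text, stringsAsFactors = FALSE)

class(as_factors$state)  # factor
class(as_strings$state)  # character
class(as_strings$price)  # integer: numeric columns are unaffected either way
```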
Data Exploration and Processing
We are not going to tell you anything about this data. This is intended to replicate a real-world experience that you will all encounter in the (possibly near) future: someone hands you data and you’re expected to make sense of it. Fortunately for us, this data is (somewhat) self-contained. We’ll first check the variable names to try to divine some information. Recall, we have a handy little function for that:
names(ameslist)
Note that, when doing data exploration, we will sometimes choose to not save our output. This is a judgement call; here we’ve chosen to merely inspect the variables rather than diving in.
Inspection yields some obvious truths. For example:
Variable | Explanation | Type |
---|---|---|
ID | Unique identifier for each row | int |
LotArea | Size of lot (units unknown) | int |
SalePrice | Sale price of house ($) | int |
…but we face some not-so-obvious things as well. For example:
Variable | Explanation | Type |
---|---|---|
LotShape | ? Something about the lot | chr |
MSSubClass | ? No clue at all | int |
Condition1 | ? Seems like street info | chr |
It will be difficult to learn anything about the data that is of type `int` without outside documentation unless it refers to a count of something (e.g. bedrooms). However, we can learn something more about the `chr`-type variables. In order to understand these a little better, we need to review some of the values that each takes on. We can use `unique()` to see the unique values a variable takes. Sometimes, it helps to see how often some value comes up if we’re trying to understand a variable’s meaning. One handy way of learning this is to use `table()` (which we’ve seen before) to get a count of the different values a variable can take. This dataset will have some pernicious `NA`s in it, so when you use `table`, add `useNA = 'always'` as an argument to ensure that we see counts of all values.
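As a quick illustration of why `useNA = 'always'` matters, the sketch below uses a short hand-typed vector (a stand-in, not a column read from the dataset) and compares the counts with and without the argument.

```r
# Hand-typed stand-in vector with two missing values
garage <- c("Attchd", "Detchd", NA, "Attchd", "CarPort", NA, "Attchd")

# Distinct values, including NA
unique(garage)

# By default, table() silently drops NA from the counts
counts_default <- table(garage)

# With useNA = "always", NA gets its own cell
counts <- table(garage, useNA = "always")
counts
```

The default table accounts for only five of the seven entries; the second accounts for all seven.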
Try it: Go through the variables in the dataset and make a note about your interpretation for each. Many will be obvious, but some require additional thought.
Although there are some variables that would be difficult to clean, there are a few that we can address with relative ease. Consider, for instance, the variable `GarageType`. This might not be that important, but, remember, the weather in Ames, IA is pretty crummy: a detached garage might be a dealbreaker for some would-be homebuyers. Let’s inspect the values:
> unique(ameslist$GarageType)
[1] Attchd Detchd BuiltIn CarPort <NA> Basment 2Types
With this, we could make an informed decision and create a new variable. Let’s create `OutdoorGarage` to indicate, say, homes that have any type of garage that requires the homeowner to walk outdoors after parking their car. (For those who aren’t familiar with different garage types, a car port is not insulated and is therefore considered outdoors. A detached garage presumably requires that the person walks outside after parking. The three other types are inside the main structure, and `2Types` we can assume includes at least one attached garage of some sort).
- Use `case_when` to add an `OutdoorGarage` column to `ameslist` that takes the value of `1` when the house has an outdoor garage, and `0` otherwise. Make sure this is a numeric variable (type `int`). It’s up to you to take a stand on what to do with `NA` values. Are those outdoor? Indoors? Do we drop all `NA` values? You often have to decide what to do in cases like this using your knowledge of the context. Document your reasoning for how you handle `NA` values.
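For reference, here is a minimal sketch of the `case_when` pattern on a made-up column (`porch` and `OpenPorch` are hypothetical names, not variables in the dataset). It assumes `dplyr` is installed, and it shows just one possible treatment of `NA`: keeping it missing.

```r
library(dplyr)

# Hypothetical toy data; `porch` is not a column in the Ames file
toy <- data.frame(porch = c("Open", "Enclosed", NA, "Screened", "Open"),
                  stringsAsFactors = FALSE)

toy <- toy %>%
  mutate(OpenPorch = case_when(
    is.na(porch)    ~ NA_integer_,  # one possible choice: keep missing as missing
    porch == "Open" ~ 1L,           # 1L (not 1) makes the result an integer
    TRUE            ~ 0L            # every remaining value
  ))

toy$OpenPorch
class(toy$OpenPorch)
```

Conditions are checked in order, so the `is.na()` line has to come first: otherwise `porch == "Open"` evaluates to `NA` for missing rows, which `case_when` treats as not matched, and those rows fall through to the final `0L`.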
Generally speaking, this is a persistent issue, and you will spend an extraordinary amount of time dealing with missing data or data that does not encode a variable exactly as you want it. This is especially true if you deal with real-world data: you will need to learn how to handle `NA`s. There are a number of fixes (as always, Google is your friend) and anything that works is good. But you should spend some time thinking about this and learning at least one approach.
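A few common base-R approaches to missing values can be sketched as follows. The data frame is made up for illustration, and recoding `NA` as "None" is an assumption about what missingness means that would need to be justified for any real variable.

```r
# Hypothetical toy data with missing values
toy <- data.frame(
  area   = c(1200L, NA, 1500L, 900L),
  garage = c("Attchd", NA, "Detchd", NA),
  stringsAsFactors = FALSE
)

# 1. Count missing values per column
na_counts <- colSums(is.na(toy))

# 2. Keep only rows with no missing values in any column
complete <- toy[complete.cases(toy), ]

# 3. Recode NA as its own category (an assumption about what NA means)
filled <- toy
filled$garage[is.na(filled$garage)] <- "None"
```

Dropping rows is the simplest option but throws away information; recoding keeps every row but builds your interpretation of the missing values into the data.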
Our goal now is to learn something about correlations between home sale price and the rest of the data. Along the way, you may want to create more variables like `OutdoorGarage`; for now, make sure those variables are represented as `0` and `1`. We’ll work with factor variables more later.
- Prune the data to 6-8 of the variables that are `type = int` and about which you have some reasonable intuition for what they mean. Choose those that you believe are likely to be correlated with `SalePrice`. This must include the variables `SalePrice` and `GrLivArea`. Call this new dataset `Ames`. Produce documentation for this object in the form of a Markdown table or see further documentation here. This must describe each of the 6-8 preserved variables, the values it can take (e.g., can it be negative?) and your definition of the variable. Counting the variable name, this means your table should have three columns. Markdown tables are entered in the text body, not code chunks, of your .Rmd, so your code creating `Ames` will be in a code chunk, and your table will be right after it.
- Produce a scatterplot matrix of the chosen variables.[1]
- Compute a matrix of correlations between these variables using the function `cor()`. Do the correlations match your prior beliefs? Briefly discuss the correlation between the chosen variables and `SalePrice` and any correlations between these variables.
- Produce a scatterplot between `SalePrice` and `GrLivArea`. Run a linear model using `lm()` to explore the relationship. Finally, use the `geom_abline()` function to plot the relationship that you’ve found in the simple linear regression. You’ll need to extract the intercept and slope from your `lm` object. See `coef(...)` for information on this.[2]
- What is the largest outlier that is above the regression line? Produce the other information about this house.
- (Bonus) Create a visualization that shows the rise of air conditioning over time in homes in Ames.
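The mechanics of the correlation and regression steps above can be sketched on simulated data. Everything in this block is a stand-in: `toy`, `area`, `price`, and `rooms` are hypothetical names, the numbers are randomly generated rather than taken from the dataset, and `ggplot2` is assumed to be installed.

```r
library(ggplot2)

# Simulated toy data (hypothetical; not the Ames data)
set.seed(1)
toy <- data.frame(area = runif(50, 500, 3000))
toy$price <- 20000 + 100 * toy$area + rnorm(50, sd = 20000)
toy$rooms <- round(toy$area / 300) + sample(0:2, 50, replace = TRUE)

# Scatterplot matrix of every pair of columns (base graphics)
pairs(toy)

# Correlation matrix; with missing data, cor() returns NA unless you
# pass e.g. use = "complete.obs" to drop incomplete rows
cor_mat <- cor(toy)

# Simple linear regression of price on area
fit <- lm(price ~ area, data = toy)
b <- coef(fit)  # named vector with elements "(Intercept)" and "area"

# Scatterplot with the fitted line drawn from the extracted coefficients
p <- ggplot(toy, aes(x = area, y = price)) +
  geom_point() +
  geom_abline(intercept = b[["(Intercept)"]], slope = b[["area"]])
print(p)

# The largest positive residual is the point farthest above the line
top <- which.max(residuals(fit))
toy[top, ]
```

Indexing the data frame by the row found with `which.max()` returns every column for that observation, which is one way to pull up the rest of the information about an outlying point.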