Final project
This project is due Thursday, May 1st, at 11:59 PM. This is the absolute last day I can accept projects; no exceptions will be given.
As you may have guessed given Project 1 and Project 2, your final project will be a chance for you to work on an analysis pertaining to your own (group’s) interest, and with data you find, clean, and analyze.
This assignment requires three steps, which I outline here and detail below. First, choose a topic, likely one that you have been writing about in the second parts of Project 1 and Project 2, and assemble and clean a rich dataset on the topic from multiple sources. Second, present an analysis, primarily using visuals, that highlights a pattern in the data not easily expressed in a closed form, in the spirit of Project 1. This section will focus on a specific outcome (or two) that is of particular interest. Third, design and estimate, using cross-validation, a predictive model of the outcome of interest that performs well on an out-of-sample slice of your data.
Your final product will take the form of a short memo in two versions – one with all code turned off, and one with all code turned on. Details for each section and a rubric for scoring are below.
Based on Project 2's feedback, I have added a few clarifications to the assignment. Since you are already working on the Final Project, I will summarize everything I've added here. Nothing has changed on the grading rubric; these are merely to help you make a good, interesting project.
Section 1: Data
Still having trouble finding data?
One fun source is the Census Public Use Microdata Sample (PUMS), which contains the individual-level survey responses to ACS questions. "Individual-level" means we see the individual answers of every person in the ACS. PUMS is publicly available and can be downloaded from https://data.census.gov/mdat. Exploring the data, you can see that it contains each respondent's answers to questions like "Age" and "Number of cars" and all sorts of stuff that then gets tallied into block/tract/county-level ACS estimates.
Of course, there is a catch: it only reports the person's state, census region, and "PUMA," or "Public Use Microdata Area." PUMAs are much larger than census tracts, but can be pretty useful. To give you an idea of scale, there are about 3 PUMAs in the greater Lansing area.
You can download the data through the link above. When doing so, on the "Download" tab, make sure you select the "extract raw data (.csv)" option; otherwise it will try to summarize the data to a single observation per PUMA, and you (likely) want the individual responses.
Once downloaded, `tigris::pumas()` will download the PUMA polygons just like it did `counties` and `tracts`. Use `?pumas` to learn more.
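A minimal sketch (the state and year are arbitrary example choices):

```r
library(tigris)
library(ggplot2)

# Download PUMA polygons for one state and plot them, just as you did
# with counties() and tracts().
mi_pumas <- pumas(state = "MI", year = 2019)

ggplot(mi_pumas) +
  geom_sf() +
  labs(title = "Michigan PUMAs")
```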
Using `tidycensus`: Another way to access PUMS data is through the `tidycensus` package. This is a new feature for `tidycensus`, so you'll have to learn more here: https://walker-data.com/tidycensus/articles/pums-data.html. You'll probably still want to use the census website above to make your list of variables of interest.
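A sketch of what that call looks like; the variable codes (`AGEP` = age, `VEH` = vehicles available), state, and year are example choices, and you'll need a Census API key:

```r
library(tidycensus)

# Person-level PUMS records. Requires a Census API key
# (see ?census_api_key); swap in the variables from your own list.
mi_pums <- get_pums(
  variables = c("AGEP", "VEH"),
  state     = "MI",
  survey    = "acs5",
  year      = 2019
)
```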
One last note: each individual is located within a household of size 1 or more. The household is identified by `SERIALNO` and the individual people are numbered starting with 1 in `SPORDER`, so for household measures you group on `SERIALNO` and use `summarize` to calculate the family variables.
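Concretely, a minimal sketch of that roll-up, assuming your person-level data frame is called `pums_df` and `PINCP` (person income) is one of your variables:

```r
library(dplyr)

# Roll person-level PUMS rows up to one row per household.
hh_df <- pums_df %>%
  group_by(SERIALNO) %>%
  summarize(
    hh_size   = max(SPORDER),          # people are numbered 1, 2, ... in SPORDER
    hh_income = sum(PINCP, na.rm = TRUE)
  )
```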
Section 3: Predictive Model
For predictive models, remember that the complexity parameter (e.g. lambda for LASSO) reduces the complexity of the model. Thus, you have to start with a very complex model (e.g. lots of interactions) when you write the formula you will use, then let the method find the right amount to reduce the complexity such that test RMSE is minimized. In Project 2, you had hundreds of predictors (explanatory variables), so even if you started with a model like `TOTALBTU ~ .`, you'd still have a decent amount of complexity. But here, if you have many predictors, but not hundreds, then starting with a model like `Y ~ .` won't be very complex. It's even possible that, in this case, your test RMSE is minimized at the most complex point (e.g. lambda = 0 for LASSO, which is just linear regression).
Make sure you start with a complex specification. If you have panel data (multiple observations of the same unit), make sure you interact the unit with your explanatory variables.
Make sure you use a range of lambdas such that you find the bottom of the "U" shape for the test RMSE, and when plotting, make sure we can see where this occurs. In your plot, make sure that complexity is increasing in the x-axis (for LASSO, you have to reverse the axis, as complexity increases as lambda decreases).
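One quick way to check whether your starting formula is actually complex is to count the columns of the design matrix it implies. A sketch, where `df`, `Y`, and `unit` are placeholder names:

```r
# Count the predictors each candidate formula actually generates.
ncol(model.matrix(Y ~ ., data = df))        # main effects only
ncol(model.matrix(Y ~ .^2, data = df))      # adds all pairwise interactions
ncol(model.matrix(Y ~ . * unit, data = df)) # adds unit interactions (panel data)
```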
The Memo
Begin your memo with a brief (1 paragraph max) discussion of your topic of interest and the outcome you are interested in understanding and predicting. Tell the reader what sort of questions or analysis you are going to perform, and very briefly (1-2 sentences max) discuss what your analysis finds.
Section 1 of 3: Data (40 points)
Find and assemble a dataset on your topic of interest using existing data.1 You must incorporate at least two datasets, properly merging them, though more than two is encouraged. This will require careful examination of data types and key(s), as well as attention to handling missing data. Your topic and questions may be shaped in part by what you can find – there are many sources of publicly-available data, and wrestling with them is part of being a good data analyst. To aid in your 3rd section, try to find and use data with many observations.
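For instance, a merge along the lines of the spending/test-scores example used later in this document might look like this sketch; all data frame and column names here are hypothetical:

```r
library(dplyr)

# Hypothetical example: link state spending to county-level test scores.
# Check key types before joining (e.g. FIPS codes stored as character
# with leading zeros, not as numbers).
merged <- scores_df %>%
  left_join(spending_df, by = c("county_fips", "year"))

# Always audit the result: did rows multiply or fail to match?
nrow(scores_df); nrow(merged)
sum(is.na(merged$spending_per_pupil))  # unmatched rows show up as NA
```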
Once merged, clean your data such that it is usable in your project. Address missing data appropriately. Avoid dropping observations whenever possible. If your data includes extensive categorical variables, consider re-grouping them into relevant, coarser aggregations (e.g. all the `building_type`s in the Rats data from Project 1 could have been grouped into 4-8 more general categories for most analytical uses, though how fine/coarse your grouping should be depends on your analytical questions).
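One convenient tool for this is `forcats::fct_collapse()`. A minimal sketch, with invented category levels:

```r
library(dplyr)
library(forcats)

# Collapse detailed categories into coarser analytical groups.
# The level names here are invented; use the levels in your own data.
df <- df %>%
  mutate(building_group = fct_collapse(building_type,
    Residential = c("1-2 Family Dwelling", "3+ Family Apt."),
    Commercial  = c("Office Building", "Store"),
    Public      = c("School", "Hospital"),
    other_level = "Other"
  ))
```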
In your memo, make a Data section after your intro paragraph and, in 2-3 paragraphs max, cite the source datasets, and describe how the data was merged. Do not refer to specific column names, but rather speak generally (“we merge on County FIPS and year to link state spending to county-level average test scores….”) so that the reader has an idea of how the data was assembled. In a similar manner, state how missing data was handled.
Briefly summarize the dataset including the number of observations and statistics about key variables (e.g. “the average test score was XX.XX, while the average per-pupil spending was $YYYY”).
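With the hypothetical names from the merge sketch above, pulling those figures is one `summarize()` call:

```r
library(dplyr)

# Headline numbers for the memo; report them in prose, not as raw output.
merged %>%
  summarize(
    n_obs         = n(),
    mean_score    = mean(test_score, na.rm = TRUE),
    mean_spending = mean(spending_per_pupil, na.rm = TRUE)
  )
```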
Scoring Rubric Section 1
Data Source (10 points): Do you have 2 or more data sources, and does each data source provide a relevant contribution to the topic?
Merging (10 points): Did you identify the proper key columns and correctly merge such that data is not distorted or unintentionally dropped? Does your work show technical mastery of the process?
Cleaning (15 points): Did you address missing data appropriately and with clear assumptions? Did you generate meaningful categories for data that contained too much detail? Does your work show technical mastery of the process? Was it efficiently coded?
Memo (5 points): Did you describe the necessary parts of the data construction, merging, cleaning process, and final dataset in a readable and concise manner?
Section 2 of 3: Analysis with Visuals (45 points)
With your dataset(s) cleaned and ready, explore the data. Focusing on 1-2 aspects that you find elucidate the outcome of interest, analyze the relationship between your outcome and the explanatory variables. For example, if you're interested in state-level educational spending and county-level test scores, you want to go beyond the simple relationship between the two: does the same relationship hold for wealthy counties vs. poor counties? Does it hold for all ages of students? Is it non-linear in some respect?
You will present this analysis primarily through visuals, as in Project 1. Your memo will guide the reader through the visuals, but should let the visuals do the "talking". As in Project 1, do not narrate the data exploration process ("we tried X, but it didn't show anything, so we tried Y"). The memo should communicate a clear flow of logic that starts with the question and ends with a clear pattern in the data not easily expressed in a closed form.
You should not need more than 3 visualizations for this (4 max) – you should not substitute many shallow explorations for one deep exploration. Follow the thread through the data and take the reader along for the ride.
I will be irrationally militant about the fine details in your plots. Your plots should be of publication quality, have perfect labeling and capitalization, have color-blind friendly colors, and should make a striking impact on first glance.
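As a baseline for that level of polish, here is a sketch using one of ggplot2's built-in color-blind friendly palettes; the data and aesthetic mappings are placeholders echoing the spending/test-scores example:

```r
library(ggplot2)

# Placeholder data and aesthetics; the point is the labeling and palette.
ggplot(merged, aes(x = spending_per_pupil, y = test_score, color = income_group)) +
  geom_point(alpha = 0.6) +
  scale_color_viridis_d() +   # color-blind friendly palette
  labs(
    title = "Test Scores Rise With Per-Pupil Spending",
    x     = "Per-Pupil Spending ($)",
    y     = "Average Test Score",
    color = "County Income Group"
  ) +
  theme_minimal()
```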
Scoring Rubric Section 2
Content (20 points): Does your analysis find an interesting pattern in the data not easily expressed in a closed form? Does the reader learn something not obvious about the question at hand? Does the analysis "follow a thread" and build in the progression through your 3-4 visualizations?
Technicals (15 points): Are the visualizations perfect in spelling, capitalization, labeling, and colors?
Memo (10 points): Is the writing succinct, clear, and written as a polished memo without extraneous commentary or narration?
Section 3 of 3: Predictive Model (40 points)
Your final section requires estimating a predictive model on your data using a method of your choice. Similar to Project 2, you will need to set up your dataset such that it is appropriate for use in your chosen predictive method. You must use LASSO, K-Nearest Neighbors, or a Regression Tree (or a random forest, if you wish to be adventurous). You will predict your outcome of interest, basing the model construction (to the extent there is an underlying model structure) on your knowledge of the data gleaned from Section 2.
Start with a very complex specification in your formula. Unless you have thousands of predictors (explanatory variables), starting with a simple multivariate regression with no interactions won't get you very far. Remember that your model tuning parameters (e.g. lambda) can decrease complexity, but cannot increase it beyond what you start out with. If you have panel data (e.g. multiple observations of the same unit), consider adding interactions between the unit of observation and the explanatory variables. Look back at our linear regression unit to understand the role of interactions.
Set aside 10% of your sample as an evaluation sample as in Project 2. In this project, you may use built-in cross-validation functions (e.g. `cv.glmnet()` for LASSO/ridge/elastic net), or may choose to cross-validate "manually" as we did in Project 2. If choosing the former, you must be able to extract and show the train and test RMSEs across all the values of the tuning parameters (lambda, K, minsplit, cp, etc.). Estimate the model, extract the relevant RMSEs, identify your optimal tuning parameters, and calculate the evaluation sample's final RMSE (a single number). Show a plot of both test and train RMSE over the range of values of the tuning parameter, ensuring that you show a reasonable range of the tuning parameter, that the test-RMSE-minimizing point is clear on the plot, and that the x-axis is increasing in complexity.
Finally, using your evaluation sample, calculate your model’s Evaluation RMSE (a single number) and report it with interpretation.
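To make the whole workflow concrete, here is a sketch using LASSO via `cv.glmnet()`. The data frame `df`, outcome `Y`, and formula are placeholders, and you would adapt the extraction step for KNN or trees:

```r
library(glmnet)
library(ggplot2)

set.seed(442)

# 1. Build the (complex) design matrix once, then split off 10% for evaluation.
X <- model.matrix(Y ~ .^2, data = df)[, -1]   # placeholder formula
y <- df$Y
eval_idx <- sample(nrow(X), size = round(0.10 * nrow(X)))
X_eval <- X[eval_idx, ];  y_eval <- y[eval_idx]
X_cv   <- X[-eval_idx, ]; y_cv   <- y[-eval_idx]

# 2. Cross-validate LASSO; cvm holds the mean cross-validated MSE per lambda.
cv_fit <- cv.glmnet(X_cv, y_cv, alpha = 1, nfolds = 10)

# 3. Test RMSE across lambdas, plus train RMSE from the underlying fit.
train_pred <- predict(cv_fit$glmnet.fit, newx = X_cv)  # one column per lambda
rmse_df <- data.frame(
  lambda     = cv_fit$lambda,
  test_rmse  = sqrt(cv_fit$cvm),
  train_rmse = sqrt(colMeans((y_cv - train_pred)^2))
)
best_lambda <- cv_fit$lambda.min

# 4. Plot both RMSEs with complexity increasing to the right
#    (complexity rises as lambda falls, hence the reversed axis).
ggplot(rmse_df, aes(x = log(lambda))) +
  geom_line(aes(y = test_rmse,  color = "Test")) +
  geom_line(aes(y = train_rmse, color = "Train")) +
  geom_vline(xintercept = log(best_lambda), linetype = "dashed") +
  scale_x_reverse() +
  labs(x = "log(lambda), complexity increasing to the right",
       y = "RMSE", color = NULL)

# 5. Evaluation RMSE: a single number, computed on the held-out 10%.
eval_pred <- predict(cv_fit, newx = X_eval, s = "lambda.min")
eval_rmse <- sqrt(mean((y_eval - eval_pred)^2))
```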
In this section, the memo writing will be quite brief. In one paragraph, discuss the method you employ and briefly describe the underlying model specification (what variables are included, what aren't, etc.). Show the train and test RMSEs in a properly-formatted, publication-ready plot, state the final test RMSE for the optimal tuning parameter values, and discuss and interpret the RMSE of your evaluation sample.
Scoring Rubric Section 3
Estimation (20 points): Did you properly estimate and tune your predictive model using cross-validation? Did you use a model specification (formula) appropriate for your method? Did you identify the test-RMSE minimizing values of the tuning parameter? Did you use an appropriate range of values for the tuning parameter?
Plot (10 points): Did you show the train and test RMSE on an appropriately-formatted plot? Does the plot cover the relevant range of the tuning parameter such that the canonical form of the test RMSE is clear? Does the plot show a line or point at the optimal value of the tuning parameter?
Memo (10 points): Does the memo clearly state the model chosen, the formula used, the methods used, and the results? Does it discuss the evaluation sample RMSE and interpret it appropriately?
Additional Instructions (10 points)
One person in the group will turn in the assignment to D2L no later than the time listed on the syllabus. I allow you the maximum amount of time to complete the project. Grades are due 48 hours after the due date, and thus I cannot extend anyone’s project for any reason.
You must turn in TWO PDFs, properly rendered as per our usual course requirements. ONE copy will have all code output turned off (`echo=F`). The easiest way to do this is to set the global `knitr` option in the first code chunk (and set that first code chunk's `echo` to the appropriate setting), knit to PDF, then turn the global `echo` setting on and knit again to a differently-named PDF file. You will turn in both PDFs, and I will use them both in grading.
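For example, a minimal first chunk might be:

```r
# First code chunk of your document. Give this chunk its own echo=FALSE
# chunk option in its header so this line stays hidden too.
knitr::opts_chunk$set(echo = FALSE)  # flip to TRUE, rename the output, re-knit
```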
Scoring Rubric
Turning in (8 points): Did you turn in two copies, one with zero code showing and one with full code? Are all group member names shown on the assignment?
Intro (2 points): Did you start with a proper intro paragraph as specified above?
Footnotes
Note that existing is taken to mean that you are not permitted to collect data by interacting with other people. That is not to say that you cannot gather data that previously has not been gathered into a single place—this sort of exercise is necessary. But you cannot stand with a clipboard outside a store and count visitors (for instance). That would require involving the Institutional Review Board (IRB) and nobody wants to do that.↩︎