Project 2
Note: this assignment was updated 11/26/2024. Some parts may be different from the assignment as it stood when due. All additions are clarifications.
Predicting Energy Consumption
The goal of this project is to predict household energy consumption from the Department of Energy's Residential Energy Consumption Survey (RECS) for 2020, using one of the predictive models you have learned: LASSO, elastic net, ridge, KNN, or trees. The RECS is a sample of around 18,000 households that asks myriad questions about housing characteristics (age of house, size of house), appliances (heat pump, gas furnace, EV), and total energy consumption (measured in BTUs so that electricity, gas, propane, etc. are all included). This is microdata, so each household is an observation in the data.
This survey exists to provide data to utilities (power companies, gas companies, etc.) and state governments for planning future energy consumption so that grids and utilities can be prepared for trends in energy use. Usually, this sort of analysis would be done in an interpretable way (modeling consumption with, say, an EV versus without tells you something about how aggregate energy use will look if more people have EVs). For this project, though, we just want a good and useful prediction of energy consumption. And so, we will bring our prediction model tools to bear on the data.
Your task is to develop, estimate, and evaluate a predictive model of the variable `TOTALBTU`. To compare across groups, we will hold out around 5% of the sample as the "evaluation" sample. The group with the lowest evaluation-sample RMSE wins fame and fortune beyond comprehension. Also, +1% of your total grade in extra credit.
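For reference, the RMSE on a sample of n households is sqrt(mean((TOTALBTU - prediction)^2)): the square root of the average squared prediction error, so lower is better.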
The RECS Data
Part of being a data analyst is learning your data. While our predictive models don't necessarily need us to know every variable's definition, we do need to use the data correctly. So part of your task will be to import, clean, process, and understand the data. Most categorical variables (like "Year Built") are held in the data as integers, but represent ranges of years. You'll need to make sure you're handling these correctly. There are also many variables that likely aren't going to help: for instance, "flags" probably shouldn't be included (they are variables that indicate when an observation has some values that are missing or imputed). Use the "variable and response codebook" (also known as the data dictionary) to understand the variables. Make sure you know the units you're working in as well.
Each observation has a `TOTALBTU` that you want to predict. There are also subcategories of `TOTALBTU` that you will need to drop right away: `TOTALBTUSPH`, `TOTALBTUWTH`, and `TOTALBTUOTH`. These are subsets of `TOTALBTU` representing space heating, water heating, and other uses, and they add up to `TOTALBTU`. Do not use them.
It will take you some time to digest the data, so you won't start predicting right out of the gate. Take your time to understand the data, and explore it as you work your way through the data dictionary. For many of our predictive models, you have to specify (possibly many, many, many) potential interactions. You'll need to have some idea of what might be an important interaction to do this. For instance, `HDD65` tells you how many cold days the household had. An interaction with `TOTSQFT` is probably useful to include.
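To make this concrete, here is a minimal sketch of how such an interaction might enter a model formula, assuming your cleaned data frame is called `recs` (that name, and the exact spelling of the square-footage column, are assumptions to check against the codebook):

```r
# Hypothetical sketch: an interaction between cold days and home size.
# TOTSQFT is written as it appears in this assignment; confirm the exact
# column name in the codebook before using it.
f <- TOTALBTU ~ HDD65 * TOTSQFT   # expands to both main effects plus HDD65:TOTSQFT
X <- model.matrix(f, data = recs) # numeric design matrix, e.g. for glmnet
```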
Data Details
On the microdata tab for RECS 2020 (https://www.eia.gov/consumption/residential/data/2020/index.php?view=microdata), down at the bottom, you'll find a link to the .csv file (ignore the SAS file), as well as an Excel "variable and response codebook". The codebook has five columns: `Variable` contains the name of the variable exactly as you'll find it in the .csv, and `Type` tells you if the variable is numeric, character, or logical. When it says "numeric" it doesn't mean the variable itself is a continuous quantity: note that `CELLAR` (row 17) is represented as numeric, but the numbers map to values of "1 Yes", "0 No", and "-2 Not applicable". Column 3 is the description of the data, and Column 4 contains all of the response codes, which tell you what each value means. Note that including something like `CELLAR` without making it a factor variable will use it in a bizarre and unhelpful way (as a continuous number). Column 5 tells you which section of the survey the data comes from.
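As an illustration, here is a minimal sketch of converting `CELLAR` into a labeled factor, assuming the data frame is called `recs` (the name is a placeholder); the codes come from the codebook row described above:

```r
# Recode CELLAR's numeric response codes into a labeled factor so that
# models treat it as categorical rather than as a continuous number
recs$CELLAR <- factor(recs$CELLAR,
                      levels = c(-2, 0, 1),
                      labels = c("Not applicable", "No", "Yes"))
str(recs$CELLAR)  # verify: Factor w/ 3 levels
```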
While not super-helpful for this application, the survey itself can be viewed on the microdata page. It can help in understanding the responses.
This is a new project, so there may be some unclear descriptions or missing details. I'll add questions and answers here as they come in.
- I added a short note on 11/12 regarding the structure of the table I want you to include. If you have more than 100 values for your tuning parameters (lambda, k, etc.), then shorten the range you use so that you have under 200 rows on that table. Make sure your test-RMSE-minimizing level of complexity / tuning parameter value is in the middle of the range.
The project
This project will take the form of a short memo-style documentation and evaluation of your predictive model prepared using R and RMarkdown as usual. It will have five sections, each of which should be separated with a proper header.
1. Data Processing (10 points): Clean and process the RECS 2020 data, starting by loading in the data .csv. Below, in the "Your data processing approach should be..." section, I've included some important steps you must take. For this section, use `echo=T` in your code chunk(s) so that your cleaning and processing code is shown. It's OK if it's long (though try to be an efficient coder). All cleaning and processing should be shown here. If you find yourself taking extra steps later on with the data, move those steps up into this section. No write-up or description is required here, just your code.

2. State your priors (6 points): In 1-2 paragraphs, discuss your prior beliefs about the relationship between three of the variables and the target variable, `TOTALBTU`. While our predictive models often become un-interpretable, we have to start with some set of beliefs about the relationship, starting with the existence of some relationship. You cannot choose `HDD65` or `CDD65`, though; these are too obvious. Make one plot of the relationship between `TOTALBTU` and some variable(s) that you think will be important and include that plot in this section. The plot has to be clean and clear, but the analysis behind it can be very simple and straightforward (this assignment is about the prediction, not a re-hash of Project 1). Do not include any of the code here: use `echo=F` to ensure that your code is not showing. You will turn in your .RMD file on D2L as well should the grader need to see your code.

3. Estimate a model (20 points): Estimate a robust and detailed predictive model that includes the relationship from (2) and anything and everything else you think will provide predictive power; the sky is the limit. Use one of the machine learning methods that we learned about (kNN, regression trees, LASSO/Ridge/Elastic Net; do not use regular OLS) and carefully use the train-test paradigm to arrive at your best prediction.
Some rules apply here:
- Do not use the "canned" routines for finding the optimal values of your tuning parameter; you'll need to set a range of complexity and use `lapply` or a loop to estimate over those complexity parameters. This means that you should not directly give `glmnet` (if using LASSO or Ridge) a vector of lambdas and should not use `cv.glmnet`, but rather should use `lapply` to estimate different models over a range of `lambda` (or `k`, or whatever your tuning parameters are). A sketch of this pattern follows these rules.
- Do not output your estimation code or underlying output. Your output should be clean, but you will need to upload your .RMD file for your assignment as well.
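For example, here is a minimal sketch of the required pattern for LASSO, assuming design matrices `X_train`/`X_test` and outcome vectors `y_train`/`y_test` have already been built (all object names are placeholders, and the lambda range is only a starting guess to iterate on):

```r
library(glmnet)

rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# A starting grid of lambdas; widen or narrow this range until the
# test-RMSE minimum is clearly visible
lambdas <- exp(seq(log(1e5), log(1), length.out = 100))

# Estimate one model per lambda rather than handing glmnet the whole grid
results <- lapply(lambdas, function(l) {
  fit <- glmnet(X_train, y_train, alpha = 1, lambda = l)  # alpha = 1 is LASSO
  data.frame(lambda     = l,
             train_rmse = rmse(y_train, predict(fit, newx = X_train)),
             test_rmse  = rmse(y_test,  predict(fit, newx = X_test)))
})
results <- do.call(rbind, results)
```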
Some hints may come in handy:
Most of the methods we learned start out with a highly complex specification formula. Unlike when we added variables to change complexity, our machine learning methods take a very detailed formula with polynomial terms and interactions, and use the tuning parameters to change the complexity of the model. So start with a very complex model in your formula or data.
As you tune your tuning parameters, you may find that a large part of the tuning parameter space (the values of the tuning parameters you’re trying) result in a flat, high RMSE. As we discussed in class, there is no “right” range for the tuning parameters, so you may need to adjust the range you cover so that you focus on the RMSE-minimizing values. If you’re in a tuning parameter range that results in very high RMSE, then your plot is going to “compress” the minimizing point down and you likely won’t be able to even tell there’s a minimum. Be smart and iterate the range for your tuning parameter estimates.
In the write-up for this section (section 3), include a table showing the range of complexity that was tried (e.g. the lambda from a LASSO), the training RMSE, and the testing RMSE. The table will be long, but that's OK. If you have more than 200 different values, then shorten the range around the test-RMSE-minimizing point. Use `round()` to cut down on the number of digits showing, but make sure enough are showing to find the minimizing level of complexity. Make a plot showing the progression of both RMSEs as complexity increases. The plot should show the nadir where the testing RMSE reaches its minimum. If you are using a method that has two complexity parameters, you'll need to include test and train RMSEs for every combination of parameter values. In that case, choose a coarse number of parameter values in the complexity parameters so that you are not searching over too many values. You can plot across one complexity parameter with the other complexity parameter held fixed. Your write-up should not include any code in this section, just the table and the plot. You can try using `knitr::kable()` to output a table; this function converts data.frame-like R objects into clean tables for RMarkdown. We haven't covered it in class, but it should be easy to figure out.
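A hedged sketch of that output, reusing the hypothetical `results` data frame from the earlier sketch:

```r
# Rounded table of lambda vs. train/test RMSE; kable renders it cleanly
knitr::kable(
  data.frame(lambda     = round(results$lambda, 2),
             train_rmse = round(results$train_rmse, 1),
             test_rmse  = round(results$test_rmse, 1)),
  caption = "Train and test RMSE across the tuning grid"
)

# Both RMSE curves against the tuning parameter, in colorblind-friendly colors
plot(results$lambda, results$test_rmse, type = "l", log = "x",
     ylim = range(results$train_rmse, results$test_rmse),
     xlab = "Lambda (log scale)", ylab = "RMSE", col = "#D55E00")
lines(results$lambda, results$train_rmse, col = "#0072B2")
legend("topright", legend = c("Test RMSE", "Train RMSE"),
       col = c("#D55E00", "#0072B2"), lty = 1)
```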
Finally, in a 2-3 sentence discussion, note which model you have selected as your optimal model and why.
4. Evaluate the Model (8 points): Take your model to the evaluation sample (see below) and calculate the RMSE for your optimal model on this sample. Then, see if your priors were correct by changing the variables you hypothesized to be related, and see if your predictive model changes its prediction accordingly. For instance, I think an increase in `HDD65` will increase the energy consumed. To see if this holds in my model, I can increase `HDD65` by some amount (a percent or a fixed amount, or a change in category if a categorical/factor variable) across all the data, and compare the new prediction to the old. Do this for each of the three prior predictions you stated in (2). A sketch of this check appears after this list.

5. Data and Hypotheses (6 points): Go back to the three datasets from your first project. Choose one that you think you, as a group, might want to study further. Write up three hypotheses (labeled H1, H2, and H3) that you might want to test or predictions you might want to make (you can mix if you'd like: two hypotheses and one prediction, etc.) and explain why those might be interesting to others. That is, tell me why you or others would be interested in testing each hypothesis, or predicting an outcome. This is unrelated to Sections 1-4.
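To illustrate the check in Section 4, here is a minimal sketch, assuming a fitted glmnet object `best_fit` at your chosen lambda and a helper `make_X()` that builds the design matrix (both names are hypothetical; `rmse()` is from the earlier sketch):

```r
# Evaluation-sample RMSE for the chosen model
pred_base <- predict(best_fit, newx = make_X(evaluation))
rmse(evaluation$TOTALBTU, pred_base)

# Prior check: raise HDD65 by 10% everywhere and compare predictions
bumped       <- evaluation
bumped$HDD65 <- bumped$HDD65 * 1.10
pred_bumped  <- predict(best_fit, newx = make_X(bumped))
mean(pred_bumped - pred_base)  # positive if the model agrees with the prior
```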
Your data processing approach should be as follows (a code sketch follows this list):

- Import the data from RECS at https://www.eia.gov/consumption/residential/data/2020/csv/recs2020_public_v7.csv
- Drop `TOTALBTUSPH`, `TOTALBTUWTH`, `TOTALBTUOTH`, and `DOEID`.
- Make sure all variables that are numeric but represent factor variables (e.g. `CELLAR`) are converted into factors. Do not convert variables like `HDD65` that are meant to be numeric. You can use `mutate(across(c(X, Y, Z), as.factor))` to convert columns X, Y, and Z (obviously, substitute your columns) to factor variables. Check to make sure you got it right using `str`!
- Digest and understand your data. The data is described in detail in the codebooks found here: https://www.eia.gov/consumption/residential/data/2020/index.php?view=microdata.
- Use `set.seed(2422024)` and then immediately do the following:
  - Sample exactly 1,000 households to hold out as the final evaluation data. This is neither test nor train, but a third holdout "evaluation" set with which we will evaluate the final model.
  - Once you have removed the 1,000-household final evaluation data, split the remaining sample into test and train. Stick to 50/50. This should give you equal samples of 8,748 in each. You'll have a `train`, a `test`, and an `evaluation` sample, all mutually exclusive and adding up to 18,496 rows total.
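Here is a minimal sketch of those steps, assuming the `dplyr` package; the factor column list is a placeholder that you must build yourself from the codebook:

```r
library(dplyr)

# Import straight from EIA
recs <- read.csv("https://www.eia.gov/consumption/residential/data/2020/csv/recs2020_public_v7.csv")

# Drop the TOTALBTU subcategories and the household ID
recs <- recs %>% select(-TOTALBTUSPH, -TOTALBTUWTH, -TOTALBTUOTH, -DOEID)

# Convert numeric-coded categoricals to factors; CELLAR comes from this
# assignment, but the full list must come from the codebook
recs <- recs %>% mutate(across(c(CELLAR), as.factor))
str(recs)  # confirm types before moving on

# Seed, carve out the evaluation set, then split the rest 50/50
set.seed(2422024)
eval_idx   <- sample(nrow(recs), 1000)
evaluation <- recs[eval_idx, ]
remaining  <- recs[-eval_idx, ]

train_idx <- sample(nrow(remaining), nrow(remaining) / 2)
train     <- remaining[train_idx, ]
test      <- remaining[-train_idx, ]  # 8,748 rows each given 18,496 total
```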
Evaluation
The project is worth a total of 50 points. I will evaluate according to the following rubric:
Part 1 (10 points)
- Did you drop the stated variables?
- Did you convert factors correctly?
- Did you correctly sample your data into three sets?
- Did you show your processing code?
Part 2 (6 points, 2 points per prior)
- Do your three priors and their descriptions make sense?
- Does your plot show the relationship?
- Does the plot meet the technical requirements (properly labeled, clear, colorblind-friendly, etc.)?
Part 3 (20 points)
- Did you estimate the predictive model properly without using a “canned” routine (e.g. manually testing across a range of complexity)?
- Does your table show the relationship between complexity and the train and test RMSEs?
- Is it nicely formatted and clear to read?
- Did you suppress the code and model results so that only the RMSE table, plot, and discussion is showing?
- Does the plot meet the technical requirements (properly labeled, clear, colorblind-friendly, etc.)?
Part 4 (8 points)
- Do you show the RMSE for the evaluation sample?
- Did you properly test your three priors?
- Is the discussion clear and concise?
Part 5 (6 points, 2 points per hypothesis/prediction)
- Did you state actual hypotheses – specific questions that are testable?
- For proposed predictions, did you state the variables that would be predicted?
- In both cases, did you properly describe why one might be interested in the prediction or hypothesis?
Turning it in
You are required to turn in both your final, rendered project with no code showing (except as stated in Section 1) AND your .RMD file used to render the project. All code generated with generative AI (e.g. ChatGPT) must be labeled as such, as required in the syllabus.
Only one group member should post on D2L. Please make sure you coordinate to ensure this is done.