Welcome to Fall Semester 2025

Project 1

Turning This In

Only one member of your group should turn in your project on D2L, under “Project 1,” no later than Saturday, October 11th at 11:59 PM. The copy turned in must have each group member’s name at the top.

Part 1: Rats, rats, rats.

New York City is full of urban wildlife, and rats are one of the city’s most infamous animal mascots. Rats in NYC are plentiful, but they also deliver food, so they’re useful too.

NYC keeps incredibly detailed data regarding animal sightings, including rats, and it makes this data publicly available.

For this first project, pretend that you are an analyst with the NYC Mayor’s office, and you have been tasked with getting a better understanding of the rat problem. Your job is to use R and ggplot2 to tell an interesting story hidden in the data. You must create a story by looking carefully at the data, finding some insight into where or when rat sightings occur, and describing to the Mayor how this insight may inform a strategy to address the rats. Your assignment will take the form of a professional memo that presents the data and insights along with appropriate visualizations that communicate the story in-line. Your memo should be approximately 2-4 pages (including plots) and, of course, will use Rmarkdown rendered to PDF using LaTeX.
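
For reference, a minimal R Markdown header that renders to PDF via LaTeX might look like the following (the title, author, and date fields are placeholders to replace with your own):

---
title: "Rat Sightings in New York City: A Memo to the Mayor"
author: "Group member names here"
date: "Fall 2025"
output: pdf_document
---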

Instructions

Here’s what you need to do:

  1. Download New York City’s database of rat sightings since 2010:

  2. Summarize the data somehow. The raw data has more than 150,000 rows, which means you’ll need to aggregate the data (filter(), group_by(), and summarize() will be your friends). Consider looking at the number of sightings per borough, per year, per dwelling type, etc., or a combination of these, like the change in the number of sightings across the 5 boroughs between 2010 and 2024. Consider the meaning of the variables you create – remember how the total number of murders in a state was mostly just a function of population size, while the murder rate was a more useful variable. (Of course, I just gave you these ideas, so you’ll need to be a bit more creative than just doing exactly what I just said.) A minimal sketch of this kind of aggregation appears after this list.

  3. Explore the data further. Once you have summarized the data in one way, use what you find to explore other patterns you see in the data. You may bring in other data (it is not hard to find and add to your data the population of each borough or the average daily temperature by month, etc.). You can use case_when() to add data that’s at a coarse level (year, borough, etc.), or merge() if you’re comfortable with it; a sketch of the case_when() approach appears at the end of the Starter code section. Be creative, but do so in service of further exploration and analysis. Drill down on something you see in the data to tell a story. Remember, as analysts, we are looking for “patterns in the data” that are not easily written in closed form.

  4. Create appropriate visualizations based on the data you summarized and explored. Pay attention to every detail - color choice, labels, axis text, etc. Use proper capitalization. Make sure everything lays out properly in your final product. This is not trivial - it will take you some time to get everything just right. (A plotting sketch also appears after this list.)

  5. Write a polished, professional memo presenting your analysis to the Mayor. The memo should tell the story you found in the data. It should not narrate your discovery process (e.g. it should not say “we tried looking at…and then ran ggplot(…)”) but rather should weave your exploration into a narrative driven by your graphics, with the text filling in the context. We are specifically looking for a discussion of the following:

    • What story are you telling with your graphics?
    • What new insight have you found?
    • How have you applied reasonable standards in visual storytelling?
    • What policy implication is there (if any)?
  6. Upload the following outputs to D2L:

    • A PDF file of your memo with your graphics embedded in it. This means you’ll need to do all your coding in an R Markdown file and embed your code in chunks.
    • Note that Part 2 of this project should be included in this PDF in its own section (see below).
    • Nothing else. No .Rmd, no code, nothing but your clean, polished memo with Part 1 and Part 2.
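
As mentioned in item 2, here is a minimal sketch of one possible aggregation. It assumes the rats_clean data frame built by the starter code below, and the variable names come from that code; treat it as a starting point, not the analysis itself.

# Count sightings per borough per year (one row per borough-year pair)
sightings_by_borough_year <- rats_clean %>%
  group_by(borough, sighting_year) %>%
  summarize(n_sightings = n(), .groups = "drop")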
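
And for item 4, here is one possible shape for a plot of that summary. The palette, labels, and theme are illustrative choices, not requirements:

# Line chart of sightings over time, one line per borough
ggplot(sightings_by_borough_year,
       aes(x = sighting_year, y = n_sightings, color = borough)) +
  geom_line() +
  scale_color_viridis_d() +  # a colorblind-friendly palette
  labs(x = "Year", y = "Number of sightings", color = "Borough",
       title = "Rat Sightings by Borough Over Time") +
  theme_minimal()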

Some important notes on your assignment

  • Your assignment should be clean and polished, as if you were a city employee turning in a work product. It should flow nicely and use subsections (using ### at the start of the line) as appropriate.
  • Again, do not “annotate” your thought process (e.g. do not write “we tried X but there were too many NA’s, so we did Y instead”). This should be a polished memo suitable for turning in as a work product. Each graphic should guide the analysis towards the next.
  • Your code should not appear in your output - it should be only your plots and memo writing.
    • To turn off code echoing, add echo = FALSE to each of your code chunk options (e.g. {r setup, echo = FALSE}), or set it globally in the first code chunk via the knitr::opts_chunk$set() function (see the sketch after this list).
  • Make sure you take a look at the Evaluation criteria, below.
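
For example, putting this in your first chunk (conventionally named setup, with include=FALSE in the chunk options) hides all code globally; the warning and message settings are optional extras that also suppress package-loading noise:

knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE)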

Starter code

I’ve provided some starter code below. A couple comments about it:

  • By default, read_csv() treats cells that are empty or “NA” as missing values. This rat dataset uses “N/A” to mark missing values, so we need to add that as a possible marker of missingness (hence na = c("", "NA", "N/A"))
  • To make life easier, I’ve renamed some of the key variables you might work with. You can rename others if you want.
  • I’ve also created a few date-related variables (sighting_year, sighting_month, sighting_day, and sighting_weekday). You don’t have to use them, but they’re there if you need them. The functions that create these, like year() and wday(), are part of the lubridate library.
  • The date/time variables are formatted like 04/03/2017 12:00:00 AM, which R is not able to automatically parse as a date when reading the CSV file. You can use the mdy_hms() function in the lubridate library to parse dates that are structured as “month-day-year hour-minute-second”. There are also a bunch of other variations of this function, like ymd(), dmy(), etc., for other date formats.
  • There’s one row with an unspecified borough, so I filter that out.

library(tidyverse)
library(lubridate)
rats_raw <- read_csv("data/Rat_Sightings.csv", na = c("", "NA", "N/A"))
# If you get an error that says "All formats failed to parse. No formats
# found", it's because the mdy_hms function couldn't parse the date. The date
# variable *should* be in this format: "04/03/2017 12:00:00 AM", but in some
# rare instances, it might load without the seconds as "04/03/2017 12:00 AM".
# If there are no seconds, use mdy_hm() instead of mdy_hms().
rats_clean <- rats_raw %>%
  rename(created_date = `Created Date`,
         location_type = `Location Type`,
         borough = Borough) %>%
  mutate(created_date = mdy_hms(created_date)) %>%
  mutate(sighting_year = year(created_date),
         sighting_month = month(created_date),
         sighting_day = day(created_date),
         sighting_weekday = wday(created_date, label = TRUE, abbr = FALSE)) %>%
  filter(borough != "Unspecified")

You’ll summarize the data with functions from dplyr in the tidyverse, including stuff like count(), arrange(), filter(), group_by(), summarize(), and mutate(). As mentioned before, if you’re bringing in outside data (e.g. population of a borough), you can use case_when() to add it, or try using merge() and seek help at office hours if it’s not clear to you.
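
Here’s a minimal sketch of the case_when() approach mentioned above, again starting from the rats_clean data frame built by the starter code. The population figures are 2020 Census counts included purely for illustration – verify them against whatever source you cite, and check how borough names are capitalized in your copy of the data before matching on them.

# Count sightings per borough-year, attach borough populations, and
# compute a per-capita sighting rate
rats_rates <- rats_clean %>%
  count(borough, sighting_year, name = "n_sightings") %>%
  mutate(population = case_when(
    borough == "BRONX"         ~ 1472654,
    borough == "BROOKLYN"      ~ 2736074,
    borough == "MANHATTAN"     ~ 1694251,
    borough == "QUEENS"        ~ 2405464,
    borough == "STATEN ISLAND" ~ 495747
  )) %>%
  mutate(sightings_per_100k = n_sightings / population * 100000)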

Part 2: Data Hunting

For the second part of the project, your task is simple. Your group must identify three different data sources for potential use in your final project. You are not bound to this decision.

Do not use Kaggle data. While Kaggle is useful for learning data science, part of this assignment is learning how to find actual data in the wild. I give zero credit for data links to Kaggle data. You must find your data from the source.

For each dataset, you must write a single short paragraph explaining what about this data interests the group and what general questions one could answer with it. Add this to the memo from Part 1, using ## to make a new header in Rmarkdown for this section.
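
For example, the start of that section in your .Rmd might look like this (the text after ## and ### is a placeholder to replace with your own):

## Part 2: Data Hunting

### Dataset 1: Name of the data source

A short paragraph on what about this data interests the group and what general questions one could answer with it.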

Evaluations

I will only give top marks to those groups showing initiative and cleverness. I will use the following weights for final scores:

Part 1

  1. Technical difficulty: Does the final project show mastery of the R and ggplot skills we’ve discussed thus far? Did the underlying data analysis use and expand on the tools we learned in class? Full credit requires going beyond the basic tidyverse filtering / grouping / summarizing. (10 points)

  2. Clarity / use of ggplot: Are all of the design elements properly executed in the plots and memo? Are axes and legends appropriately labeled, and are colorblind-friendly colors used? Does each plot display properly in the memo without any cut-off axes? Is everything properly spelled? (5 points)

  3. Appropriateness of visuals: Do the visualizations communicate the analysis? Do they use visual cues to show the analysis? (10 points)

  4. Storytelling: Does your memo clearly convey the analysis? Does it flow appropriately with the visuals? Does it include a policy recommendation? (10 points)

Part 2

Each piece of data and description is worth 5 points. (15 points total)

Footnotes

  1. You can approach this in a couple of different ways – you can write the memo and then include the full figure and code at the end, similar to this blog post, or you can write the memo in an incremental way, describing the analytical steps that each figure illuminates, ultimately arriving at a final figure, like this blog post.

  2. The three different sources need not be different websites or from different organizations. For example, three different tables from the US Census would be sufficient.