2: Visualization with ggplot

Due Date

This assignment is due on Monday, January 26th.

All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .Rmd file, just the properly knitted PDF. All assignments must be properly rendered to PDF using LaTeX. Make sure you start your assignment early enough that you have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an RStudio instance on posit.cloud is always a feasible alternative. Remember, if you use any AI for coding, you must comment each line with your own interpretation of what that line of code does.

A note on exercises

Instead of putting all of your exercises at the end in one section, each of the 3 exercises is contained in a callout box like this:

Exercise 1 of 3

Do this stuff

So find those in the assignment below. There are a lot of questions clearly labeled “OPTIONAL” as well – it helps to at least take a look at them, but only those questions in the labeled boxes are required for this lab assignment.

tidyverse and ggplot2

We start by assuming that you are familiar with (and have installed) tidyverse and ggplot2.

The Recipe

  1. Tell the ggplot() function what our data is.
  2. Tell ggplot() what relationships we want to see. For convenience we will put the results of the first two steps in an object called p.
  3. Tell ggplot() how we want to see the relationships in our data.
  4. Layer on geoms as needed, by adding them on the p object one at a time.
  5. Use some additional functions to adjust scales, labels, tickmarks, titles.
  • e.g. scale_, labs(), and guides() functions
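
The recipe above can be sketched end to end as follows. This is only a minimal illustration: it assumes the gapminder package is installed to supply the gapminder data frame used later in these notes, and the labels and title are placeholders rather than required choices.

```r
library(ggplot2)
library(gapminder)  # assumption: supplies the gapminder data frame

# Steps 1 and 2: name the data and the relationships (aesthetic mappings)
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Steps 3 and 4: choose how to draw the data, adding geoms one at a time
p_points <- p + geom_point(alpha = 0.3)

# Step 5: adjust scales and labels
p_final <- p_points +
  scale_x_log10() +
  labs(x = "GDP per capita (log scale)",
       y = "Life expectancy (years)",
       title = "Illustrative example plot")

p_final
```

Storing each stage in its own object (p, p_points, p_final) is just for illustration; you can equally chain everything onto p in a single expression.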

As you start to run more R code, you’re likely to run into problems. Don’t worry: it happens to everyone. I have been writing code in numerous languages for years, and every day I still write code that doesn’t work. Sadly, R is particularly persnickety, and its error messages are often opaque.

Start by carefully comparing the code that you’re running to the code in these notes. R is extremely picky, and a misplaced character can make all the difference. Make sure that every ( is matched with a ) and every " is paired with another ". Sometimes you’ll run the code and nothing happens. Check the left-hand side of your console: if it’s a +, it means that R doesn’t think you’ve typed a complete expression and is waiting for you to finish it. In this case, it’s usually easiest to start from scratch by pressing ESCAPE to abort the current command.

One common problem when creating ggplot2 graphics is to put the + in the wrong place: it has to come at the end of the line, not the start.
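
As a small illustration of the + placement rule, here is a sketch using R's built-in mtcars data (chosen only so that the example runs without extra packages). The broken version is left as comments so the block still runs.

```r
library(ggplot2)

# Works: each line ends with +, so R knows the expression continues
p_ok <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Does not work as intended (kept as comments so this block still runs):
# ggplot(mtcars, aes(x = wt, y = mpg))
#   + geom_point()
# R treats the first line as a complete expression, then tries to
# evaluate "+ geom_point()" on its own, which raises an error.

p_ok
```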

Mapping Aesthetics vs Setting them

library(ggplot2)
library(gapminder)  # provides the gapminder data frame used below

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp, color = 'yellow'))
p + geom_point() + scale_x_log10()

This is interesting (or annoying): the points are not yellow. How can we tell ggplot to draw yellow points?

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp, ...))
p + geom_point(...) + scale_x_log10()

Try it: (OPTIONAL) describe in your words what is going on. One way to avoid such mistakes is to read arguments inside aes(<property> = <variable>) as: the property <property> in the graph is determined by the data in <variable>.

Aesthetics convey information about a variable in the dataset, whereas setting the color of all points to yellow conveys no information about the dataset - it changes the appearance of the plot in a way that is independent of the underlying data.

Remember: color = 'yellow' and aes(color = 'yellow') are very different, and the second usually makes no sense, as 'yellow' is treated as data.
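
To make the distinction concrete, here is a hedged sketch that contrasts the two on the gapminder data. The color "purple" is an arbitrary choice for illustration; any valid R color name behaves the same way.

```r
library(ggplot2)
library(gapminder)  # assumption: supplies the gapminder data frame

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))

# Setting: color is a fixed property of the geom, given outside aes()
p_set <- p + geom_point(color = "purple") + scale_x_log10()

# Mapping: color is driven by a variable in the data, given inside aes()
p_map <- p + geom_point(mapping = aes(color = continent)) + scale_x_log10()

p_set
p_map
```

Only the mapped version gets a legend, because only there does color carry information about the data.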

p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp))
p + geom_point() +
  geom_smooth(color = "orange", se = FALSE, linewidth = 8, method = "lm") +
  scale_x_log10()

Try it: (OPTIONAL) Write down what all those arguments in geom_smooth(...) do.

Coloring by continent:

library(scales)
p <- ggplot(data = gapminder,
            mapping = aes(x = gdpPercap, y = lifeExp, color = continent, fill = continent))
p + geom_point()
p + geom_point() + scale_x_log10(labels = dollar)
p + geom_point() + scale_x_log10(labels = dollar) + geom_smooth()

Try it: (OPTIONAL) What does fill = continent do? What do you think about the match of colors between lines and error bands?

Tidy up your data, then plot

For Exercise 3, use this data on MI county-level income, age, population, and yearly commute time, which can be loaded using the following line of code:

midata <- read.csv('https://ec242.netlify.app/data/milong.csv', stringsAsFactors = FALSE)

Exercise 3 of 3

First, use pivot_wider to get the above data into tidy form (each observation should be a county). This data is taken from the 2023 US Census American Community Survey, which we’ll learn about later on.
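
If pivot_wider is new to you, the following sketch shows the general idea on a tiny made-up data frame. The column names county, variable, and value are hypothetical and will not necessarily match the columns in the assignment data, so inspect that data first.

```r
library(tidyr)

# Hypothetical long-format data: one row per county-variable pair
toy_long <- data.frame(
  county   = c("A", "A", "B", "B"),
  variable = c("income", "population", "income", "population"),
  value    = c(50000, 12000, 62000, 250000)
)

# Spread the variable names into columns so each row is one county
toy_wide <- pivot_wider(toy_long,
                        names_from = variable,
                        values_from = value)
toy_wide
```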

Then, categorize each county into “Small” and “Large” based on population. Do the same for age (“Young” and “Old”). A logical cut-point would be the median.
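
One possible way to build such a category is a median split with ifelse(). The sketch below uses a made-up data frame with hypothetical column names; treating values exactly at the median as "Small" is an arbitrary choice.

```r
# Hypothetical wide data: one row per county
toy_counties <- data.frame(
  county     = c("A", "B", "C", "D"),
  population = c(12000, 250000, 48000, 900000)
)

# Counties above the median are "Large"; the rest are "Small"
cut_point <- median(toy_counties$population)
toy_counties$size_group <- ifelse(toy_counties$population > cut_point,
                                  "Large", "Small")
toy_counties
```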

Then, plot the relationship between two of the variables of your choice using the geometry of your choice. Use an aesthetic mapping of your choice on the young/old or small/large category to illustrate how the relationship may differ across categories. If you map young/old to an aesthetic, do not also use age on the X or Y axis (that would be redundant).

Label the axes and the legend with clear, easy to understand language. Below your plot, write a few sentences to describe what you’ve visualized and interpret the visualization.