Working with R and RStudio
Frontmatter (Spring 2025)
I like to use this spot to publish course announcements. Not so much for y’all, but more so I remember. If you see any announcements that don’t say “Spring 2025” there’s a good chance it’s leftover from earlier course offerings. That’s a reasonable indicator that you have jumped ahead… or… I forgot to edit something.
Participation extra credit
A careful read of our syllabus under “class participation” will show that I do give extra credit for answering questions and (mainly) sharing completed R coding tasks. That is, we’ll walk through some examples, and when we hit a box that looks like this:
Do some stuff in R
I’ll ask you to give it a go. Then, after a few minutes, I’ll ask if anyone wants to share their answer. You get one point of participation extra credit. Five points is worth 1% of a grade boost, so these aren’t negligible points. Max boost over the semester is 5%.
Assignment / Exercises
The Assignments page has all of our weekly lab assignments (including Week 1, due on Monday at 11:59pm). The assignments often have a preamble and some code that has to be used to set you up for the questions. The questions to be completed and turned in are under “Exercises” at the very end.
Introduction to Applications
Applications in this class are designed to be presented in-class. Accordingly, the notes here are not comprehensive. Instead, they are intended to guide students through some practical thing.
I’m also aware that my writing is dry and lifeless. If you’re reading this online without the advantage of seeing it in person, don’t worry—I’ll be “funnier” in class.1
R basics
The course content below should be considered a prerequisite for success. For those concerned about the basics of R, you absolutely must read this content and attempt the coding exercises. If you struggle to follow the content, please contact the professor or TA.
Making sure we’re on the same page
If you have not yet installed R and RStudio, and if you have not yet successfully rendered the “weekly writing” template, then please make sure you do so today. Use the resources page for instructions.
We have created a video walkthrough for the basics of using R for another course, but it is useful here. You can see part A here (labeled “Part 2a”) and part B here (labeled “Part 2b”). You should already be at this level of familiarity with R, but if you need a review, this is a good place to start.
The (very) basics of R
Before we get started with the motivating dataset, we need to cover the very basics of R.
Console and Script
RStudio has two main areas in which code is written. The console appears at the bottom of your screen. You can interact directly with R through the console. Your script editor is at the top of your screen. This is where code that you want to save is written, usually in an order such that an entire set of commands can be written top-to-bottom and will run with the desired result. In the script editor, you can highlight lines of code and use command+enter (Mac) or ctrl+enter (Windows) to run just that bit of code.
In class, you’ll likely want to copy bits of code into a blank document and, at the end, save the document as notes, just for your reference. Remember, if you put something directly into the console, you won’t have a record of it. Putting it into a script will keep a version of it.
Rmarkdown
Rmarkdown lets us combine the processing and output from R code with text and headers written in plain English, which lets us do something in code and then show it and discuss it in one place.
An .Rmd (Rmarkdown document) like your lab and weekly writing templates has three parts. First, a YAML header up at the top establishes some variables for use in rendering to PDF. Second, “code chunks” are processed by R. Third, markdown text is processed as normal text (via the markdown language). You do work in code chunks, the output is included in the document, and you discuss the results in-line. Make sure you read using Rmarkdown and using markdown before you do your first weekly reading.
Your R code goes here in these “chunks”. In the upper right, you’ll see a green down-pointing triangle and a green right-pointing triangle. The first one (down-pointing) runs all of the previous code chunks up to this one while the second (right-pointing) runs this code chunk.
This is very useful when you are iterating through steps to develop your code. Running a code chunk will show you the output from that code chunk, which is what will drop into your .Rmd file when you knit it. Note that you can also highlight code and use CTRL+ENTER (or CMD+ENTER for macs) to run code.
The header on the code chunk tells RStudio what language to use to run the chunk (r), and can take some settings for displaying output. The one you want to know now is echo=T. When this is TRUE (the document’s default), then knitting will include a copy of your code along with the output. Don’t change this to FALSE or I can’t see your work when grading.
install.packages()
You should never, ever have install.packages() in your recorded code (in your “code” file if using an R script, or in your code chunks in a .Rmd). If it’s in your .Rmd file, when you “knit”, it’ll try to install the packages and will get very confused. I saw a lot of install.packages() in code, so make sure to take them out.
To use a package, you include (in your .Rmd) library(packageName). That goes in your code, usually first thing.
Now, on to the use of R
Objects
Suppose a relatively math unsavvy student asks us for help solving several quadratic equations of the form $ax^2 + bx + c = 0$, which of course depend on the values of $a$, $b$, and $c$.
One advantage of programming languages is that we can define variables and write expressions with these variables, similar to how we do so in math, but obtain a numeric solution. We will write out general code for the quadratic equation below, but if we are asked to solve $x^2 + x - 1 = 0$, then we define:
a <- 1
b <- 1
c <- -1
which stores the values for later use. We use <- to assign values to the variables.
We can also assign values using = instead of <-, but some recommend against using = to avoid confusion.2
Copy and paste the code above into your console (or use the “copy code” button in the box) to define the three variables. Note that R does not print anything when we make this assignment. This means the objects were defined successfully. Had you made a mistake, you would have received an error message. Throughout these written notes, you’ll have the most success if you continue to copy code into a blank R script or into your own console.
To see the value stored in a variable, we simply ask R to evaluate a and it shows the stored value:
a
[1] 1
A more explicit way to ask R to show us the value stored in a is using print like this:
print(a)
[1] 1
By default, just running a in the console results in the assumption that you wanted to print out a.
We use the term object to describe stuff that is stored in R. Variables are examples, but objects can also be more complicated entities such as functions, which are described later.
The workspace
As we define objects in the console, we are actually changing the workspace. You can see all the variables saved in your workspace by typing:
ls()
[1] "a" "b" "c" "filter"
In RStudio, the Environment tab shows the values.
We should see a, b, and c. If you try to recover the value of a variable that is not in your workspace, you receive an error. For example, if you type x you will receive the following message: Error: object 'x' not found.
Now since these values are saved in variables, to obtain a solution to our equation, we use the quadratic formula:
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
[1] 0.618034
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
[1] -1.618034
Copy and paste the code above into your console (or use the “copy code” button in the box) to define the three variables and to run the code that implements the quadratic formula. Note that R does not print anything when we make the assignments for a, b, and c. This means the objects were defined successfully. Had you made a mistake, you would have received an error message. Put them all together and run them to see the solutions. Note that the two lines that compute the solutions do not assign the result to a new object; thus, they get printed in the output.
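If you want to convince yourself that the two numbers really are solutions, one quick check is to plug them back into the equation. The sketch below assumes the same a, b, and c as above; the names root_1 and root_2 are just labels made up for this example.

```r
# Coefficients of x^2 + x - 1 = 0, as defined in the text
a <- 1
b <- 1
c <- -1

# Store the two roots instead of only printing them
# (root_1 and root_2 are hypothetical names for this sketch)
root_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
root_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)

# Substituting each root back in should give (approximately) zero
a * root_1^2 + b * root_1 + c
a * root_2^2 + b * root_2 + c
```

Because computers store decimals with finite precision, you may see a tiny number very close to zero rather than exactly zero.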
Functions
Once you define variables, the data analysis process can usually be described as a series of functions applied to the data. R includes several zillion predefined functions and most of the analysis pipelines we construct make extensive use of the built-in functions. But R’s power comes from its scalability. We have access to (nearly) infinite functions via install.packages and library. As we go through the course, we will carefully note new functions we bring to each problem. For now, though, we will stick to the basics.
Note that you’ve used a function already: you used the function sqrt to solve the quadratic equation above. Built-in functions like sqrt do not appear in the workspace because you did not define them, but they are available for immediate use.
In general, we need to use parentheses to evaluate a function. If you type ls, the function is not evaluated and instead R shows you the code that defines the function. If you type ls() the function is evaluated and, as seen above, we see objects in the workspace.
Unlike ls, most functions require one or more arguments. Below is an example of how we assign an object to the argument of the function log. Remember that we earlier defined a to be 1:
log(8)
[1] 2.079442
log(a)
[1] 0
You can find out what the function expects and what it does by reviewing the very useful manuals included in R. You can get help by using the help function like this:
help("log")
For most functions, we can also use this shorthand:
?log
The help page will show you what arguments the function is expecting. For example, log needs x and base to run. However, some arguments are required and others are optional. You can determine which arguments are optional by noting in the help document that a default value is assigned with =. Defining these is optional.3 For example, the base of the function log defaults to base = exp(1); that is, log evaluates the natural log by default.
If you want a quick look at the arguments without opening the help system, you can type:
args(log)
function (x, base = exp(1))
NULL
You can change the default values by simply assigning another object:
log(8, base = 2)
[1] 3
Note that we have not been specifying the argument x
as such:
log(x = 8, base = 2)
[1] 3
The above code works, but we can save ourselves some typing: if no argument name is used, R assumes you are entering arguments in the order shown in the help file or by args. So by not using the names, it assumes the arguments are x followed by base:
log(8,2)
[1] 3
If using the arguments’ names, then we can include them in whatever order we want:
log(base = 2, x = 8)
[1] 3
To specify arguments, we must use =, and cannot use <-.
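The same rules apply to any function, not just log. Here is a small sketch using round, which (per its help page) takes x and an optional digits argument that defaults to 0. The object names by_position and by_name are made up for this example.

```r
# Look at the arguments round() expects
args(round)

# Leaving digits out uses its default value
round(3.14159)

# By position: x first, then digits
by_position <- round(3.14159, 2)

# By name: any order works
by_name <- round(digits = 2, x = 3.14159)

# Both calls give the same result
identical(by_position, by_name)
```

Naming arguments costs a little typing but makes code easier to read later, especially for functions with many optional arguments.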
There are some exceptions to the rule that functions need the parentheses to be evaluated. Among these, the most commonly used are the arithmetic and relational operators. For example:
2 ^ 3
[1] 8
You can see the arithmetic operators by typing:
help("+")
or
?"+"
and the relational operators by typing:
help(">")
or
?">"
Never use ? in your code. The help operator, ?..., should only be used directly in the console. If you put it in your code, it’ll keep opening the help, and when you include it in an Rmarkdown document, it’ll behave strangely. Don’t do it!
Other prebuilt objects
There are several datasets that are included for users to practice and test out functions. You can see all the available datasets by typing:
data()
This shows you the object name for these datasets. These datasets are objects that can be used by simply typing the name. For example, if you type:
co2
R will show you Mauna Loa atmospheric CO2 concentration data.
Other prebuilt objects are mathematical quantities, such as the constant $\pi$ and $\infty$:
pi
[1] 3.141593
Inf+1
[1] Inf
Variable names
We have used the letters a, b, and c as variable names, but variable names can be almost anything. Some basic rules in R are that variable names have to start with a letter, can’t contain spaces, and should not be variables that are predefined in R. For example, don’t name one of your variables install.packages by typing something like install.packages <- 2. R will not stop you from doing this (it simply creates a new variable with that name), so it’s important to develop good habits.
A nice convention to follow is to use meaningful words that describe what is stored, use only lower case, and use underscores as a substitute for spaces. For the quadratic equations, we could use something like this:
solution_1 <- (-b + sqrt(b^2 - 4*a*c)) / (2*a)
solution_2 <- (-b - sqrt(b^2 - 4*a*c)) / (2*a)
For more advice, we highly recommend studying Hadley Wickham’s style guide (http://adv-r.had.co.nz/Style.html).
Saving your workspace
Values remain in the workspace until you end your session or erase them with the function rm. But workspaces also can be saved for later use. In fact, when you quit R, the program asks you if you want to save your workspace. If you do save it, the next time you start R, the program will restore the workspace.
We actually recommend against saving the workspace this way because, as you start working on different projects, it will become harder to keep track of what is saved. Instead, we recommend you assign the workspace a specific name. You can do this by using the function save or save.image. To load, use the function load. When saving a workspace, we recommend the suffix rda or RData. In RStudio, you can also do this by navigating to the Session tab and choosing Save Workspace As. You can later load it using the Load Workspace options in the same tab. You can read the help pages on save, save.image, and load to learn more.
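As a rough sketch of how save and load fit together, the example below saves two objects to a file, removes them, and then brings them back. It writes to a throwaway location from tempfile() purely so the example cleans up after itself; in your own work you would pick a meaningful file name. The variable name path is made up for this example.

```r
a <- 1
b <- 1

# tempfile() gives a throwaway path for this sketch; in practice you would
# choose your own file name ending in .rda or .RData
path <- tempfile(fileext = ".rda")

# Save only the objects we name (save.image() would save everything)
save(a, b, file = path)

# Remove them from the workspace, then check what is left
rm(a, b)
ls()

# Restore the saved objects under their original names
load(path)
a
b
```

Note that load puts the objects back under the names they had when they were saved, so it can silently overwrite objects of the same name in your workspace.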
Motivating scripts
To solve another equation such as $3x^2 + 2x - 1 = 0$, we redefine the variables and recompute the solutions:
a <- 3
b <- 2
c <- -1
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
By creating and saving a script with the code above, we would not need to retype everything each time and, instead, simply change the variable values. Try writing the script above into an editor and notice how easy it is to change the variables and receive an answer.
The answer you get from the 4th and 5th lines will depend on the values of a, b, and c. If you were to type a new number directly into your console, such as c = 5.33, and then re-run the last two lines, you will get a different answer. Your R “environment” is affected by what is run from a script and by what you type in the console. It is good (and necessary) practice to write all your code in a script (or in your Rmarkdown document), and run from the script. Always. Periodically running a script fresh from the start (clearing everything out of the environment first) is a good idea as well.
- What is the sum of the first 100 positive integers? The formula for the sum of integers $1$ through $n$ is $n(n+1)/2$. Define $n = 100$ and then use R to compute the sum of $1$ through $100$ using the formula. What is the sum?
- Now use the same formula to compute the sum of the integers from 1 through 1,000.
- Look at the result of typing the following code into R:
n <- 1000
x <- seq(1, n)
sum(x)
Based on the result, what do you think the functions seq and sum do? You can use help.
  - sum creates a list of numbers and seq adds them up.
  - seq creates a list of numbers and sum adds them up.
  - seq creates a random list and sum computes the sum of 1 through 1,000.
  - sum always returns the same number.
- In math and programming, we say that we evaluate a function when we replace the argument with a given number. So if we type sqrt(4), we evaluate the sqrt function. In R, you can evaluate a function inside another function. The evaluations happen from the inside out. Use one line of code to compute the log, in base 10, of the square root of 100.
- Which of the following will always return the numeric value stored in x? You can try out examples and use the help system if you want.
  - log(10^x)
  - log10(x^10)
  - log(exp(x))
  - exp(log(x, base = 2))
Commenting your code
If a line of R code starts with the symbol #, it is not evaluated. We can use this to write reminders of why we wrote particular code. For example, in the script above we could add:
## Code to compute solution to quadratic equation of the form ax^2 + bx + c
## define the variables
a <- 3
b <- 2
c <- -1

## now compute the solution
(-b + sqrt(b^2 - 4*a*c)) / (2*a)
(-b - sqrt(b^2 - 4*a*c)) / (2*a)
Data types
Variables in R can be of different types. For example, we need to distinguish numbers from character strings and tables from simple lists of numbers. The function class helps us determine what type of object we have:
a <- 2
class(a)
[1] "numeric"
To work efficiently in R, it is important to learn the different types of variables and what we can do with these.
Data frames
Up to now, the variables we have defined are just one number. This is not very useful for storing data. The most common way of storing a dataset in R is in a data frame. Conceptually, we can think of a data frame as a table with rows representing observations and the different variables reported for each observation defining the columns. Data frames are particularly useful for datasets because we can combine different data types into one object.
A large proportion of data analysis challenges start with data stored in a data frame. For example, we stored the data for our motivating example in a data frame. You can access this dataset by loading the dslabs library and loading the murders dataset using the data function:
library(dslabs)
data(murders)
To see that this is in fact a data frame, we type:
class(murders)
[1] "data.frame"
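You will mostly load data frames rather than type them in, but building a tiny one by hand can make the "rows are observations, columns are variables" idea concrete. The sketch below uses made-up data; the object name grades and its columns are invented for this example.

```r
# A made-up table: three observations (rows), three variables (columns)
# Each column has its own type: character, numeric, and logical
grades <- data.frame(
  student = c("Ana", "Ben", "Cy"),
  score   = c(91, 78, 85),
  passed  = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE  # keep student as character in older versions of R
)

class(grades)
grades
```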
Installing Packages
Woah, there – data(murders) gave me an error! Well, the data, like many functions, are part of a package. Here, it’s the dslabs package. We need to install the package before we can use its functions or data.
To install a new package, we use install.packages('packageName') (where hopefully-obviously “packageName” is the name of the package you want to install). We type this once directly into the console, which will add the package to our computer permanently (unless you delete it). Then, to use it thereafter, we use only library(dslabs).
Note that there are quotes when using install.packages but no quotes when using library.
Examining an object
The function str is useful for finding out more about the structure of an object:
str(murders)
'data.frame': 51 obs. of 5 variables:
$ state : chr "Alabama" "Alaska" "Arizona" "Arkansas" ...
$ abb : chr "AL" "AK" "AZ" "AR" ...
$ region : Factor w/ 4 levels "Northeast","South",..: 2 4 4 2 4 4 1 2 2 2 ...
$ population: num 4779736 710231 6392017 2915918 37253956 ...
$ total : num 135 19 232 93 1257 ...
This tells us much more about the object. We see that the table has 51 rows (50 states plus DC) and five variables. We can show the first six lines using the function head:
head(murders)
state abb region population total
1 Alabama AL South 4779736 135
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
4 Arkansas AR South 2915918 93
5 California CA West 37253956 1257
6 Colorado CO West 5029196 65
In this dataset, each state is considered an observation and five variables are reported for each state.
Before we go any further in answering our original question about different states, let’s learn more about the components of this object.
The accessor: $
For our analysis, we will need to access the different variables represented by columns included in this data frame. To do this, we use the accessor operator $ in the following way:
murders$population
[1] 4779736 710231 6392017 2915918 37253956 5029196 3574097 897934
[9] 601723 19687653 9920000 1360301 1567582 12830632 6483802 3046355
[17] 2853118 4339367 4533372 1328361 5773552 6547629 9883640 5303925
[25] 2967297 5988927 989415 1826341 2700551 1316470 8791894 2059179
[33] 19378102 9535483 672591 11536504 3751351 3831074 12702379 1052567
[41] 4625364 814180 6346105 25145561 2763885 625741 8001024 6724540
[49] 1852994 5686986 563626
But how did we know to use population? Previously, by applying the function str to the object murders, we revealed the names for each of the five variables stored in this table. We can quickly access the variable names using:
names(murders)
[1] "state" "abb" "region" "population" "total"
It is important to know that the order of the entries in murders$population preserves the order of the rows in our data table. This will later permit us to manipulate one variable based on the results of another. For example, we will be able to order the state names by the number of murders.
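Here is a small preview of why that shared row order matters, using a made-up data frame as a stand-in for murders (the names toy and i are invented for this sketch). It borrows square-bracket subsetting, which is covered further down, and the function which.max, which returns the position of the largest value.

```r
# A made-up data frame standing in for murders
toy <- data.frame(
  state = c("A", "B", "C"),
  total = c(12, 40, 7),
  stringsAsFactors = FALSE
)

# Position of the largest total ...
i <- which.max(toy$total)
i

# ... picks out the matching state, because both columns share row order
toy$state[i]
```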
Tip: R comes with a very nice auto-complete functionality that saves us the trouble of typing out all the names. Try typing murders$p then hitting the tab key on your keyboard. This functionality and many other useful auto-complete features are available when working in RStudio.
Vectors: numerics, characters, and logical
The object murders$population is not one number but several. We call these types of objects vectors. A single number is technically a vector of length 1, but in general we use the term vectors to refer to objects with several entries. The function length tells you how many entries are in the vector:
pop <- murders$population
length(pop)
[1] 51
This particular vector is numeric since population sizes are numbers:
class(pop)
[1] "numeric"
In a numeric vector, every entry must be a number.
To store character strings, vectors can also be of class character. For example, the state names are characters:
class(murders$state)
[1] "character"
As with numeric vectors, all entries in a character vector need to be a character.
Another important type of vector is the logical vector. Its entries must be either TRUE or FALSE.
z <- 3 == 2
z
[1] FALSE
class(z)
[1] "logical"
Here the == is a relational operator asking if 3 is equal to 2. In R, if you just use one =, you actually assign a variable, but if you use two == you test for equality. Yet another reason to avoid assigning via =… it can get confusing and typos can really mess things up.
You can ask if multiple things in a vector are equal to one specific thing:
murders$total == 5
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[25] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE
[37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[49] FALSE FALSE TRUE
That gives you 51 answers to the question “is this value of murders$total equal to 5?” The answer is a vector of logicals.
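A long run of TRUE and FALSE is hard to read by eye, so it often helps to summarize it. The sketch below uses a short made-up vector (totals and is_five are names invented here) along with two base R functions: sum, which treats TRUE as 1 and FALSE as 0, and which, which reports the positions that are TRUE.

```r
# A made-up vector of totals
totals <- c(5, 12, 5, 30)

# One TRUE/FALSE per entry
is_five <- totals == 5
is_five

# TRUE counts as 1, so this is the number of matches
sum(is_five)

# Positions of the matches
which(is_five)
```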
You can see the other relational operators by typing:
?Comparison
In future sections, you will see how useful relational operators can be.
We discuss more important features of vectors after the next set of exercises.
Advanced: Mathematically, the values in pop are integers and there is an integer class in R. However, by default, numbers are assigned class numeric even when they are round integers. For example, class(1) returns numeric. You can turn them into class integer with the as.integer() function or by adding an L like this: 1L. Note the class by typing: class(1L)
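The three cases mentioned in that note, side by side, as a quick sketch you can paste into the console:

```r
class(1)              # a plain number is numeric by default
class(1L)             # the L suffix makes it an integer
class(as.integer(1))  # as.integer() converts a numeric to an integer

# The two still compare as equal in value
1 == 1L
```

For nearly everything in this class the distinction won't matter, but it explains why you will sometimes see an L after numbers in other people's code.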
Factors
In the murders dataset, we might expect the region to also be a character vector. However, it is not:
class(murders$region)
[1] "factor"
It is a factor. Factors are useful for storing categorical data. We can see that there are only 4 regions by using the levels function:
levels(murders$region)
[1] "Northeast" "South" "North Central" "West"
In the background, R stores these levels as integers and keeps a map to keep track of the labels. This is more memory efficient than storing all the characters. It is also useful for computational reasons we’ll explore later.
Note that the levels have an order that is different from the order of appearance in the factor object. The default in R is for the levels to follow alphabetical order. However, often we want the levels to follow a different order. You can specify an order through the levels argument when creating the factor with the factor function. For example, in the murders dataset regions are ordered from east to west. The function reorder lets us change the order of the levels of a factor variable based on a summary computed on a numeric vector. We will demonstrate this with a simple example, and will see more advanced ones in the Data Visualization part of the book.
Suppose we want the levels of the region ordered by the total number of murders rather than in alphabetical order. If there are values associated with each level, we can use reorder and specify a data summary to determine the order. The following code takes the sum of the total murders in each region, and reorders the factor following these sums.
region <- murders$region
value <- murders$total
region <- reorder(region, value, FUN = sum)
levels(region)
[1] "Northeast" "North Central" "West" "South"
The new order is in agreement with the fact that the Northeast has the fewest murders and the South has the most.
Warning: Factors can be a source of confusion since sometimes they behave like characters and sometimes they do not. As a result, confusing factors and characters is a common source of bugs.
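Two quick illustrations, using made-up vectors rather than the murders data (sizes, sizes_f, and years are names invented for this sketch). The first shows the levels argument of factor mentioned above. The second shows one well-known way factors and characters get confused: as.numeric applied to a factor returns the integer codes R stores in the background, not the labels you see.

```r
# A made-up character vector of sizes
sizes <- c("small", "large", "medium", "small")

# Default: levels are sorted alphabetically
levels(factor(sizes))

# Explicit order via the levels argument
sizes_f <- factor(sizes, levels = c("small", "medium", "large"))
levels(sizes_f)

# Pitfall: as.numeric() on a factor returns the stored integer codes
years <- factor(c("2010", "2012", "2010"))
as.numeric(years)

# Converting to character first recovers the actual numbers
as.numeric(as.character(years))
```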
Lists
Data frames are a special case of lists. We will cover lists in more detail later, but know that they are useful because you can store any combination of different types. In a data.frame, all columns have to be vectors of the same length (equal to the number of rows in the data.frame). In a list, each item can be of any length and of any type. Below is an example of a list we created for you:
record
$name
[1] "John Doe"
$student_id
[1] 1234
$grades
[1] 95 82 91 97 93
$final_grade
[1] "A"
class(record)
[1] "list"
As with data frames, you can extract the components of a list with the accessor $. In fact, data frames are a type of list.
record$student_id
[1] 1234
We can also use double square brackets ([[) like this:
record[["student_id"]]
[1] 1234
You should get used to the fact that in R there are often several ways to do the same thing, such as accessing entries.4
You might also encounter lists without variable names.
record2
[[1]]
[1] "John Doe"
[[2]]
[1] 1234
If a list does not have names, you cannot extract the elements with $, but you can still use the brackets method and instead of providing the variable name, you provide the list index, like this:
record2[[1]]
[1] "John Doe"
We won’t be using lists until later, but you might encounter one in your own exploration of R. For this reason, we show you some basics here.
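The notes show record and record2 without the code that built them. If you want to follow along in your own console, one way to create lists that print like the ones above is with the list function. This is a sketch reconstructed from the printed output, not necessarily the exact code used to make them.

```r
# A named list: each item can have its own type and length
record <- list(
  name        = "John Doe",
  student_id  = 1234,
  grades      = c(95, 82, 91, 97, 93),
  final_grade = "A"
)

# Leaving out the names gives an unnamed list like record2
record2 <- list("John Doe", 1234)

class(record)
record$student_id
record[["student_id"]]
record2[[1]]
```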
Matrices
Matrices are another type of object that is common in R. Matrices are similar to data frames in that they are two-dimensional: they have rows and columns. However, like numeric, character and logical vectors, entries in matrices have to be all the same type. For this reason data frames are much more useful for storing data, since we can have characters, factors, and numbers in them.
Yet matrices have a major advantage over data frames: we can perform matrix algebra operations, a powerful type of mathematical technique. We do not describe these operations in this class, but much of what happens in the background when you perform a data analysis involves matrices. We describe them briefly here since some of the functions we will learn return matrices.
We can define a matrix using the matrix function. We need to specify the number of rows and columns.
mat <- matrix(1:12, 4, 3)
mat
[,1] [,2] [,3]
[1,] 1 5 9
[2,] 2 6 10
[3,] 3 7 11
[4,] 4 8 12
You can access specific entries in a matrix using square brackets ([). If you want the second row, third column, you use:
mat[2, 3]
[1] 10
If you want the entire second row, you leave the column spot empty:
mat[2, ]
[1] 2 6 10
Notice that this returns a vector, not a matrix.
Similarly, if you want the entire third column, you leave the row spot empty:
mat[, 3]
[1] 9 10 11 12
This is also a vector, not a matrix.
You can access more than one column or more than one row if you like. This will give you a new matrix.
mat[, 2:3]
[,1] [,2]
[1,] 5 9
[2,] 6 10
[3,] 7 11
[4,] 8 12
You can subset both rows and columns:
mat[1:2, 2:3]
[,1] [,2]
[1,] 5 9
[2,] 6 10
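We saw that pulling out a single row or column gives back a plain vector. If you ever need the result to stay a matrix, base R's square brackets accept an extra argument, drop = FALSE. A minimal sketch, using dim to check whether the result still has rows and columns:

```r
mat <- matrix(1:12, 4, 3)

# By default a single row comes back as a plain vector, which has no dimensions
dim(mat[2, ])

# drop = FALSE keeps the result as a matrix with one row and three columns
dim(mat[2, , drop = FALSE])
mat[2, , drop = FALSE]
```

You won't need this often in this class, but it is a common fix when a function complains that it expected a matrix.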
We can convert matrices into data frames using the function as.data.frame:
as.data.frame(mat)
V1 V2 V3
1 1 5 9
2 2 6 10
3 3 7 11
4 4 8 12
You can also use single square brackets ([) to access rows and columns of a data frame:
data("murders")
murders[25, 1]
[1] "Mississippi"
murders[2:3, ]
state abb region population total
2 Alaska AK West 710231 19
3 Arizona AZ West 6392017 232
- Install the dslabs package, load the package, and load the US murders dataset.
library(dslabs)
data(murders)
Use the function str to examine the structure of the murders object. Which of the following best describes the variables represented in this data frame?
  - The 51 states
  - The murder rates for all 50 states and DC.
  - The state name, the abbreviation of the state name, the state’s region, and the state’s population and total number of murders for 2010.
  - str shows no relevant information.
- What are the column names used by the data frame for these five variables?
- Use the accessor $ to extract the state abbreviations and assign them to the object a. What is the class of this object?
- Now use the square brackets to extract the state abbreviations and assign them to the object b. Use the identical function to determine if a and b are the same.
- We saw that the region column stores a factor. You can corroborate this by typing:
class(murders$region)
With one line of code, use the function levels and length to determine the number of regions defined by this dataset.
- The function table takes a vector and returns the frequency of each element. You can quickly see how many states are in each region by applying this function. Use this function in one line of code to create a table of states per region.
Sequences
Another useful function for creating vectors generates sequences:
seq(1, 10)
[1] 1 2 3 4 5 6 7 8 9 10
The first argument defines the start, and the second defines the end which is included. The default is to go up in increments of 1, but a third argument lets us tell it how much to jump by:
seq(1, 10, 2)
[1] 1 3 5 7 9
If we want consecutive integers, we can use the following shorthand:
1:10
[1] 1 2 3 4 5 6 7 8 9 10
When we generate consecutive integers this way, R produces integers, not numerics, because they are typically used to index something:
class(1:10)
[1] "integer"
However, if we create a sequence including non-integers, the class changes:
class(seq(1, 10, 0.5))
[1] "numeric"
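The arguments of seq can also be given by name, which is handy when you care about how many values you get rather than the step size. The sketch below uses the by and length.out arguments documented in the seq help page; the object names are made up for this example.

```r
# by: the step size
by_step <- seq(0, 1, by = 0.25)
by_step

# length.out: how many evenly spaced values to produce
by_count <- seq(0, 1, length.out = 5)
by_count

# Counting down works with a negative step
countdown <- seq(10, 1, by = -3)
countdown
```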
Creating Vectors
In R, the most basic objects available to store data are vectors. As we have seen, complex datasets can usually be broken down into components that are vectors. For example, in a data frame, each column is a vector. Here we learn more about this important class.
We can create vectors using the function c, which stands for concatenate. We use c to concatenate entries in the following way:
codes <- c(380, 124, 818)
codes
[1] 380 124 818
We can also create character vectors. We use the quotes to denote that the entries are characters rather than variable names.
country <- c("italy", "canada", "egypt")
In R you can also use single quotes:
country <- c('italy', 'canada', 'egypt')
But be careful not to confuse the single quote ’ with the back quote, which shares a keyboard key with ~.
By now you should know that if you type:
country <- c(italy, canada, egypt)
you receive an error because the variables italy, canada, and egypt are not defined. If we do not use the quotes, R looks for variables with those names and returns an error.
Names
Sometimes it is useful to name the entries of a vector. For example, when defining a vector of country codes, we can use the names to connect the two:
codes <- c(italy = 380, canada = 124, egypt = 818)
codes
italy canada egypt
380 124 818
The object codes continues to be a numeric vector:
class(codes)
[1] "numeric"
but with names:
names(codes)
[1] "italy" "canada" "egypt"
If the use of strings without quotes looks confusing, know that you can use the quotes as well:
codes <- c("italy" = 380, "canada" = 124, "egypt" = 818)
codes
italy canada egypt
380 124 818
There is no difference between this function call and the previous one. This is one of the many ways in which R is quirky compared to other languages.
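Names can also be attached after a vector already exists. The sketch below assumes we start from the unnamed codes and the country vector defined earlier, and uses the base R function names() on the left-hand side of an assignment.

```r
codes <- c(380, 124, 818)
country <- c("italy", "canada", "egypt")

# Attach the country names to the existing numeric vector
names(codes) <- country

codes          # prints the values with italy, canada, egypt as labels
class(codes)   # still "numeric"
```

The result should match the named vector built directly with c(italy = 380, ...).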
Subsetting
We use square brackets to access specific elements of a vector. For the vector codes we defined above, we can access the second element using:
codes[2]
canada
124
You can get more than one entry by using a multi-entry vector as an index:
codes[c(1,3)]
italy egypt
380 818
The sequences defined above are particularly useful if we want to access, say, the first two elements:
codes[1:2]
italy canada
380 124
If the elements have names, we can also access the entries using these names. Below are two examples.
codes["canada"]
canada
124
codes[c("egypt","italy")]
egypt italy
818 380
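As a small sanity check, subsetting by position and subsetting by name should give the same result when they point at the same entries. A minimal sketch (by_position and by_name are illustrative names):

```r
codes <- c(italy = 380, canada = 124, egypt = 818)

by_position <- codes[c(3, 1)]              # third entry, then first
by_name <- codes[c("egypt", "italy")]      # the same entries, by name

by_position
identical(by_position, by_name)   # TRUE
```

Note that the order of the index controls the order of the output, so indexing is also a way to rearrange a vector.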
Subsetting rows and columns
When we have 2 dimensions (rows and columns in a data.frame or matrix) we can subset on either or both. Since we have two dimensions, we have to have room for two subsets in the square brackets. So, we separate them with a comma and subset by [row,col].
For the first five rows and the first two columns of murders:
murders[1:5,1:2]
state abb
1 Alabama AL
2 Alaska AK
3 Arizona AZ
4 Arkansas AR
5 California CA
For the 3rd row, 4th column:
murders[3,4]
[1] 6392017
And, we can refer to columns by the column names (if we’re working with a data.frame):
murders[1:3,'population']
[1] 4779736 710231 6392017
We aren’t limited to a sequence like 1:3 either. We can c() multiple rows or columns. Here, I do both:
murders[c(1,3,51), c('state','abb','population')]
state abb population
1 Alabama AL 4779736
3 Arizona AZ 6392017
51 Wyoming WY 563626
And if we want all the rows or all the columns, we leave the row or column index blank:
murders[10:13,]
state abb region population total
10 Florida FL South 19687653 669
11 Georgia GA South 9920000 376
12 Hawaii HI West 1360301 7
13 Idaho ID West 1567582 12
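If you want to experiment with [row,col] without loading dslabs, the sketch below builds a tiny stand-in data frame. The name small is made up; its three rows copy the values shown above for Alabama, Alaska, and Arizona.

```r
# A three-row stand-in for murders (values copied from the output above)
small <- data.frame(
  state = c("Alabama", "Alaska", "Arizona"),
  abb = c("AL", "AK", "AZ"),
  population = c(4779736, 710231, 6392017)
)

small[2, 3]                         # one cell: row 2, column 3
small[c(1, 3), c("state", "abb")]   # rows 1 and 3, two named columns
small[, "population"]               # blank row index: all rows, one column
small[2, ]                          # blank column index: one row, all columns
```

Asking for a single column returns a plain vector, while asking for a single row still returns a data frame.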
Coercion
In general, coercion is an attempt by R to be flexible with data types. When an entry does not match the expected type, some of the prebuilt R functions try to guess what was meant before throwing an error. This can also lead to confusion. Failing to understand coercion can drive programmers crazy when attempting to code in R since it behaves quite differently from most other languages in this regard. Let’s learn about it with some examples.
We said that vectors must be all of the same type. So if we try to combine, say, numbers and characters, you might expect an error:
x <- c(1, "canada", 3)
But we don’t get one, not even a warning! What happened? Look at x and its class:
x
[1] "1" "canada" "3"
class(x)
[1] "character"
R coerced the data into characters. It guessed that because you put a character string in the vector, you meant the 1 and 3 to actually be character strings "1" and "3". The fact that not even a warning is issued is an example of how coercion can cause many unnoticed errors in R.
R also offers functions to change from one type to another. For example, you can turn numbers into characters with:
x <- 1:5
y <- as.character(x)
y
[1] "1" "2" "3" "4" "5"
You can turn it back with as.numeric:
as.numeric(y)
[1] 1 2 3 4 5
This function is actually quite useful since datasets that include numbers as character strings are common.
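Here is a short sketch of a few more coercion cases. It also shows one subtlety of the round trip above: as.numeric gives back the same values, but as numerics rather than the integers we started with.

```r
class(c(1, TRUE))    # "numeric": TRUE is coerced to 1
class(c(1, "a"))     # "character": the number is coerced to "1"
c(1, TRUE, "a")      # "1" "TRUE" "a"

x <- 1:5
y <- as.numeric(as.character(x))
class(x)             # "integer"
class(y)             # "numeric"
all(x == y)          # TRUE: the values agree
identical(x, y)      # FALSE: the types differ
```

In practice the all(x == y) comparison is usually what you care about; identical() is stricter and also compares types.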
Not availables (NA)
This “topic” seems to be wholly unappreciated and it has been our experience that students often panic when encountering an NA. This often happens when a function tries to coerce one type to another and encounters an impossible case. In such circumstances, R usually gives us a warning and turns the entry into a special value called an NA (for “not available”). For example:
x <- c("1", "b", "3")
as.numeric(x)
[1] 1 NA 3
R does not have any guesses for what number you want when you type b, so it does not try.
While coercion is a common case leading to NAs, you’ll see them in nearly every real-world dataset. Most often, you will encounter the NAs as a stand-in for missing data. Again, this is a common problem in real-world datasets and you need to be aware that it will come up.
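The sketch below shows how an NA tends to spread through a calculation, and one common way to work around it. It uses suppressWarnings() only to keep the output tidy, and the na.rm argument of mean(); both are base R.

```r
x <- c("1", "b", "3")
y <- suppressWarnings(as.numeric(x))   # "b" cannot be converted, so it becomes NA

y                       # 1 NA 3
is.na(y)                # FALSE TRUE FALSE
mean(y)                 # NA: one missing value makes the whole average missing
mean(y, na.rm = TRUE)   # 2: drop the NA first, then average 1 and 3
```

Whether dropping missing values is appropriate depends on why they are missing, so treat na.rm = TRUE as a choice rather than a default.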
Sorting
Now that we have mastered some basic R knowledge (ha!), let’s try to gain some insights into the safety of different states in the context of gun murders.
sort
Say we want to rank the states from least to most gun murders. The function sort sorts a vector in increasing order. We can therefore see the largest number of gun murders by typing:
library(dslabs)
data(murders)
sort(murders$total)
[1] 2 4 5 5 7 8 11 12 12 16 19 21 22 27 32
[16] 36 38 53 63 65 67 84 93 93 97 97 99 111 116 118
[31] 120 135 142 207 219 232 246 250 286 293 310 321 351 364 376
[46] 413 457 517 669 805 1257
However, this does not give us information about which states have which murder totals. For example, we don’t know which state had 1257.
order
The function order is closer to what we want. It takes a vector as input and returns the vector of indexes that sorts the input vector. This may sound confusing so let’s look at a simple example. We can create a vector and sort it:
x <- c(31, 4, 15, 92, 65)
sort(x)
[1] 4 15 31 65 92
Rather than sort the input vector, the function order returns the index that sorts the input vector:
index <- order(x)
x[index]
[1] 4 15 31 65 92
This is the same output as that returned by sort(x). If we look at this index, we see why it works:
x
[1] 31 4 15 92 65
order(x)
[1] 2 3 1 5 4
The second entry of x is the smallest, so order(x) starts with 2. The next smallest is the third entry, so the second entry is 3 and so on.
How does this help us order the states by murders? First, remember that the entries of vectors you access with $ follow the same order as the rows in the table. For example, these two vectors containing state names and abbreviations, respectively, are matched by their order:
murders$state[1:6]
[1] "Alabama" "Alaska" "Arizona" "Arkansas" "California"
[6] "Colorado"
murders$abb[1:6]
[1] "AL" "AK" "AZ" "AR" "CA" "CO"
This means we can order the state names by their total murders. We first obtain the index that orders the vectors according to murder totals and then index the state names vector:
ind <- order(murders$total)
murders$abb[ind]
[1] "VT" "ND" "NH" "WY" "HI" "SD" "ME" "ID" "MT" "RI" "AK" "IA" "UT" "WV" "NE"
[16] "OR" "DE" "MN" "KS" "CO" "NM" "NV" "AR" "WA" "CT" "WI" "DC" "OK" "KY" "MA"
[31] "MS" "AL" "IN" "SC" "TN" "AZ" "NJ" "VA" "NC" "MD" "OH" "MO" "LA" "IL" "GA"
[46] "MI" "PA" "NY" "FL" "TX" "CA"
According to the above, California had the most murders.
If we wanted to re-order the whole data.frame based on the murders$total index, and save the result under a new name:
murders_ordered = murders[ind,]
This saves a copy of murders in a new data.frame called murders_ordered that is in the order defined by ind. The original murders is left unchanged.
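The same pattern works on any data frame. The sketch below uses a made-up three-row table called small (values copied from the murders rows shown earlier) and also tries the decreasing argument of order, which reverses the direction.

```r
# A tiny stand-in table (values copied from the murders rows shown earlier)
small <- data.frame(
  abb = c("AL", "AK", "AZ"),
  population = c(4779736, 710231, 6392017)
)

ind <- order(small$population)   # 2 1 3: Alaska is smallest, Arizona largest
small_ordered <- small[ind, ]    # whole rows move together
small_ordered

# decreasing = TRUE gives largest first
ind_desc <- order(small$population, decreasing = TRUE)
small$abb[ind_desc]              # "AZ" "AL" "AK"
```

Because the index is applied to whole rows, the abbreviations stay matched with their populations.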
max and which.max
If we are only interested in the entry with the largest value, we can use max for the value:
max(murders$total)
[1] 1257
and which.max for the index of the largest value:
i_max <- which.max(murders$total)
murders$state[i_max]
[1] "California"
For the minimum, we can use min and which.min in the same way.
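A quick sketch on the small vector from the order example shows how the value functions and the index functions fit together:

```r
x <- c(31, 4, 15, 92, 65)

max(x)            # 92
which.max(x)      # 4: the largest value sits in position 4
x[which.max(x)]   # 92 again

min(x)            # 4
which.min(x)      # 2
```

If the largest value appears more than once, which.max returns only the first position where it occurs.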
Does this mean California is the most dangerous state? In an upcoming section, we argue that we should be considering rates instead of totals. Before doing that, we introduce one last order-related function: rank.
rank
Although not as frequently used as order and sort, the function rank is also related to order and can be useful. For any given vector it returns a vector with the rank of the first entry, second entry, etc., of the input vector. Here is a simple example:
x <- c(31, 4, 15, 92, 65)
rank(x)
[1] 3 1 2 5 4
To summarize, let’s look at the results of the three functions we have introduced:
original | sort | order | rank |
---|---|---|---|
31 | 4 | 2 | 3 |
4 | 15 | 3 | 1 |
15 | 31 | 1 | 2 |
92 | 65 | 5 | 5 |
65 | 92 | 4 | 4 |
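The table above can be reproduced in one line, which is a handy way to convince yourself of what each function returns. The name comparison is just for illustration.

```r
x <- c(31, 4, 15, 92, 65)

# One column per function, side by side
comparison <- data.frame(
  original = x,
  sort = sort(x),
  order = order(x),
  rank = rank(x)
)
comparison
```

Reading across a row: order tells you where to look in the original to get the sorted value in that row, while rank tells you where the original value in that row would land after sorting.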
Beware of recycling
Another common source of unnoticed errors in R is the use of recycling. We saw that vectors are added elementwise. So if the vectors don’t match in length, it is natural to assume that we should get an error. But we don’t. Notice what happens:
x <- c(1,2,3)
y <- c(10, 20, 30, 40, 50, 60, 70)
x+y
Warning in x + y: longer object length is not a multiple of shorter object
length
[1] 11 22 33 41 52 63 71
We do get a warning, but no error. For the output, R has recycled the numbers in x. Notice the last digit of numbers in the output.
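The warning only appears when the longer length is not a multiple of the shorter one. When it is a multiple, R recycles silently, which is the more dangerous case. A minimal sketch:

```r
x <- c(1, 2)
y <- c(10, 20, 30, 40)

# x is reused as 1, 2, 1, 2 and no warning is issued
x + y   # 11 22 31 42

# A cheap guard if you expected the lengths to match
length(x) == length(y)   # FALSE
```

Checking lengths before combining vectors is an easy habit that catches this kind of silent mistake.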
TRY IT
For these exercises we will use the US murders dataset. Make sure you load it prior to starting.
library(dslabs)
data("murders")
1. Use the $ operator to access the population size data and store it as the object pop. Then use the sort function to redefine pop so that it is sorted. Finally, use the [ operator to report the smallest population size.
2. Now instead of the smallest population size, find the index of the entry with the smallest population size. Hint: use order instead of sort.
3. We can actually perform the same operation as in the previous exercise using the function which.min. Write one line of code that does this.
4. Now we know how small the smallest state is and we know which row represents it. Which state is it? Define a variable states to be the state names from the murders data frame. Report the name of the state with the smallest population.
5. You can create a data frame using the data.frame function. Here is a quick example:
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro",
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
Use the rank function to determine the population rank of each state from smallest population size to biggest. Save these ranks in an object called ranks, then create a data frame with the state name and its rank. Call the data frame my_df.
6. Repeat the previous exercise, but this time order my_df so that the states are ordered from least populous to most populous. Hint: create an object ind that stores the indexes needed to order the population values. Then use the bracket operator [ to re-order each column in the data frame.
7. The na_example vector represents a series of counts. You can quickly examine the object using:
data("na_example")
str(na_example)
int [1:1000] 2 1 3 2 1 3 1 4 3 2 ...
However, when we compute the average with the function mean, we obtain an NA:
mean(na_example)
[1] NA
The is.na function returns a logical vector that tells us which entries are NA. Assign this logical vector to an object called ind and determine how many NAs na_example has. Note that TRUE=1 and FALSE=0 when “coerced”.
Vector arithmetics
California had the most murders, but does this mean it is the most dangerous state? What if it just has many more people than any other state? We can quickly confirm that California indeed has the largest population:
library(dslabs)
data("murders")
murders$state[which.max(murders$population)]
[1] "California"
with over 37 million inhabitants. It is therefore unfair to compare the totals if we are interested in learning how safe the state is. What we really should be computing is the murders per capita. The reports we describe in the motivating section used murders per 100,000 as the unit. To compute this quantity, the powerful vector arithmetic capabilities of R come in handy.
Rescaling a vector
In R, arithmetic operations on vectors occur element-wise. For a quick example, suppose we have height in inches:
inches <- c(69, 62, 66, 70, 70, 73, 67, 73, 67, 70)
and want to convert to centimeters. Notice what happens when we multiply inches by 2.54:
inches * 2.54
[1] 175.26 157.48 167.64 177.80 177.80 185.42 170.18 185.42 170.18 177.80
In the line above, we multiplied each element by 2.54. Similarly, if for each entry we want to compute how many inches taller or shorter than 69 inches, the average height for males, we can subtract it from every entry like this:
inches - 69
[1] 0 -7 -3 1 1 4 -2 4 -2 1
Two vectors
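If we have two vectors of the same length, and we sum them in R, they will be added entry by entry as follows:
$$
\begin{pmatrix} a \\ b \\ c \\ d \end{pmatrix} + \begin{pmatrix} e \\ f \\ g \\ h \end{pmatrix} = \begin{pmatrix} a + e \\ b + f \\ c + g \\ d + h \end{pmatrix}
$$
The same holds for other mathematical operations, such as -, * and /.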
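Here is a minimal sketch of entry-by-entry arithmetic with two made-up vectors of the same length (a and b are illustrative names):

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

a + b   # 11 22 33 44
b - a   #  9 18 27 36
a * b   # 10 40 90 160
b / a   # 10 10 10 10
```

Each output entry depends only on the two input entries in the same position.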
This implies that to compute the murder rates we can simply type:
murder_rate <- murders$total / murders$population * 100000
Once we do this, we notice that California is no longer near the top of the list. In fact, we can use what we have learned to order the states by murder rate:
murders$abb[order(murder_rate)]
[1] "VT" "NH" "HI" "ND" "IA" "ID" "UT" "ME" "WY" "OR" "SD" "MN" "MT" "CO" "WA"
[16] "WV" "RI" "WI" "NE" "MA" "IN" "KS" "NY" "KY" "AK" "OH" "CT" "NJ" "AL" "IL"
[31] "OK" "NC" "NV" "VA" "AR" "TX" "NM" "CA" "FL" "TN" "PA" "AZ" "GA" "MS" "MI"
[46] "DE" "SC" "MD" "MO" "LA" "DC"
Right now, the murder_rate object isn’t in the murders data.frame, but we know it’s the right length (why?). So we can add it:
murders$rate = murder_rate
Note that now, we have two copies of the same vector of numbers: one called murder_rate floatin’ around in our environment, and another in our murders data.frame with the column name rate. If we re-order murder_rate, it won’t affect anything in murders$rate and vice versa.
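To see that the two copies really are independent, here is a sketch with a made-up three-row table. The names toy and toy_rate and all of the numbers are hypothetical and chosen only to keep the arithmetic easy.

```r
# Hypothetical totals and populations, not real data
toy <- data.frame(
  total = c(5, 20, 12),
  population = c(100000, 200000, 400000)
)

toy_rate <- toy$total / toy$population * 100000   # free-floating vector
toy$rate <- toy_rate                              # copy stored as a column

# Re-order only the free-floating copy
toy_rate <- sort(toy_rate)

toy_rate   # now in increasing order
toy$rate   # still in the original row order
```

After the sort, the free-floating vector no longer lines up with the rows of the table, which is a good reason to keep derived values inside the data frame.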
1. Previously we created this data frame:
temp <- c(35, 88, 42, 84, 81, 30)
city <- c("Beijing", "Lagos", "Paris", "Rio de Janeiro",
          "San Juan", "Toronto")
city_temps <- data.frame(name = city, temperature = temp)
Remake the data frame using the code above, but add a line that converts the temperature from Fahrenheit to Celsius. The conversion is $C = \frac{5}{9} \times (F - 32)$.
2. Write code to compute the following sum: $1 + 1/2^2 + 1/3^2 + \dots + 1/100^2$. Hint: thanks to Euler, we know it should be close to $\pi^2/6$.
3. Compute the per 100,000 murder rate for each state and store it in a new column called murder_rate. Then compute the average murder rate for the US using the function mean. What is the average?
Indexing
Indexing is a boring name for an important tool. R provides a powerful and convenient way of referencing specific elements of vectors. We can, for example, subset a vector based on properties of another vector. In this section, we continue working with our US murders example from before.
Subsetting with logicals
Imagine you are moving from Italy where, according to an ABC news report, the murder rate is only 0.71 per 100,000. You would prefer to move to a state with a similar murder rate. Another powerful feature of R is that we can use logicals to index vectors. If we compare a vector to a single number, it actually performs the test for each entry. The following is an example related to the question above:
ind <- murder_rate < 0.71
If we instead want to know if a value is less or equal, we can use:
ind <- murder_rate <= 0.71
Note that we get back a logical vector with TRUE for each entry smaller than or equal to 0.71. To see which states these are, we can leverage the fact that vectors can be indexed with logicals.
murders$state[ind]
[1] "Hawaii" "Iowa" "New Hampshire" "North Dakota"
[5] "Vermont"
To count how many entries are TRUE, we can use the function sum. It returns the sum of the entries of a vector, and logical vectors get coerced to numeric with TRUE coded as 1 and FALSE as 0. Thus we can count the states using:
sum(ind)
[1] 5
Since ind has the same length as all of the columns in murders, it can be used as a row index. When used as a row index, it will return all the rows for which the condition was true. If we use this, leaving the column index blank (for all columns):
murders[ind,]
state abb region population total rate
12 Hawaii HI West 1360301 7 0.5145920
16 Iowa IA North Central 3046355 21 0.6893484
30 New Hampshire NH Northeast 1316470 5 0.3798036
35 North Dakota ND North Central 672591 4 0.5947151
46 Vermont VT Northeast 625741 2 0.3196211
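The same steps can be tried on a small made-up table. In the sketch below, toy, its state labels, and its rates are all hypothetical.

```r
# Hypothetical states and rates, not real data
toy <- data.frame(
  state = c("A", "B", "C", "D"),
  rate = c(0.5, 2.8, 0.7, 1.3)
)

ind <- toy$rate <= 0.71   # one TRUE/FALSE per row
ind                       # TRUE FALSE TRUE FALSE

toy$state[ind]            # the states that pass the test
sum(ind)                  # 2: how many pass
toy[ind, ]                # the full rows that pass
```

The logical vector acts like a filter: positions marked TRUE are kept and positions marked FALSE are dropped.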
Logical operators
Suppose we like the mountains and we want to move to a safe state in the western region of the country. We want the murder rate to be at most 1. In this case, we want two different things to be true. Here we can use the logical operator and, which in R is represented with &. This operation results in TRUE only when both logicals are TRUE. To see this, consider this example:
TRUE & TRUE
[1] TRUE
TRUE & FALSE
[1] FALSE
FALSE & FALSE
[1] FALSE
For our example, we can form two logicals:
west <- murders$region == "West"
safe <- murder_rate <= 1
and we can use the & to get a vector of logicals that tells us which states satisfy both conditions:
ind <- safe & west
murders$state[ind]
[1] "Hawaii" "Idaho" "Oregon" "Utah" "Wyoming"
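Like arithmetic, & works entry by entry on whole vectors. The sketch below uses two short made-up logical vectors (toy_west and toy_safe are hypothetical names), and also shows the related operators | (or) and ! (not).

```r
# Hypothetical logical vectors, one entry per imaginary state
toy_west <- c(TRUE, TRUE, FALSE, FALSE)
toy_safe <- c(TRUE, FALSE, TRUE, FALSE)

toy_west & toy_safe   # TRUE FALSE FALSE FALSE: both must be TRUE
toy_west | toy_safe   # TRUE TRUE TRUE FALSE: at least one is TRUE
!toy_west             # FALSE FALSE TRUE TRUE: flips each entry
```

Combining these lets you express conditions such as "in the West and safe" or "not in the West" as a single logical index.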
which
Suppose we want to look up California’s murder rate. For this type of operation, it is convenient to convert vectors of logicals into indexes instead of keeping long vectors of logicals. The function which tells us which entries of a logical vector are TRUE. So we can type:
ind <- which(murders$state == "California")
murders$rate[ind]
[1] 3.374138
%in%
If rather than an index we want a logical that tells us whether or not each element of a first vector is in a second, we can use the function %in%. Let’s imagine you are not sure if Boston, Dakota, and Washington are states. You can find out like this:
c("Boston", "Dakota", "Washington") %in% murders$state
[1] FALSE FALSE TRUE
Note that we will be using %in% often throughout the course.
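To compare the lookup tools side by side, here is a sketch on a short made-up vector of state names. It includes match, a base R function that returns the position of each element of a first vector within a second.

```r
# A short stand-in for murders$state
toy_states <- c("Alabama", "Alaska", "Arizona")

which(toy_states == "Alaska")                # 2: position where the test is TRUE
c("Boston", "Alaska") %in% toy_states        # FALSE TRUE: is each one present?
match(c("Arizona", "Alabama"), toy_states)   # 3 1: where each one is found
match("Boston", toy_states)                  # NA: not found
```

In short, which converts a logical vector to positions, %in% answers yes or no for each element, and match reports where each element sits.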
Start by loading the library and data.
library(dslabs)
data(murders)
Note that every time you run this, you replace murders in your environment. So if you had created the murders$rate column, it’s gone. But if you created the free-floating object murder_rate, that still exists (but can be overwritten).
1. Compute the per 100,000 murder rate for each state and store it in an object called murder_rate. Then use logical operators to create a logical vector named low that tells us which entries of murder_rate are lower than 1.
2. Now use the results from the previous exercise and the function which to determine the indices of murder_rate associated with values lower than 1.
3. Use the results from the previous exercise to report the names of the states with murder rates lower than 1.
4. Now extend the code from exercises 2 and 3 to report the states in the Northeast with murder rates lower than 1. Hint: use the previously defined logical vector low and the logical operator &.
5. In a previous exercise we computed the murder rate for each state and the average of these numbers. How many states are below the average?
6. Use the match function to identify the states with abbreviations AK, MI, and IA. Hint: start by defining an index of the entries of murders$abb that match the three abbreviations, then use the [ operator to extract the states.
7. Use the %in% operator to create a logical vector that answers the question: which of the following are actual abbreviations: MA, ME, MI, MO, MU?
8. Extend the code you used in exercise 7 to report the one entry that is not an actual abbreviation. Hint: use the ! operator, which turns FALSE into TRUE and vice versa, then which to obtain an index.
Further help with R
If you are not comfortable with R, the earlier you seek out help, the better. Quietly letting the course pass by you because you don’t know how to fix an error will do nobody any good. Attend the TA’s or Prof. Bushong’s office hours (see the Syllabus for times and Zoom links). Also, join the course Slack (see the front page of our course website for a link) and post questions there.
Finally, there are also primers on Rstudio.cloud that can be useful. There are many ways we can help you get used to R, but only if you reach out.
Footnotes
1. Comments from previous classes indicate that I am not, in fact, funny.
2. This is, without a doubt, my least favorite aspect of R. I’d even venture to call it stupid. The logic behind this pesky <- is a total mystery to me, but there is logic to avoiding =. But, you do you.
3. This equals sign is the reason we assign values with <-; then when arguments of a function are assigned values, we don’t end up with multiple equals signs. But… who cares.
4. Whether you view this as a feature or a bug is a good indicator whether you’ll enjoy working with R.