13: Applied Logistic Regression - Classification

This assignment is due on Monday, November 24th

All assignments are due on D2L by 11:59pm on the due date. Late work is not accepted. You do not need to submit your .rmd file - just the properly-knitted PDF. All assignments must be properly rendered to PDF using LaTeX. Make sure you start your assignment early enough to have time to address rendering issues. Come to office hours or use the course Slack if you have issues. Using an RStudio instance on posit.cloud is always a feasible alternative. Remember, if you use any AI for coding, you must comment each line with your own interpretation of what that line of code does.

Backstory and Set Up

You work for a bank. This bank is trying to predict defaults on loans (a relatively uncommon occurrence, but one that costs the bank a great deal of money when it does happen). They've given you a dataset on defaults (encoded as the variable y, and not the column called default). You're going to try to predict this.

This is some new data. The snippet below loads it.

```r
library(dplyr)  # needed for the %>% pipe and select()

bank <- read.table("https://ec242.netlify.app/data/bank.csv",
                   header = TRUE,
                   sep = ",") %>%
  dplyr::select(-default)
```
There’s not going to be a whole lot of wind-up here. You should be well-versed in doing these sorts of things by now (if not, look back at the previous lab for sample code).
EXERCISE 1 of 1
- Encode the outcome we're trying to predict (y, and not default) as a binary. Drop the default column.
- Check your data for any NAs (as should be customary).
- Split the data into an 80/20 train vs. test split. Make sure you explicitly set the seed for replicability.
- Run a series of logistic regressions with between 1 and 4 predictors of your choice (you can use interactions).
- Create eight total confusion matrices: four by applying your models to the training data, and four by applying your models to the test data. Briefly discuss your findings. How do the error rate, sensitivity, and specificity change as the number of predictors increases?
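One way the steps above might be sketched in R, assuming the bank data frame from the loading snippet and that y is coded "yes"/"no" (check yours with table(bank$y) first). The predictors age, balance, and duration are assumed column names for illustration only; substitute your own choices. The seed value is arbitrary.

```r
library(dplyr)

set.seed(242)  # any fixed seed works; 242 is arbitrary

# Recode the outcome as 0/1 (assumes y is "yes"/"no" -- check first)
bank <- bank %>% mutate(y = ifelse(y == "yes", 1, 0))

# Check for missing values, column by column
colSums(is.na(bank))

# 80/20 train/test split
train_idx <- sample(nrow(bank), size = floor(0.8 * nrow(bank)))
train <- bank[train_idx, ]
test  <- bank[-train_idx, ]

# Logistic regressions with 1-4 predictors ("age", "balance", and
# "duration" are assumed column names -- substitute your own)
m1 <- glm(y ~ age,                      data = train, family = binomial)
m2 <- glm(y ~ age + balance,            data = train, family = binomial)
m3 <- glm(y ~ age + balance + duration, data = train, family = binomial)
m4 <- glm(y ~ age + balance + duration + age:balance,
          data = train, family = binomial)

# Confusion matrix plus error rate, sensitivity, and specificity
cm <- function(model, data, cutoff = 0.5) {
  p    <- predict(model, newdata = data, type = "response")
  pred <- factor(ifelse(p > cutoff, 1, 0), levels = c(0, 1))
  tab  <- table(predicted = pred, actual = factor(data$y, levels = c(0, 1)))
  list(table       = tab,
       error_rate  = (tab[2, 1] + tab[1, 2]) / sum(tab),
       sensitivity = tab[2, 2] / sum(tab[, 2]),  # true positives / actual 1s
       specificity = tab[1, 1] / sum(tab[, 1]))  # true negatives / actual 0s
}

cm(m1, train)  # repeat for m1 through m4, on both train and test
cm(m1, test)
```

Coercing both the predictions and the actuals to factors with levels c(0, 1) is what guarantees a 2x2 table even when one class is never predicted.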
A few hints:
- If you are not getting a 2x2 confusion matrix, you might need to adjust your cutoff probability.
- It might appear that your model perfectly predicts the outcome variable when the cutoff probability you set is too high.
- You need to make sure your predictions take the same possible values as the actual data (which, remember, you had to convert to a binary 0/1 variable).
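The last two hints can be seen in a toy example (synthetic vectors, not the bank data): when every case is predicted 0, table() silently drops the empty row, and the fix is to coerce both vectors to factors with the same levels.

```r
# Toy illustration: a rare outcome where every case is predicted
# "no default" (as happens when the cutoff probability is set too high)
pred   <- rep(0, 10)           # every case predicted 0
actual <- c(rep(0, 9), 1)      # one real default

table(predicted = pred, actual = actual)  # only 1 row: not a 2x2 matrix

# Forcing both vectors to factors with the same levels restores the 2x2
pred_f   <- factor(pred,   levels = c(0, 1))
actual_f <- factor(actual, levels = c(0, 1))
table(predicted = pred_f, actual = actual_f)  # full 2x2, with a zero row
```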