<div class="shade_black"  style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;">
<i class="fas fa-exclamation-circle"></i> These slides are best viewed in Chrome and occasionally need to be refreshed if elements do not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>.
</div>

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number split-70 title-slide
count: false

.column.shade_black[.content[

<br>

# .monash-blue.outline-text[ETC1010: Introduction to Data Analysis]

<br>

<h2 style="font-weight:900!important;">Week of Tidy Data + Style</h2>

.bottom_abs.width100[

Lecturer: *Nicholas Tierney*

Department of Econometrics and Business Statistics

<span><i class="fas  fa-envelope faa-float animated "></i></span>  ETC1010.Clayton-x@monash.edu

11th Mar 2020

<br>
]

]]

---
class: transition
# Update on how the class is delivered

---
# How the class will now be delivered: Lectorials

- Lectorials are now recorded using Echo360
- **Do not come into class**; listen to the lectorials online and complete the exercises on RStudio Cloud or locally.

---
# How the class will now be delivered: Lab/quizzes

These will still be posted weekly, but we will give you an extra day or two to complete them.

- We expect you to complete reading quizzes before the lecture starts
  - So, reading quiz 2A should be completed prior to lecture 2A
      - These will be closed shortly after lecture 2A starts (with some leeway as we transition into online classes, to give you all a chance to get used to things)

- Lab quizzes require knowledge from the lecture, so they need to be completed after the lecture
  - So, lab quiz 2A should be completed after lecture 2A
    - Again, with the same leeway as for reading quiz 2A above

---
# How the class will now be delivered

## Assignments

- Assignment 1 will be posted today at the end of class
- Assignments will be submitted online
    - Please get in touch with us (if you haven’t already) if you are a group of 1, or cannot get in touch with your group members.

## Other assessments

- We will update you on this in more detail, but in short, these will be delivered and submitted online

## Consult times

These will now be delivered online via a link to a Zoom meeting, or another online video meeting service.

---
# There is a lot of change

- There is a lot of change in the air, and things might seem uncertain.
- I am committed to helping you all learn how to do data analysis.
- Thank you all for your patience as we have changed this course. We are dealing with daily updates, and need to change on the fly.
- Perhaps now more than ever, it matters to our daily lives that we understand data and can communicate it to others.
- Remember to get your information from reliable sources, like the [WHO](https://www.who.int/emergencies/diseases/novel-coronavirus-2019/advice-for-public), the [Australian Government](https://www.health.gov.au/news/health-alerts/novel-coronavirus-2019-ncov-health-alert), and see the latest data from [Johns Hopkins](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6).

---

# Practice the most effective strategies we know

1. Wash your hands often, practice good cough & sneeze etiquette.
2. Try to touch your face as little as possible (mouth, nose, and eyes).
3. Practice social distancing (no hugs, kisses, handshakes, high fives).
4. Do not attend concerts, stage plays, sporting events, or any other mass entertainment events.
5. Refrain from visiting museums, exhibitions, movie theaters, night clubs, and other entertainment venues.
6. Stay away from social gatherings and events (club meetings, religious services, parties).
7. Reduce travel to a minimum. Don't travel long distances if not absolutely necessary.
8. Do not use public transportation if not absolutely necessary.

---
# Social distancing is hard

- How do we know it works? 
- We have data from a previous pandemic, the 1918 Spanish flu.
- Places that practised social distancing had drastically different numbers from those that did not:

(from [Hatchett et al., 2007](https://www.pnas.org/content/104/18/7582))

---
# There is a lot of change

To brighten things up, here are two YouTubers I’ve been watching lately to de-stress and have “COVID19 free time”:

- [Lofty Pursuits](https://www.youtube.com/watch?v=FYsqZXHVnI8)
- [SteveMRE1989](https://www.youtube.com/watch?v=hkz6kGQWCHU&t=6s)

---
class: transition
# Your Turn: complete class survey
Available now on Ed, "Getting to know our class"

---
class: transition
# How to learn

I want to take some time to discuss ideas on learning, and how it ties into the course.

---
background-image: url(images/how-to-learn-img-page-1.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-2.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-3.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-4.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-5.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-6.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-7.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-8.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-9.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-10.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-11.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(images/how-to-learn-img-page-12.jpg)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
class: transition
# (demo)

---
class: refresher
# recap

.pull-left[
- Traffic Light System: .green[Green = "good!"] ; .red[Red = "Help!"]
- R + RStudio
- Tower of Babel analogy for writing R code
- Functions are ___ ?
- Columns in data frames are accessed with ___ ?
- Packages are installed with ___ ?
- Packages are loaded with ___ ?
]

.pull-right[
- Why do we care about Reproducibility?
- Output + input of rmarkdown
- I have an assignment group
- I have made contact with my assignment group
]

---
# The "pipe" operator - `%>%`

The symbol `%>%` is referred to as the "pipe operator".

What you need to know:

- Read it as "then"
- It passes the output along to the next function

```
# filter first: select() would drop `nationality`, and then filter() could not find it
data %>%  
  filter(nationality == "australian") %>% 
  select(age, height, hair_colour)
```

"
Use the data, THEN 
filter so nationality is equal to "australian", THEN
select the variables (columns) `age`, `height`, and `hair_colour`
"

That is all you need to know for the moment, but you can read [more here](https://r4ds.had.co.nz/pipes.html)
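
To see what "then" means in practice, here is a minimal sketch on a small made-up data frame (`people` and its values are assumptions for illustration, not course data). The piped and the nested versions should give the same result; the pipe just lets you read the steps in order.

```r
library(dplyr)

# A tiny made-up data set, for illustration only
people <- tibble(
  age = c(21, 35, 48),
  height = c(170, 182, 165),
  nationality = c("australian", "french", "australian")
)

# With the pipe: take people, THEN filter, THEN select
piped <- people %>%
  filter(nationality == "australian") %>%
  select(age, height)

# Without the pipe: the same steps, read from the inside out
nested <- select(filter(people, nationality == "australian"), age, height)

# Compare the two results
all.equal(piped, nested)
```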

---
# Problem solving (demo)

Some common questions you can ask yourself when something isn't working:

- Have I got my data?
- Does the thing exist? (Check environment)
- Have I run the code from the top down to where I am now?
- Did none of that work? (Now Restart R)
- Is the column I want there?
- Try using quotes "", or no quotes, or (last resort) backticks
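
A few of these checks can also be run as code. This is a rough sketch using base R only; the data frame `survey` is made up for illustration.

```r
# A tiny made-up data frame, for illustration only
survey <- data.frame(age = c(21, 35), hair_colour = c("brown", "black"))

# Does the thing exist? (same idea as checking the Environment pane)
exists("survey")

# Is the column I want there?
names(survey)
"hair_colour" %in% names(survey)

# Quotes, no quotes, backticks
survey[["age"]]   # quotes: the column name as text
survey$age        # no quotes: the column name as a name
survey$`age`      # backticks: needed for unusual names, e.g. ones with spaces
```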

---

# Style guide

> "Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." -- Hadley Wickham

- Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/
- There's more to it than what we'll cover today; we'll mention more as we introduce more functionality, and do a recap later in the semester

---
# File names and code chunk labels

- Do not use spaces in file names, use `-` or `_` to separate words
- Use all lowercase letters

```r
# Good
ucb-admit.csv

# Bad
UCB Admit.csv
```

---
# Object names

- Use `_` to separate words in object names
- Use informative but short object names
- Do not reuse object names within an analysis

```r
# Good
acs_employed

# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males
```

---
# Spacing

- Put a space before and after all infix operators (=, +, -, <-, etc.), and when naming arguments in function calls. 
- Always put a space after a comma, and never before (just like in regular English).

```r
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```

---
# ggplot

- When a plot spans several lines, always end a line with `+` (never start the next line with it)
- Always indent the next line

```r
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()

# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()
```

---
# Long lines

- Limit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font.
- Take advantage of the RStudio editor's auto formatting for indentation at line breaks.

---
# Assignment

- Use `<-` not `=`

```r
# Good
x <- 2

# Bad
x = 2
```

---
# Quotes

Use `"`, not `'`, for quoting text. The only exception is when the text already contains double quotes and no single quotes.

```r
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  # Good
  labs(title = "`Shine bright like a diamond`",
  # Good
       x = "Diamond prices",
  # Bad
       y = 'Frequency')
```

---
background-image: url(images/allison-horst-dplyr-wrangling.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

.black.large[
Source: Artwork by @allison_horst
]

---
# Overview

.pull-left[
- `filter()`
- `select()`
- `mutate()`
- `arrange()`

]

.pull-right[
- `group_by()`
- `summarise()`
- `count()`
]

---
background-image: url(images/allison-horst-tidyverse-celestial.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, bg-black

.left.white.large[
Artwork by @allison_horst
]

---
class: transition
# R Packages

```r
avail_pkg <- available.packages()
dim(avail_pkg)
## [1] 15367    17
```

As of 2020-03-18 there are 15367 R packages available from the repositories that `available.packages()` checks (by default, CRAN).

---

# Name clashes

```r
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0          ✓ purrr   0.3.3.9000
## ✓ tibble  2.1.3          ✓ dplyr   0.8.5     
## ✓ tidyr   1.0.2          ✓ stringr 1.4.0     
## ✓ readr   1.3.1          ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x purrr::is_null()    masks testthat::is_null()
## x dplyr::lag()        masks stats::lag()
## x dplyr::matches()    masks tidyr::matches(), testthat::matches()
```

---
# Many R packages

- A blessing & a curse! 
- So many packages available, it can make it hard to choose!
- Many of the packages are designed to solve a specific problem
- The tidyverse is designed to work with many other packages following a consistent philosophy
- What this means is that you shouldn't notice it!

???

Extra reading:

We have been loading the `tidyverse` package. It's actually a suite of packages, and you can learn more about the individual packages at https://www.tidyverse.org. You could load each individually.

Because so many people contribute packages to R, it is a blessing and a curse.

???

The best techniques are available, but there can be conflicts between function names. When you load tidyverse it prints a great summary of conflicts that it knows about, between its functions and others.

For example, there is a `filter` function in the `stats` package that comes with the R distribution. This can cause confusion when you want to use the filter function in `dplyr` (part of tidyverse). To be sure the function you use is the one you want to use, you can prefix it with the package name, `dplyr::filter()`.
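
A small sketch of the prefix in action, on made-up data (`scores` is an assumption for illustration). The two `filter()` functions do entirely different jobs, which is why the prefix matters.

```r
library(dplyr)

scores <- tibble(x = 1:6)

# Explicitly ask for the dplyr version: keeps rows where x is bigger than 3
kept <- scores %>% dplyr::filter(x > 3)
kept

# stats::filter() is a different function altogether: it applies a
# linear filter (here a 2-point moving average) to a numeric series
smoothed <- stats::filter(1:6, rep(1 / 2, 2))
smoothed
```
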
---
class: transition

# Let's talk about data

---
background-image: url(images/french_fries.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

???

This was an actual experiment in Food Sciences at Iowa State University. The goal was to find out if some cheaper oil options could be used to make hot chips, that is, whether people would be unable to distinguish chips fried in the new oils from those fried in the current market leader.

Twelve tasters were recruited to sample two chips from each batch, over a period of ten weeks. The same oil was kept for a period of 10 weeks! May be a bit gross by the end!

This data set was brought to R by Hadley Wickham, and was one of the problems that inspired the thinking about tidy data, and the evolution of the `tidyverse` tools.

---

# Example: french fries

- Experiment in Food Sciences at Iowa State University. 
- Aim: find if cheaper oil could be used to make hot chips
- Question: Can people distinguish chips fried in the new oils from those fried in the current market leader oil?
- 12 tasters recruited 
- Each sampled two chips from each batch
- Over a period of ten weeks.

Same oil kept for a period of 10 weeks! May be a bit gross!

---
# Example: french-fries - pivoting into long form

```r
french_fries <- read_csv("data/french_fries.csv")
french_fries
```

```
## # A tibble: 6 x 9
##    time treatment subject   rep potato buttery grassy rancid painty
##   <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
## 1     1         1       3     1    2.9     0      0      0      5.5
## 2     1         1       3     2   14       0      0      1.1    0  
## 3     1         1      10     1   11       6.4    0      0      0  
## 4     1         1      10     2    9.9     5.9    2.9    2.2    0  
## 5     1         1      15     1    1.2     0.1    0      1.1    5.1
## 6     1         1      15     2    8.8     3      3.6    1.5    2.3
```

This data set was brought to R by Hadley Wickham, and was one of the problems that inspired the thinking about tidy data and the tidyverse set of tools.

---
# Example: french-fries - pivoting into long form

.pull-left[

```r
fries_long <- french_fries %>% 
  pivot_longer(cols = potato:painty,
               names_to = "type", 
               values_to = "rating")
fries_long
```
]

.pull-right[

```
## # A tibble: 3,480 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 3,470 more rows
```
]

---
# Example: french-fries - pivoting back

.pull-left[

```r
fries_long
## # A tibble: 3,480 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 3,470 more rows
```
]

.pull-right[

```r
fries_long %>% 
  pivot_wider(names_from = type,
              values_from = rating)
## # A tibble: 696 x 9
##     time treatment subject   rep potato buttery grassy rancid painty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5
##  2     1         1       3     2   14       0      0      1.1    0  
##  3     1         1      10     1   11       6.4    0      0      0  
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0  
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4  
##  9     1         1      19     1    7       3.2    0      4.9    3.2
## 10     1         1      19     2   13       0      3.1    4.3   10.3
## # … with 686 more rows
```
]
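
The round trip can be tried on a much smaller table. This is a sketch on made-up data (`wide` and its values are assumptions, not the french fries data); it only illustrates that `pivot_wider()` undoes `pivot_longer()`.

```r
library(dplyr)
library(tidyr)

# Made-up wide data: one row per taster, one column per rating type
wide <- tibble(
  subject = c(1, 2),
  potato = c(2.9, 14),
  grassy = c(0, 1.1)
)

# Wide to long: column names go to "type", cell values go to "rating"
long <- wide %>%
  pivot_longer(cols = potato:grassy,
               names_to = "type",
               values_to = "rating")
long

# Long back to wide: the inverse operation
wide_again <- long %>%
  pivot_wider(names_from = type,
              values_from = rating)
wide_again
```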

---
class: transition
# `filter()`

Choose observations from your data

---
# `filter()`: example

```r
fries_long %>%
  filter(subject == 10)
## # A tibble: 300 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1      10     1 potato    11  
##  2     1         1      10     1 buttery    6.4
##  3     1         1      10     1 grassy     0  
##  4     1         1      10     1 rancid     0  
##  5     1         1      10     1 painty     0  
##  6     1         1      10     2 potato     9.9
##  7     1         1      10     2 buttery    5.9
##  8     1         1      10     2 grassy     2.9
##  9     1         1      10     2 rancid     2.2
## 10     1         1      10     2 painty     0  
## # … with 290 more rows
```

---
# `filter()`: details

Filtering requires comparison to find the subset of observations of interest.  What do you think the following mean?

- `subject != 10` 
- `x > 10` 
- `x >= 10` 
- `class %in% c("A", "B")` 
- `!is.na(y)`

---
# `filter()`: details

`subject != 10`

finds all rows for every subject except subject 10

`x > 10`

finds all rows where variable `x` is greater than 10

`x >= 10`

finds all rows where variable `x` is greater than or equal to 10

`class %in% c("A", "B")`

finds all rows where variable `class` is either A or B

`!is.na(y)`

finds all rows that *DO NOT* have a missing value for variable `y`
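
Each of these comparisons can be tried on a small table. A minimal sketch, assuming a made-up data set `marks` whose column names mirror the expressions above:

```r
library(dplyr)

# Made-up data, for illustration only
marks <- tibble(
  subject = c(3, 10, 15, 16),
  x = c(8, 10, 12, 20),
  class = c("A", "B", "C", "A"),
  y = c(1.5, NA, 2.0, NA)
)

marks %>% filter(subject != 10)           # drops the row for subject 10
marks %>% filter(x > 10)                  # keeps x = 12 and x = 20
marks %>% filter(x >= 10)                 # also keeps x = 10
marks %>% filter(class %in% c("A", "B"))  # drops the class "C" row
marks %>% filter(!is.na(y))               # keeps rows where y is not missing
```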

---
# Your turn: open french-fries.Rmd

Filter the french fries data to have:

- only week 1
- oil type 1 (oil type is called treatment)
- oil types 1 and 3 but not 2
- weeks 1-4 only

---
# French Fries Filter: only week 1

```r
fries_long %>% filter(time == 1)
## # A tibble: 360 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 350 more rows
```

---
# French Fries Filter: oil type 1

```r
fries_long %>% filter(treatment == 1)
## # A tibble: 1,160 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 1,150 more rows
```

---
# French Fries Filter: oil types 1 and 3 but not 2

```r
fries_long %>% filter(treatment != 2)
## # A tibble: 2,320 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 2,310 more rows
```

---
# French Fries Filter: weeks 1-4 only

```r
fries_long %>% filter(time %in% 1:4)
## # A tibble: 1,440 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 1,430 more rows
```

---
class: transition

# About `%in%`

[demo]
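
A small sketch to go with the demo (the vector `weeks` is made up for illustration). `%in%` asks, for each element on the left, whether it appears anywhere on the right; `==` compares position by position instead.

```r
# For each element on the left, is it found anywhere on the right?
found <- c(1, 5, 3) %in% 1:4
found

weeks <- c(1, 2, 3, 4, 5, 6)

# "Is the week one of 4 or 1?"
is_member <- weeks %in% c(4, 1)
is_member

# == recycles c(4, 1) along weeks and compares element by element,
# which is usually not what you want for "is one of these values"
is_equal <- weeks == c(4, 1)
is_equal
```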

---
# `select()`

- Chooses which variables to keep in the data set. 
- Useful when there are many variables but you only need some of them for an analysis.

---
# `select()`: a comma separated list of variables, by name.

```r
french_fries %>% 
  select(time, 
         treatment, 
         subject)
## # A tibble: 696 x 3
##     time treatment subject
##    <dbl>     <dbl>   <dbl>
##  1     1         1       3
##  2     1         1       3
##  3     1         1      10
##  4     1         1      10
##  5     1         1      15
##  6     1         1      15
##  7     1         1      16
##  8     1         1      16
##  9     1         1      19
## 10     1         1      19
## # … with 686 more rows
```

---
# `select()`: **drop** selected variables by prefixing with `-`

```r
french_fries %>% 
  select(-time, 
         -treatment, 
         -subject)
## # A tibble: 696 x 6
##      rep potato buttery grassy rancid painty
##    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1    2.9     0      0      0      5.5
##  2     2   14       0      0      1.1    0  
##  3     1   11       6.4    0      0      0  
##  4     2    9.9     5.9    2.9    2.2    0  
##  5     1    1.2     0.1    0      1.1    5.1
##  6     2    8.8     3      3.6    1.5    2.3
##  7     1    9       2.6    0.4    0.1    0.2
##  8     2    8.2     4.4    0.3    1.4    4  
##  9     1    7       3.2    0      4.9    3.2
## 10     2   13       0      3.1    4.3   10.3
## # … with 686 more rows
```

---
# `select()`

.left-code[
Inside `select()` you can use text-matching of the names like `starts_with()`, `ends_with()`, `contains()`, `matches()`, or `everything()`
]

.right-plot[

```r
french_fries %>% 
  select(contains("e"))
## # A tibble: 696 x 5
##     time treatment subject   rep buttery
##    <dbl>     <dbl>   <dbl> <dbl>   <dbl>
##  1     1         1       3     1     0  
##  2     1         1       3     2     0  
##  3     1         1      10     1     6.4
##  4     1         1      10     2     5.9
##  5     1         1      15     1     0.1
##  6     1         1      15     2     3  
##  7     1         1      16     1     2.6
##  8     1         1      16     2     4.4
##  9     1         1      19     1     3.2
## 10     1         1      19     2     0  
## # … with 686 more rows
```
]
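
The other helpers work the same way. A sketch on a one-row made-up table (`fries`) that only borrows the column names of `french_fries`:

```r
library(dplyr)

# Made-up data with the same column names as french_fries
fries <- tibble(time = 1, treatment = 1, subject = 3, rep = 1,
                potato = 2.9, buttery = 0, grassy = 0,
                rancid = 0, painty = 5.5)

fries %>% select(starts_with("t"))   # time, treatment
fries %>% select(ends_with("y"))     # buttery, grassy, painty
fries %>% select(matches("^p"))      # regular expression: potato, painty
fries %>% select(rep, everything())  # move rep to the front, keep the rest
```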

---
# `select()`: Using it

.left-code[
You can use the colon, `:`, to choose variables in order of the columns
]

.right-plot[

```r
french_fries %>% 
  select(time:subject)
## # A tibble: 696 x 3
##     time treatment subject
##    <dbl>     <dbl>   <dbl>
##  1     1         1       3
##  2     1         1       3
##  3     1         1      10
##  4     1         1      10
##  5     1         1      15
##  6     1         1      15
##  7     1         1      16
##  8     1         1      16
##  9     1         1      19
## 10     1         1      19
## # … with 686 more rows
```
]

---
class: transition
# Your turn: back to the french fries data

- `select()` time, treatment and rep
- `select()` subject through to rating
- drop subject

---
background-image: url(images/allison-horst-dplyr-mutate.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

.purple.large.right[
Artwork by @allison_horst
]

---
# `mutate()`: create a new variable; keep existing ones

```r
french_fries 
## # A tibble: 696 x 9
##     time treatment subject   rep potato buttery grassy rancid painty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5
##  2     1         1       3     2   14       0      0      1.1    0  
##  3     1         1      10     1   11       6.4    0      0      0  
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0  
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4  
##  9     1         1      19     1    7       3.2    0      4.9    3.2
## 10     1         1      19     2   13       0      3.1    4.3   10.3
## # … with 686 more rows
```

---
# `mutate()`: create a new variable; keep existing ones

```r
french_fries %>% 
* mutate(rainty = rancid + painty)
## # A tibble: 696 x 10
##     time treatment subject   rep potato buttery grassy rancid painty rainty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5   5.5 
##  2     1         1       3     2   14       0      0      1.1    0     1.1 
##  3     1         1      10     1   11       6.4    0      0      0     0   
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0     2.2 
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1   6.20
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3   3.8 
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2   0.3 
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4     5.4 
##  9     1         1      19     1    7       3.2    0      4.9    3.2   8.1 
## 10     1         1      19     2   13       0      3.1    4.3   10.3  14.6 
## # … with 686 more rows
```

---
class: transition

# Your turn: french fries

Compute a new variable called `lrating` by taking the log of the rating

---
# `summarise()`: boil data down to a one-row summary

```r
fries_long
```

```
## # A tibble: 6 x 6
##    time treatment subject   rep type    rating
##   <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
## 1     1         1       3     1 potato     2.9
## 2     1         1       3     1 buttery    0  
## 3     1         1       3     1 grassy     0  
## 4     1         1       3     1 rancid     0  
## 5     1         1       3     1 painty     5.5
## 6     1         1       3     2 potato    14
```

```r
fries_long %>% 
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 1 x 1
##   rating
##    <dbl>
## 1   3.16
```
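
Why `na.rm = TRUE`? A minimal sketch on made-up ratings (`ratings` is an assumption for illustration), which also shows that `summarise()` can compute several summaries at once:

```r
library(dplyr)

# Made-up ratings with one missing value
ratings <- tibble(rating = c(2.9, 0, NA, 5.5))

# Without na.rm = TRUE, one missing value makes the mean missing
with_na <- ratings %>% summarise(rating = mean(rating))
with_na

# Several summaries at once, ignoring missing values
overall <- ratings %>%
  summarise(
    mean_rating = mean(rating, na.rm = TRUE),
    n_missing = sum(is.na(rating))
  )
overall
```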

---
class: transition
# What if we want a summary for each `type`?

use `group_by()`

---
# Using `summarise()` + `group_by()`

Produce summaries for every group:

```r
fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 5 x 2
##   type    rating
##   <chr>    <dbl>
## 1 buttery  1.82 
## 2 grassy   0.664
## 3 painty   2.52 
## 4 potato   6.95 
## 5 rancid   3.85
```

---
class: transition
# Your turn: Back to french-fries.Rmd

- Compute the average rating by subject
- Compute the average rancid rating per week

---
# french fries answers

```r
fries_long %>% 
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 12 x 2
##    subject rating
##      <dbl>  <dbl>
##  1       3   2.46
##  2      10   4.24
##  3      15   2.16
##  4      16   3.00
##  5      19   4.54
##  6      31   4.00
##  7      51   4.39
##  8      52   2.72
##  9      63   3.48
## 10      78   1.94
## 11      79   1.94
## 12      86   2.94
```

---
# french fries answers

```r
fries_long %>% 
  filter(type == "rancid") %>%
  group_by(time) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 10 x 2
##     time rating
##    <dbl>  <dbl>
##  1     1   2.36
##  2     2   2.85
##  3     3   3.72
##  4     4   3.60
##  5     5   3.53
##  6     6   4.08
##  7     7   3.89
##  8     8   4.27
##  9     9   4.67
## 10    10   6.07
```

---
# `arrange()`: orders data by a given variable.

Useful for display of results (but there are other uses!)

```r
fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 5 x 2
##   type    rating
##   <chr>    <dbl>
## 1 buttery  1.82 
## 2 grassy   0.664
## 3 painty   2.52 
## 4 potato   6.95 
## 5 rancid   3.85
```

---
# `arrange()`

```r
fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(rating)
## # A tibble: 5 x 2
##   type    rating
##   <chr>    <dbl>
## 1 grassy   0.664
## 2 buttery  1.82 
## 3 painty   2.52 
## 4 rancid   3.85 
## 5 potato   6.95
```

---
class: transition
# Your turn: french-fries.Rmd - arrange

- Arrange the average rating by type in decreasing order
- Arrange the average subject rating in order lowest to highest.

---
# `arrange()` answers

```r
fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(desc(rating))
## # A tibble: 5 x 2
##   type    rating
##   <chr>    <dbl>
## 1 potato   6.95 
## 2 rancid   3.85 
## 3 painty   2.52 
## 4 buttery  1.82 
## 5 grassy   0.664
```

---
# `arrange()` answers

```r
fries_long %>% 
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm = TRUE)) %>%
  arrange(rating)
## # A tibble: 12 x 2
##    subject rating
##      <dbl>  <dbl>
##  1      78   1.94
##  2      79   1.94
##  3      15   2.16
##  4       3   2.46
##  5      52   2.72
##  6      86   2.94
##  7      16   3.00
##  8      63   3.48
##  9      31   4.00
## 10      10   4.24
## 11      51   4.39
## 12      19   4.54
```

---
# `count()` the number of things in a given column

```r
fries_long %>% 
  count(type, sort = TRUE)
## # A tibble: 5 x 2
##   type        n
##   <chr>   <int>
## 1 buttery   696
## 2 grassy    696
## 3 painty    696
## 4 potato    696
## 5 rancid    696
```
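
One way to think about `count()`: it is shorthand for `group_by()` followed by `summarise(n = n())`. A sketch on made-up data (`tastings` is an assumption for illustration):

```r
library(dplyr)

# Made-up data, for illustration only
tastings <- tibble(type = c("potato", "grassy", "potato", "rancid", "potato"))

# The short way
by_count <- tastings %>% count(type)
by_count

# The long way: group, then count the rows in each group
by_hand <- tastings %>%
  group_by(type) %>%
  summarise(n = n())
by_hand
```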

---
class: transition left
# Your turn: `count()`

- count the number of subjects
- count the number of types

---
class: transition
# French Fries: Putting it together to problem solve

---
# French Fries: Are ratings similar?

.pull-left[

```r
fries_long %>% 
  group_by(type) %>%
  summarise(
    m = mean(rating, 
             na.rm = TRUE), 
    sd = sd(rating, 
            na.rm = TRUE)) %>%
  arrange(-m)
## # A tibble: 5 x 3
##   type        m    sd
##   <chr>   <dbl> <dbl>
## 1 potato  6.95   3.58
## 2 rancid  3.85   3.78
## 3 painty  2.52   3.39
## 4 buttery 1.82   2.41
## 5 grassy  0.664  1.32
```
]

.pull-right[

The ratings differ a lot across the five scales. On average the chips rate high on potato'y (mean 6.95) but low on grassy (mean 0.664).

]

---
# French Fries: Are ratings similar?

```r
ggplot(fries_long,
       aes(x = type, 
           y = rating)) +
  geom_boxplot()
```

---
# French Fries: Are reps like each other?

```r
fries_spread <- fries_long %>% 
  pivot_wider(names_from = rep, 
              values_from = rating)
  
fries_spread
## # A tibble: 1,740 x 6
##     time treatment subject type      `1`   `2`
##    <dbl>     <dbl>   <dbl> <chr>   <dbl> <dbl>
##  1     1         1       3 potato    2.9  14  
##  2     1         1       3 buttery   0     0  
##  3     1         1       3 grassy    0     0  
##  4     1         1       3 rancid    0     1.1
##  5     1         1       3 painty    5.5   0  
##  6     1         1      10 potato   11     9.9
##  7     1         1      10 buttery   6.4   5.9
##  8     1         1      10 grassy    0     2.9
##  9     1         1      10 rancid    0     2.2
## 10     1         1      10 painty    0     0  
## # … with 1,730 more rows
```

---
# French Fries: Are reps like each other?

```r
summarise(fries_spread,
          r = cor(`1`, `2`, use = "complete.obs"))
## # A tibble: 1 x 1
##       r
##   <dbl>
## 1 0.668
```
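
Why `use = "complete.obs"`? A minimal sketch with made-up numbers (not the french fries data): a single missing value makes `cor()` return `NA`, unless we tell it to drop incomplete pairs first.

```r
# Toy replicate ratings (hypothetical); rep 2 has one missing value
rep_1 <- c(2.9, 0.0, 5.5, 11.0, 6.4)
rep_2 <- c(14.0, 0.0, NA, 9.9, 5.9)

# Default behaviour: any missing value gives NA
r_default <- cor(rep_1, rep_2)

# "complete.obs" drops the pairs that contain a missing value
r_complete <- cor(rep_1, rep_2, use = "complete.obs")

# Same as removing the third pair by hand
r_by_hand <- cor(rep_1[-3], rep_2[-3])

r_default
r_complete
```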

---
# French Fries: Are reps like each other?

```r
ggplot(fries_spread,
       aes(x = `1`, 
           y = `2`)) + 
  geom_point() + 
  labs(title = "Data is poor quality: the replicates do not look like each other!")
```

---
# French Fries: Replicates by rating type

```r
fries_spread %>%
  group_by(type) %>%
  summarise(r = cor(x = `1`, 
                    y = `2`, 
                    use = "complete.obs"))
## # A tibble: 5 x 2
##   type        r
##   <chr>   <dbl>
## 1 buttery 0.650
## 2 grassy  0.239
## 3 painty  0.479
## 4 potato  0.616
## 5 rancid  0.391
```

---

# French Fries: Replicates by rating type

```r
ggplot(fries_spread,
       aes(x = `1`, y = `2`)) +
  geom_point() +
  facet_wrap(~type, ncol = 5)
```

Potato'y and buttery have better replication than the other scales, but there is still a lot of variation from rep 1 to 2.

---
class: transition

# When to use quotes? `"` `'`, nothing, or backtick?

---

# When to use quotes? `"` `'`, nothing, or backtick?

- Use no quotes (bare variable names) when the variable already exists in the data
- Otherwise use quoted strings, for example when naming a new variable that does not exist yet

Example:
```r
fries_long %>% 
  pivot_wider(names_from = type,
              values_from = rating)
```

```r
french_fries %>% 
  pivot_longer(cols = potato:painty,
               names_to = "type", 
               values_to = "rating")
```
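
A minimal runnable sketch of the same rule on made-up toy data (the `toy_*` objects are hypothetical, shaped like `french_fries`): existing columns are bare names, new column names are strings.

```r
library(dplyr)
library(tidyr)

# Toy wide data (hypothetical values)
toy_wide <- tibble(
  subject = c(3, 10),
  potato  = c(2.9, 11.0),
  grassy  = c(0.0, 2.9)
)

# potato and grassy already exist as columns: no quotes.
# "type" and "rating" do not exist yet: quotes.
toy_long <- toy_wide %>%
  pivot_longer(cols = potato:grassy,
               names_to = "type",
               values_to = "rating")

# type and rating exist now, so they are bare names here
toy_back <- toy_long %>%
  pivot_wider(names_from = type,
              values_from = rating)

toy_long
toy_back
```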

---

# When to use quotes? `"` `'`, nothing, or backtick?

Variables with unusual names (starting with a number, containing spaces, or containing special characters like `!@#$%^&*()-`) need to be referenced with backticks:

```r
data %>% select(`name with spaces`)
```
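
A minimal sketch on made-up toy data (all names here are hypothetical) showing backticks in `select()`, and how renaming once with `rename()` lets you drop the backticks afterwards:

```r
library(dplyr)

# Toy data (hypothetical) with names that are not valid R names
toy <- tibble(
  `name with spaces` = c("a", "b"),
  `1` = c(2.9, 11.0),
  `2` = c(14.0, 9.9)
)

# Backticks are needed to refer to these columns by name
picked <- toy %>%
  select(`name with spaces`, `1`)

# Rename once, then use bare names from here on
toy_clean <- toy %>%
  rename(name  = `name with spaces`,
         rep_1 = `1`,
         rep_2 = `2`)

picked
toy_clean
```

Watch out: without backticks, `select(toy, 1)` picks the first column by position, not the column named `1`.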

---
<iframe width="1040" height="650" src="https://www.youtube.com/embed/i4RGqzaNEtg" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>

---
# Lab exercise: Exploring PISA data

Open `pisa.Rmd` on rstudio cloud.

---
# Assignment 1

**It will be launched later today**
- Instructions to appear on ED and the course website

**When is the assignment due?**

- 1st April, 2020 5pm

**How do I complete the assignment?**

- You should complete as much of the assignment as you can by yourself
- Then once you have done as much as you can, work with your group to

**I don't have a group / I can't get in contact with my group**

- If you don't have a group, make sure you have filled in this form [here](https://forms.gle/XJGSByKDXgKXuxZb9) (it has also been posted on ED)
- I will assign everyone who has filled in the form to a group

---
# Assignment 1

**How do I stay in touch with my group?**

- Get in touch with your group and decide how you will work together
  - You can use Zoom through Monash to create video/audio group calls
  - You could create a Slack team
  - You can communicate via email, WhatsApp, Messenger, whatever you all agree on

**How do I submit the assignment?**

- You submit the assignment via ED - instructions to follow

---
class: transition
# Lab Quiz

Time to take the lab quiz.

Notes for current slide

Notes for next slide

These slides are viewed best by Chrome and occasionally need to be refreshed if elements did not load properly. See here for PDF .

Press the right arrow to progress to the next slide!

1/96

ETC1010: Introduction to Data Analysis

Week 2, part B

Week of Tidy Data + Style

Lecturer: Nicholas Tierney

Department of Econometrics and Business Statistics

ETC1010.Clayton-x@monash.edu

11th Mar 2020

1/96

Update on how the class is delivered2/96

How the class will now be delivered: LectorialsLectorials are now recorded using Echo360
Do not come into class, listen to the lectorials online and complete the exercises on rstudio cloud or locally.
3/96

How the class will now be delivered: Lab/quizzes

These will still be posted weekly, but we will give you an extra day or two to complete them

Reading quizzes we expect you to complete before the lecture starts
- So, Reading quiz 2A should be completed prior to lecture 2A
  - These will be closed shortly after lecture 2a starts (With some leeway as we transition into online classes to give you all a chance to get used to things)
Lab quizzes require knowledge from the lecture - these need to be completed after the lecture
- So, lab quiz 2A should be completed after Lecture 2a
- Again with the same leeway as for reading quiz 2a above

4/96

How the class will now be delivered

Assessignments

Assignment 1 will be posted today at the end of class
Assignments will be submitted online
- Please get in touch with us (if you haven’t already) if you are a group of 1, or cannot get in touch with your group members.

Other assessments

We will update you on this in more detail, but in short, these will be delivered and submitted online

Consult times

These will now be delivered online via a link to a zoom meeting, or other online video meeting service

5/96

There is a lot of change

There is a lot of change in the air, and things might seem uncertain.
I am committed to helping you all learn how to do data analysis.
Thank you all for your patience as we have changed this course. We are dealing with daily updates, and need to change on the fly.
Perhaps now more than ever it is becoming so very relevant to our daily lives that we understand data, and that we can communicate it to others.
Remember to get your information from reliable sources, like the WHO, the Australian Government, and see the latest data from Johns Hopkins.

6/96

Practice the most effective strategies we knowWash your hands often, practice good cough & sneeze etiquette.
Try to touch your face as little as possible (mouth, nose, and eyes).
Practice social distancing (no hugs, kisses, handshakes, high fives)
Do not attend concerts, stage plays, sporting events, or any other mass entertainment events.
Refrain from visiting museums, exhibitions, movie theaters, night clubs, and other entertainment venues.
Stay away from social gatherings and events, (club meetings, religious services, parties)
Reduce travel to a minimum. Don't travel long distances if not absolutely necessary.
Do not use public transportation if not absolutely necessary.
7/96

How do we know it works?
We have data from the last pandemic, the spanish flu.
Places that practice social distancing vs those who did not had drastically different numbers:

(from (Hatchett et al, 2007))

8/96

There is a lot of change

To brighten things up, here are two youtubers I’ve been watching lately to destress and have “COVID19 free time”

9/96

Your Turn: complete class survey

Available now on Ed, "Getting to know our class"

10/96

How to learn

I want to take some time to discuss ideas on learning, and how it ties into the course.

11/96

12/96

13/96

14/96

15/96

16/96

17/96

18/96

19/96

20/96

21/96

22/96

23/96

(demo)24/96

recapTraffic Light System: Green = "good!" ; Red = "Help!"
R + Rstudio
Tower of babel analogy for writing R code
Functions are  _
columns in data frames are accessed with _ ?
packages are installed with _ ?
packages are loaded with _ ?

Why do we care about Reproducibility?
Output + input of rmarkdown
I have an assignment group
I have made contact with my assignment group

25/96

The "pipe" operator - `%>%`

The symbol, %>% is referred to as the "pipe operator"

What you need to know:

Read it as "then"
It passes the output along to the next function

data %>%  
  select(age, height, hair_colour) %>% 
  filter(nationality == "australian")

" Use the data, THEN select the variables (columns), age, height, and hair_colour THEN filter so nationality is equal to "australian" "

26/96

The "pipe" operator - `%>%`

The symbol, %>% is referred to as the "pipe operator"

What you need to know:

Read it as "then"
It passes the output along to the next function

data %>%  
  select(age, height, hair_colour) %>% 
  filter(nationality == "australian")

" Use the data, THEN select the variables (columns), age, height, and hair_colour THEN filter so nationality is equal to "australian" "

That is all you need to know for the moment, but you can read more here

26/96

Problem solving (demo)

Some common questions you can ask yourself when something isn't working:

Have I got my data?
Does the thing exist? (Check environment)
Have I run the code from the top down to where I am now?
Did none of that work? (Now Restart R)
Is the column I want there?
Try using quotes "", or no quotes, or (last resort) backticks

27/96

Style guide

"Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." -- Hadley Wickham

Style guide for this course is based on the Tidyverse style guide: http://style.tidyverse.org/
There's more to it than what we'll cover today, we'll mention more as we introduce more functionality, and do a recap later in the semester

28/96

File names and code chunk labels

Do not use spaces in file names, use - or _ to separate words
Use all lowercase letters

# Good
ucb-admit.csv
# Bad
UCB Admit.csv

29/96

Object names

Use _ to separate words in object names
Use informative but short object names
Do not reuse object names within an analysis

# Good
acs_employed
# Bad
acs.employed
acs2
acs_subset
acs_subsetted_for_males

30/96

Spacing

Put a space before and after all infix operators (=, +, -, <-, etc.), and when naming arguments in function calls.
Always put a space after a comma, and never before (just like in regular English).

# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)
# Bad
average<-mean(feet/12+inches,na.rm=TRUE)

31/96

ggplot

Always end a line with +
Always indent the next line

# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()
# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()

32/96

Long linesLimit your code to 80 characters per line. This fits comfortably on a printed page with a reasonably sized font.
Take advantage of RStudio editor's auto formatting for indentation at line breaks.
33/96

Assignment

Use <- not =

# Good
x <- 2
# Bad
x = 2

34/96

Quotes

Use ", not ', for quoting text. The only exception is when the text already contains double quotes and no single quotes.

ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram() +
  # Good
  labs(title = "`Shine bright like a diamond`",
  # Good
       x = "Diamond prices",
  # Bad
       y = 'Frequency')

35/96

Source: Artwork by @allison_horst

36/96

Overviewfilter()
select()
mutate()
arrange()

group_by()
summarise()
count()

37/96

Artwork by @allison_horst

38/96

R Packages

avail_pkg <- available.packages()
dim(avail_pkg)
## [1] 15367    17

As of 2020-03-18 there are 15367 R packages available

39/96

Name clashes

library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0          ✓ purrr   0.3.3.9000
## ✓ tibble  2.1.3          ✓ dplyr   0.8.5     
## ✓ tidyr   1.0.2          ✓ stringr 1.4.0     
## ✓ readr   1.3.1          ✓ forcats 0.5.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter()     masks stats::filter()
## x dplyr::group_rows() masks kableExtra::group_rows()
## x purrr::is_null()    masks testthat::is_null()
## x dplyr::lag()        masks stats::lag()
## x dplyr::matches()    masks tidyr::matches(), testthat::matches()

40/96

Many R packagesA blessing & a curse! 
So many packages available, it can make it hard to choose!
Many of the packages are designed to solve a specific problem
The tidyverse is designed to work with many other packages following a consistent philosophy
What this means is that you shouldn't notice it!
41/96

For example, there is a filter function in the stats package that comes with the R distribution. This can cause confusion when you want to use the filter function in dplyr (part of tidyverse). To be sure the function you use is the one you want to use, you can prefix it with the package name, dplyr::filter().

Let's talk about data42/96

43/96

Twelve tasters were recruited to sample two chips from each batch, over a period of ten weeks. The same oil was kept for a period of 10 weeks! May be a bit gross by the end!

This data set was brought to R by Hadley Wickham, and was one of the problems that inspired the thinking about tidy data, and the evolution of the tidyverse tools.

Example: french fries

Experiment in Food Sciences at Iowa State University.
Aim: find if cheaper oil could be used to make hot chips
Question: Can people distinguish between chips fried in the new oils relative to those current market leader oil.
12 tasters recruited
Each sampled two chips from each batch
Over a period of ten weeks.

Same oil kept for a period of 10 weeks! May be a bit gross!

44/96

Example: french-fries - pivoting into long form

french_fries <- read_csv("data/french_fries.csv")
french_fries

## # A tibble: 6 x 9
##    time treatment subject   rep potato buttery grassy rancid painty
##   <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
## 1     1         1       3     1    2.9     0      0      0      5.5
## 2     1         1       3     2   14       0      0      1.1    0  
## 3     1         1      10     1   11       6.4    0      0      0  
## 4     1         1      10     2    9.9     5.9    2.9    2.2    0  
## 5     1         1      15     1    1.2     0.1    0      1.1    5.1
## 6     1         1      15     2    8.8     3      3.6    1.5    2.3

45/96

Example: french-fries - pivoting into long form

french_fries <- read_csv("data/french_fries.csv")
french_fries

## # A tibble: 6 x 9
##    time treatment subject   rep potato buttery grassy rancid painty
##   <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
## 1     1         1       3     1    2.9     0      0      0      5.5
## 2     1         1       3     2   14       0      0      1.1    0  
## 3     1         1      10     1   11       6.4    0      0      0  
## 4     1         1      10     2    9.9     5.9    2.9    2.2    0  
## 5     1         1      15     1    1.2     0.1    0      1.1    5.1
## 6     1         1      15     2    8.8     3      3.6    1.5    2.3

This data set was brought to R by Hadley Wickham, and was one of the problems that inspired the thinking about tidy data and the tidyverse set of tools

45/96

Example: french-fries - pivoting into long formfries_long <- french_fries %>% 
  pivot_longer(cols = potato:painty,
               names_to = "type", 
               values_to = "rating")
fries_long

46/96

Example: french-fries - pivoting into long formfries_long <- french_fries %>% 
  pivot_longer(cols = potato:painty,
               names_to = "type", 
               values_to = "rating")
fries_long

## # A tibble: 3,480 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 3,470 more rows
46/96

Example: french-fries - pivoting backfries_long
## # A tibble: 3,480 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 3,470 more rows

fries_long %>% 
  pivot_wider(names_from = type,
              values_from = rating)
## # A tibble: 696 x 9
##     time treatment subject   rep potato buttery grassy rancid painty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5
##  2     1         1       3     2   14       0      0      1.1    0  
##  3     1         1      10     1   11       6.4    0      0      0  
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0  
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4  
##  9     1         1      19     1    7       3.2    0      4.9    3.2
## 10     1         1      19     2   13       0      3.1    4.3   10.3
## # … with 686 more rows

47/96

`filter()`

choose observations from your data

48/96

`filter()`: example

fries_long %>%
  filter(subject == 10)
## # A tibble: 300 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1      10     1 potato    11  
##  2     1         1      10     1 buttery    6.4
##  3     1         1      10     1 grassy     0  
##  4     1         1      10     1 rancid     0  
##  5     1         1      10     1 painty     0  
##  6     1         1      10     2 potato     9.9
##  7     1         1      10     2 buttery    5.9
##  8     1         1      10     2 grassy     2.9
##  9     1         1      10     2 rancid     2.2
## 10     1         1      10     2 painty     0  
## # … with 290 more rows

49/96

`filter()`: details

Filtering requires comparison to find the subset of observations of interest. What do you think the following mean?

subject != 10
x > 10
x >= 10
class %in% c("A", "B")
!is.na(y)

03:00

50/96

`filter()`: details

subject != 10

51/96

`filter()`: details

subject != 10

Find rows corresponding to all subjects except subject 10

51/96

`filter()`: details

subject != 10

Find rows corresponding to all subjects except subject 10

x > 10

51/96

`filter()`: details

subject != 10

Find rows corresponding to all subjects except subject 10

x > 10

find all rows where variable x has values bigger than 10

x >= 10

51/96

`filter()`: details

subject != 10

Find rows corresponding to all subjects except subject 10

x > 10

find all rows where variable x has values bigger than 10

x >= 10

finds all rows variable x is greater than or equal to 10.

class %in% c("A", "B")

51/96

`filter()`: details

subject != 10

Find rows corresponding to all subjects except subject 10

x > 10

find all rows where variable x has values bigger than 10

x >= 10

finds all rows variable x is greater than or equal to 10.

class %in% c("A", "B")

finds all rows where variable class is either A or B

!is.na(y)

51/96

`filter()`: details

subject != 10

Find rows corresponding to all subjects except subject 10

x > 10

find all rows where variable x has values bigger than 10

x >= 10

finds all rows variable x is greater than or equal to 10.

class %in% c("A", "B")

finds all rows where variable class is either A or B

!is.na(y)

finds all rows that DO NOT have a missing value for variable y

51/96

Your turn: open french-fries.Rmd

Filter the french fries data to have:

only week 1
oil type 1 (oil type is called treatment)
oil types 1 and 3 but not 2
weeks 1-4 only

52/96

French Fries Filter: only week 1

fries_long %>% filter(time == 1)
## # A tibble: 360 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 350 more rows

53/96

French Fries Filter: oil type 1

fries_long %>% filter(treatment == 1)
## # A tibble: 1,160 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 1,150 more rows

54/96

French Fries Filter: oil types 1 and 3 but not 2

fries_long %>% filter(treatment != 2)
## # A tibble: 2,320 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 2,310 more rows

55/96

French Fries Filter: weeks 1-4 only

fries_long %>% filter(time %in% c("1", "2", "3", "4"))
## # A tibble: 1,440 x 6
##     time treatment subject   rep type    rating
##    <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
##  1     1         1       3     1 potato     2.9
##  2     1         1       3     1 buttery    0  
##  3     1         1       3     1 grassy     0  
##  4     1         1       3     1 rancid     0  
##  5     1         1       3     1 painty     5.5
##  6     1         1       3     2 potato    14  
##  7     1         1       3     2 buttery    0  
##  8     1         1       3     2 grassy     0  
##  9     1         1       3     2 rancid     1.1
## 10     1         1       3     2 painty     0  
## # … with 1,430 more rows

56/96

about `%in%`

[demo]

57/96

select()58/96

select()Chooses which variables to keep in the data set. 
Useful when there are many variables but you only need some of them for an analysis. 
58/96

`select()`: a comma separated list of variables, by name.

french_fries %>% 
  select(time, 
         treatment, 
         subject)
## # A tibble: 696 x 3
##     time treatment subject
##    <dbl>     <dbl>   <dbl>
##  1     1         1       3
##  2     1         1       3
##  3     1         1      10
##  4     1         1      10
##  5     1         1      15
##  6     1         1      15
##  7     1         1      16
##  8     1         1      16
##  9     1         1      19
## 10     1         1      19
## # … with 686 more rows

59/96

select(): drop selected variables by prefixing with -60/96

`select()`: drop selected variables by prefixing with `-`

french_fries %>% 
  select(-time, 
         -treatment, 
         -subject)
## # A tibble: 696 x 6
##      rep potato buttery grassy rancid painty
##    <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1    2.9     0      0      0      5.5
##  2     2   14       0      0      1.1    0  
##  3     1   11       6.4    0      0      0  
##  4     2    9.9     5.9    2.9    2.2    0  
##  5     1    1.2     0.1    0      1.1    5.1
##  6     2    8.8     3      3.6    1.5    2.3
##  7     1    9       2.6    0.4    0.1    0.2
##  8     2    8.2     4.4    0.3    1.4    4  
##  9     1    7       3.2    0      4.9    3.2
## 10     2   13       0      3.1    4.3   10.3
## # … with 686 more rows

60/96

`select()`

Inside select() you can use text-matching of the names like starts_with(), ends_with(), contains(), matches(), or everything()

61/96

`select()`

Inside select() you can use text-matching of the names like starts_with(), ends_with(), contains(), matches(), or everything()

french_fries %>% 
  select(contains("e"))
## # A tibble: 696 x 5
##     time treatment subject   rep buttery
##    <dbl>     <dbl>   <dbl> <dbl>   <dbl>
##  1     1         1       3     1     0  
##  2     1         1       3     2     0  
##  3     1         1      10     1     6.4
##  4     1         1      10     2     5.9
##  5     1         1      15     1     0.1
##  6     1         1      15     2     3  
##  7     1         1      16     1     2.6
##  8     1         1      16     2     4.4
##  9     1         1      19     1     3.2
## 10     1         1      19     2     0  
## # … with 686 more rows

61/96

`select()`: Using it

You can use the colon, :, to choose variables in order of the columns

62/96

`select()`: Using it

You can use the colon, :, to choose variables in order of the columns

french_fries %>% 
  select(time:subject)
## # A tibble: 696 x 3
##     time treatment subject
##    <dbl>     <dbl>   <dbl>
##  1     1         1       3
##  2     1         1       3
##  3     1         1      10
##  4     1         1      10
##  5     1         1      15
##  6     1         1      15
##  7     1         1      16
##  8     1         1      16
##  9     1         1      19
## 10     1         1      19
## # … with 686 more rows

62/96

Your turn: back to the french fries dataselect() time, treatment and rep
select() subject through to rating
drop subject

03:00
63/96

Artwork by @allison_horst

64/96

`mutate()`: create a new variable; keep existing ones

french_fries 
## # A tibble: 696 x 9
##     time treatment subject   rep potato buttery grassy rancid painty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5
##  2     1         1       3     2   14       0      0      1.1    0  
##  3     1         1      10     1   11       6.4    0      0      0  
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0  
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4  
##  9     1         1      19     1    7       3.2    0      4.9    3.2
## 10     1         1      19     2   13       0      3.1    4.3   10.3
## # … with 686 more rows

65/96

`mutate()`: create a new variable; keep existing ones

french_fries %>% 
  mutate(rainty = rancid + painty)
## # A tibble: 696 x 10
##     time treatment subject   rep potato buttery grassy rancid painty rainty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5   5.5 
##  2     1         1       3     2   14       0      0      1.1    0     1.1 
##  3     1         1      10     1   11       6.4    0      0      0     0   
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0     2.2 
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1   6.20
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3   3.8 
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2   0.3 
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4     5.4 
##  9     1         1      19     1    7       3.2    0      4.9    3.2   8.1 
## 10     1         1      19     2   13       0      3.1    4.3   10.3  14.6 
## # … with 686 more rows

66/96

Your turn: french fries

Compute a new variable called lrating by taking a log of the rating

02:00

67/96

`summarise()`: boil data down to one row observation

fries_long

## # A tibble: 6 x 6
##    time treatment subject   rep type    rating
##   <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
## 1     1         1       3     1 potato     2.9
## 2     1         1       3     1 buttery    0  
## 3     1         1       3     1 grassy     0  
## 4     1         1       3     1 rancid     0  
## 5     1         1       3     1 painty     5.5
## 6     1         1       3     2 potato    14

68/96

`summarise()`: boil data down to one row observation

fries_long

## # A tibble: 6 x 6
##    time treatment subject   rep type    rating
##   <dbl>     <dbl>   <dbl> <dbl> <chr>    <dbl>
## 1     1         1       3     1 potato     2.9
## 2     1         1       3     1 buttery    0  
## 3     1         1       3     1 grassy     0  
## 4     1         1       3     1 rancid     0  
## 5     1         1       3     1 painty     5.5
## 6     1         1       3     2 potato    14

fries_long %>% 
  summarise(rating = mean(rating, na.rm = TRUE))
## # A tibble: 1 x 1
##   rating
##    <dbl>
## 1   3.16

68/96

What if we want a summary for each type?69/96

What if we want a summary for each `type`?

use group_by()

69/96

Using `summarise()` + `group_by()`

Produce summaries for every group:

fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm=TRUE))
## # A tibble: 5 x 2
##   type    rating
##   <chr>    <dbl>
## 1 buttery  1.82 
## 2 grassy   0.664
## 3 painty   2.52 
## 4 potato   6.95 
## 5 rancid   3.85

70/96

Your turn: Back to french-fries.RmdCompute the average rating by subject
Compute the average rancid rating per week

03:00
71/96

french fries answers

fries_long %>% 
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm=TRUE))
## # A tibble: 12 x 2
##    subject rating
##      <dbl>  <dbl>
##  1       3   2.46
##  2      10   4.24
##  3      15   2.16
##  4      16   3.00
##  5      19   4.54
##  6      31   4.00
##  7      51   4.39
##  8      52   2.72
##  9      63   3.48
## 10      78   1.94
## 11      79   1.94
## 12      86   2.94

72/96

french fries answers

fries_long %>% 
  filter(type == "rancid") %>%
  group_by(time) %>%
  summarise(rating = mean(rating, na.rm=TRUE))
## # A tibble: 10 x 2
##     time rating
##    <dbl>  <dbl>
##  1     1   2.36
##  2     2   2.85
##  3     3   3.72
##  4     4   3.60
##  5     5   3.53
##  6     6   4.08
##  7     7   3.89
##  8     8   4.27
##  9     9   4.67
## 10    10   6.07

`arrange()`: orders data by a given variable.

Useful for display of results (but there are other uses!)

fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm=TRUE)) 
## # A tibble: 5 x 2
##   type    rating
##   <chr>    <dbl>
## 1 buttery  1.82 
## 2 grassy   0.664
## 3 painty   2.52 
## 4 potato   6.95 
## 5 rancid   3.85

`arrange()`

fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm=TRUE)) %>%
  arrange(rating)
## # A tibble: 5 x 2
##   type    rating
##   <chr>    <dbl>
## 1 grassy   0.664
## 2 buttery  1.82 
## 3 painty   2.52 
## 4 rancid   3.85 
## 5 potato   6.95

Your turn: french-fries.Rmd - arrange

- Arrange the average rating by type in decreasing order
- Arrange the average subject rating in order from lowest to highest

`arrange()` answers

fries_long %>% 
  group_by(type) %>%
  summarise(rating = mean(rating, na.rm=TRUE)) %>%
  arrange(desc(rating))
## # A tibble: 5 x 2
##   type    rating
##   <chr>    <dbl>
## 1 potato   6.95 
## 2 rancid   3.85 
## 3 painty   2.52 
## 4 buttery  1.82 
## 5 grassy   0.664

`arrange()` answers

fries_long %>% 
  group_by(subject) %>%
  summarise(rating = mean(rating, na.rm=TRUE)) %>%
  arrange(rating)
## # A tibble: 12 x 2
##    subject rating
##      <dbl>  <dbl>
##  1      78   1.94
##  2      79   1.94
##  3      15   2.16
##  4       3   2.46
##  5      52   2.72
##  6      86   2.94
##  7      16   3.00
##  8      63   3.48
##  9      31   4.00
## 10      10   4.24
## 11      51   4.39
## 12      19   4.54
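
Two details of `arrange()` are worth a quick sketch on made-up data (`toy` and its values are illustrative assumptions). For a numeric column, `desc(rating)` and `-rating` give the same ordering. A second argument breaks ties in the first.

```r
library(dplyr)

# Hypothetical toy data with a tie in rating (subjects "a" and "c")
toy <- tibble(
  subject = c("a", "b", "c", "d"),
  rating  = c(2, 5, 2, 3)
)

# Two ways to sort a numeric column from highest to lowest
by_desc  <- toy %>% arrange(desc(rating))
by_minus <- toy %>% arrange(-rating)

# Sort by rating, then break ties by subject in reverse alphabetical order
tie_break <- toy %>% arrange(rating, desc(subject))

tie_break$subject  # "c" "a" "d" "b"
```

`desc()` also works on character columns, where unary minus does not, so it is the safer habit.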

`count()` the number of things in a given column

fries_long %>% 
  count(type, sort = TRUE)
## # A tibble: 5 x 2
##   type        n
##   <chr>   <int>
## 1 buttery   696
## 2 grassy    696
## 3 painty    696
## 4 potato    696
## 5 rancid    696
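
`count()` can be thought of as shorthand for `group_by()` followed by `summarise(n = n())`. This is a sketch on made-up data (`toy`, `counted`, and `by_hand` are illustrative names):

```r
library(dplyr)

# Hypothetical toy data: one row per rating, only the type column kept
toy <- tibble(type = c("potato", "grassy", "potato", "rancid", "potato"))

# Shorthand
counted <- toy %>% count(type)

# The longer equivalent
by_hand <- toy %>%
  group_by(type) %>%
  summarise(n = n())

# sort = TRUE puts the largest group first
sorted <- toy %>% count(type, sort = TRUE)

sorted$type  # "potato" "grassy" "rancid"
```

In the output above every type has 696 rows, which would explain why `sort = TRUE` leaves the rows in alphabetical order there: tied counts keep the order they already had.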

Your turn: `count()`

- Count the number of subjects
- Count the number of types

French Fries: Putting it together to problem solve

French Fries: Are ratings similar?

fries_long %>% 
  group_by(type) %>%
  summarise(
    m = mean(rating, 
             na.rm = TRUE), 
    sd = sd(rating, 
            na.rm = TRUE)) %>%
  arrange(-m)
## # A tibble: 5 x 3
##   type        m    sd
##   <chr>   <dbl> <dbl>
## 1 potato  6.95   3.58
## 2 rancid  3.85   3.78
## 3 painty  2.52   3.39
## 4 buttery 1.82   2.41
## 5 grassy  0.664  1.32

The ratings differ considerably across scales. On average the chips score high on the potato scale (mean 6.95) and low on the grassy scale (mean 0.664).

French Fries: Are ratings similar?

ggplot(fries_long,
       aes(x = type, 
           y = rating)) +
  geom_boxplot()

French Fries: Are reps like each other?

fries_spread <- fries_long %>% 
  pivot_wider(names_from = rep, 
              values_from = rating)
fries_spread
## # A tibble: 1,740 x 6
##     time treatment subject type      `1`   `2`
##    <dbl>     <dbl>   <dbl> <chr>   <dbl> <dbl>
##  1     1         1       3 potato    2.9  14  
##  2     1         1       3 buttery   0     0  
##  3     1         1       3 grassy    0     0  
##  4     1         1       3 rancid    0     1.1
##  5     1         1       3 painty    5.5   0  
##  6     1         1      10 potato   11     9.9
##  7     1         1      10 buttery   6.4   5.9
##  8     1         1      10 grassy    0     2.9
##  9     1         1      10 rancid    0     2.2
## 10     1         1      10 painty    0     0  
## # … with 1,730 more rows

French Fries: Are reps like each other?

summarise(fries_spread,
          r = cor(`1`, `2`, use = "complete.obs"))
## # A tibble: 1 x 1
##       r
##   <dbl>
## 1 0.668
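
The `use = "complete.obs"` argument tells `cor()` to drop any pair in which either value is missing. This is a sketch on made-up data (`toy` and its values are illustrative assumptions, chosen so the complete pairs lie exactly on a line):

```r
library(dplyr)

# Hypothetical toy replicates: row 4 is missing rep 2, row 5 is missing rep 1
toy <- tibble(
  `1` = c(1, 2, 3, 4, NA),
  `2` = c(2, 4, 6, NA, 10)
)

# Default behaviour: any missing value makes the correlation NA
r_default <- toy %>%
  summarise(r = cor(`1`, `2`))

# complete.obs: only the first three rows (the complete pairs) are used
r_complete <- toy %>%
  summarise(r = cor(`1`, `2`, use = "complete.obs"))

r_default$r   # NA
r_complete$r  # 1, up to floating point error
```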

French Fries:

ggplot(fries_spread,
       aes(x = `1`, 
           y = `2`)) + 
  geom_point() + 
  labs(title = "Data is poor quality: the replicates do not look like each other!")

French Fries: Replicates by rating type

fries_spread %>%
  group_by(type) %>%
  summarise(r = cor(x = `1`, 
                    y = `2`, 
                    use = "complete.obs"))
## # A tibble: 5 x 2
##   type        r
##   <chr>   <dbl>
## 1 buttery 0.650
## 2 grassy  0.239
## 3 painty  0.479
## 4 potato  0.616
## 5 rancid  0.391

French Fries: Replicates by rating type

ggplot(fries_spread, aes(x=`1`, y=`2`)) + 
  geom_point() + facet_wrap(~type, ncol = 5)

The potato and buttery scales replicate better than the others (r = 0.616 and 0.650), but there is still a lot of variation between rep 1 and rep 2.

When to use quotes? `"` `'`, nothing, or backtick?

Use no quotes (a bare variable name) when the variable already exists in the data
Otherwise use a string, for example when naming a column that is about to be created

Example:

fries_long %>% 
  pivot_wider(names_from = type,
              values_from = rating)

french_fries %>% 
  pivot_longer(cols = potato:painty,
               names_to = "type", 
               values_to = "rating")
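
The rule can be checked with a round trip on a small made-up tibble (`wide`, `long`, `back`, and the values are illustrative assumptions, not the real data):

```r
library(dplyr)
library(tidyr)

# Hypothetical wide data: one column per rating scale
wide <- tibble(
  subject = c(1, 2),
  potato  = c(5, 7),
  buttery = c(1, 2)
)

# "type" and "rating" do not exist yet, so they are given as strings;
# potato and buttery already exist, so they are bare names
long <- wide %>%
  pivot_longer(cols = potato:buttery,
               names_to = "type",
               values_to = "rating")

# type and rating now exist as columns, so bare names work
back <- long %>%
  pivot_wider(names_from = type,
              values_from = rating)

# Inside filter(), "potato" is a value stored in a column, so it stays a string
only_potato <- long %>% filter(type == "potato")
```

Here `back` should have the same columns and values as `wide`.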

When to use quotes? `"` `'`, nothing, or backtick?

Variables with unusual names (starting with a number, containing spaces, or containing special characters such as `!@#$%^&*()-`) need to be referenced with backticks:

data %>% select(`name with spaces`)
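
Names like this often come from pivoting on a numeric column, as with `rep` in the french fries example. This is a sketch on made-up data (`toy`, `toy_wide`, `toy_diff`, and `diff` are illustrative names):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two subjects, two replicates each
toy <- tibble(
  subject = c(1, 1, 2, 2),
  rep     = c(1, 2, 1, 2),
  rating  = c(5, 8, 0, 1)
)

# The values of rep become column names: "1" and "2"
toy_wide <- toy %>%
  pivot_wider(names_from = rep, values_from = rating)

names(toy_wide)  # "subject" "1" "2"

# Backticks make R read `2` and `1` as columns, not as the numbers 2 and 1
toy_diff <- toy_wide %>%
  mutate(diff = `2` - `1`)

toy_diff$diff  # 3 1
```

Without the backticks, `2 - 1` would be ordinary arithmetic and every row would get the value 1.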

Lab exercise: Exploring PISA data

Open pisa.Rmd on RStudio Cloud.

Assignment 1

It will be launched later today

Instructions to appear on ED and the course website

When is the assignment due?

1st April, 2020 5pm

How do I complete the assignment?

You should complete as much of the assignment as you can by yourself
Then once you have done as much as you can, work with your group to

I don't have a group / I can't get in contact with my group

If you don't have a group, make sure you have filled in this form here (it has also been posted on ED)
I will assign everyone into a group who has filled in the form

Assignment 1

How do I stay in touch with my group?

Get in touch with your group and decide how you will work together
- You can use Zoom through Monash to create video/audio group calls
- You could create a Slack team
- You can communicate via email, WhatsApp, Messenger, or whatever you all agree on

How do I submit the assignment?

You submit the assignment via ED - instructions to follow

Lab Quiz

Time to take the lab quiz.
