<div class="shade_black"  style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;">
<i class="fas fa-exclamation-circle"></i> These slides are best viewed in Chrome and occasionally need to be refreshed if elements do not load properly. See <a href="https://ida.numbat.space/slides/lecture_2a.pdf/">here for PDF <i class="fas fa-file-pdf"></i></a>.
</div>

<br>

.white[Press the **right arrow** to progress to the next slide!]

---

background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number split-70 title-slide
count: false

.column.shade_black[.content[

<br>

# .monash-blue.outline-text[ETC1010: Introduction to Data Analysis]

<br>

.bottom_abs.width100[

Lecturer: *Nicholas Tierney*

Department of Econometrics and Business Statistics

<span><i class="fas  fa-envelope faa-float animated "></i></span>  ETC1010.Clayton-x@monash.edu

16th Mar 2020

<br>
]

]]

---
class: transition middle

# What is this song?

(Discuss with your neighbour)

---
class: transition middle

# Quick Talk about COVID-19

(Borrowed from [Dr. Andrew Heiss](https://evalsp20.classes.andrewheiss.com/slides/PMAP-8521_2020-03-11.pdf))

---
# What is all this

- New virus in the coronavirus family
- Officially named "SARS-CoV-2"
- Causes a respiratory disease named COVID-19
- Do not call it "Chinese Coronavirus" or "Kung Flu" or other xenophobic names!

---

# Symptoms

- Fever and dry cough initially; pneumonia-like
- respiratory failure later for vulnerable people

- Up to two weeks can pass between exposure and symptoms

---

background-image: url(gifs/covid19-flatten-curve.gif)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

---
background-image: url(https://timchurches.github.io/blog/posts/2020-03-10-modelling-the-effects-of-public-health-interventions-on-covid-19-transmission-part-1/modelling-the-effects-of-public-health-interventions-on-covid-19-transmission-part-1_files/figure-html5/unnamed-chunk-13-1.png)
background-size: contain
background-position: 50% 50%
class: center, bottom, white

From [Tim Churches' blog](https://timchurches.github.io/blog/posts/2020-03-10-modelling-the-effects-of-public-health-interventions-on-covid-19-transmission-part-1/)

---
# What can you do?

- Wash hands for 20 seconds
- Disinfect phone
- Don't touch your face
- Stay home if you’re sick
- Practice social distancing
- Limit non-essential travel
- Don’t buy masks
- Stock up on essentials but don’t hoard

---

# What can we do?

- We **will** get through this
- Humor can be an effective way to reduce anxiety in situations like these [(Yovetich et al., 1990)](https://journals.sagepub.com/doi/abs/10.2466/pr0.1990.66.1.51?casa_token=b-L7KSArSkcAAAAA:GvljfzwAkPjvs2Fo4li4pVEL_YRenTzCGApBlW2L7fQwNnr4BKBjCbNJk7ijRi7GTbWPFKyczBKGEw)
- On that note...

https://www.instagram.com/p/B9FFVnigLEE/?utm_source=ig_embed

Singapore's videos on COVID19

- https://www.youtube.com/watch?v=Hcx0LJJ-hLU
- https://www.youtube.com/watch?v=ywOEkzO86ms

Vietnam's awesome pop track

- https://www.youtube.com/watch?v=V9YirNgAzXI

---

# What does this mean for our class?

- **Stay home if you are feeling unwell**
- **Lectorials are now being recorded**
- Monash is advising everyone to proceed as normal, unless you are feeling unwell
- **If you are feeling unwell in any way, do not come to university**
- I am committed to helping you all succeed and keep learning!
- [Monash's COVID19 Updates](https://www.monash.edu/news/coronavirus-updates)
- [Monash's COVID19 Fact sheet](https://www.monash.edu/news/novel-corona-virus-fact-sheet)

---
class: refresher

# Recap

- packages are installed with ___ ?
- packages are loaded with ___ ?
- Why do we care about Reproducibility?
- Output + input of rmarkdown

---
class: transition middle

# About your instructors

---
# Nick

.pull-left[
* 🎓 Bachelor of Psychological Sciences UQ
*  🎓 PhD in Statistics at QUT. 
* Research: missing data, data visualisation, statistical computing
* R 📦: `naniar`, `visdat`
* `#rstats` 🎤: Credibly Curious with Saskia Freytag
* ❤️ outdoors, especially: 🥾, 🏃‍♂️, and 🧗‍♂️.
]

.pull-right[
<img src="images/njtierney.jpg" width="80%" style="display: block; margin: auto;" />

]

---
# Steph

.pull-left[
* 🎓 Bachelor of Economics and Bachelor of Commerce from Monash
* Studying a Master of Statistics at QUT, based at Monash.
* Loves to read 📖; any and all recommendations are welcome.
* Has an R package called [taipan](https://github.com/srkobakian/taipan), and another called [sugarbag](https://github.com/srkobakian/sugarbag).
]

.pull-right[
<img src="images/steph.jpeg" width="80%" style="display: block; margin: auto;" />

]

---

# Sarah

- 🎓 MPhil student in Applied Mathematics and Statistics at Monash University. Research predicts mosquito behaviour (ask me for mosquito facts!)
- Commenced in 2017, moved from Adelaide
- Loves figure skating ⛸

---

# Nitika

* 🎓 Bachelor of Bioinformatics
* 🎓 Master of Bioinformatics
* Current: PhD Student in the Faculty of Medicine, Nursing and Health Sciences
* Data Officer with [Monash Data Fluency](https://monashdatafluency.github.io/)
* Research: Bioinformatics analysis with RNA seq data
* ❤️ Travel, Food, Anime, D&D.

---

# Sherry

.pull-left[
- 🎓 Bachelor of Commerce 2018
- Honours in Econometrics 2019 with Di Cook 
- Commenced PhD programme 2020
- Created her first ever R package, `quickdraw`
- Loves puzzle games like jigsaws 🧩.
]

.pull-right[
<img src="images/sherry.jpeg" width="80%" style="display: block; margin: auto;" />
]

---

# Di

.pull-left[
- Professor at Monash University in Melbourne, Australia, doing research in statistics, data science, visualisation, and statistical computing.
- Created the current version of the course
- Likes to play all sorts of sports (tennis, soccer, hockey, cricket) and to go boogie boarding.
]

.pull-right[
<img src="images/di.png" width="80%" style="display: block; margin: auto;" />
]

---
class: transition left
# Your Turn: Making the groups

We are going to set up the groups for doing assignment work.

1. Find your name from the list at [this link](https://ida.numbat.space/groups/groups)
2. Find the other people in the class with the same group as you (feel free to wander around the class!)
3. Grab your gear and claim a table to work together at
4. Email the group to work out how to best stay in touch

---
class: transition left
# Your Turn: Ask your team mates these questions:

1. What is one food you'd never want to taste again?
2. If you were a comic strip character, who would you be and why?

Lastly, come up with a name for your team (we have provided a suggested name, but you are free to change it!) and tell it to a tutor, along with the names of the team members.

---
# Traffic Light System

---
# Traffic Light System

.pull-left.middle[

.red[
# Red Post-it
]

* I need a hand
* Slow down

]

.pull-right.middle[
.green[
# Green Post-it
]

* I am up to speed
* I have completed  the thing
]

---

# Today: Outline

- Tidy Data
- Terminology of data
- Different examples of data 
- Steps in making data tidy
- Lots of examples

---

# A note on difficulty

* This is not a programming course - it is a course about **data, modelling, and computing**.

- At the moment, you might be sitting there, feeling a bit confused about where we are, what we are doing, what R is, and how it even works.
- That is OK!

- The theory of this class will only get you so far
- The real learning happens from doing the data analysis - the **pressure of a deadline can also help.**

- I want to take a moment to run through RStudio, what it is, and how it works again. (demo)

---
# Tidy Data

.blockquote[
You're ready to sit down with a newly-obtained dataset, excited about how it will open a world of insight and understanding, and then find you can't use it. You'll first have to spend a significant amount of time to restructure the data to even begin to produce a set of basic descriptive statistics or link it to other data you've been using.

--John Spencer 
([Measure Evaluation](https://www.measureevaluation.org/resources/newsroom/blogs/tidy-data-and-how-to-get-it))
]

---
# Tidy Data

.blockquote[
"Tidy data" is a term meant to provide a framework for producing data that conform to standards that make data easier to use. Tidy data may still require some cleaning for analysis, but the job will be much easier.

--John Spencer 
([Measure Evaluation](https://www.measureevaluation.org/resources/newsroom/blogs/tidy-data-and-how-to-get-it))
]

---

# Example: US graduate programs

- Data from a study on US grad programs. 
- Originally came in an Excel file containing rankings of many different programs.
- Contains information on four programs:
  1. Astronomy
  1. Economics
  1. Entomology, and 
  1. Psychology

---

# Example: US graduate programs

```r
library(tidyverse)
grad <- read_csv(here::here("slides/data/graduate-programs.csv"))
grad
## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8             5   
##  4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
##  5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6             6   
##  7 econom… CALI…      0.99      2.31         56.4          83.3             4   
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9             5.2 
## 10 econom… CLAR…      0.47      0.7          24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>,
## #   AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>
```

---
# Example: US graduate programs

Good things about the format:

```
## # A tibble: 6 x 16
##   subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##   <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
## 1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
## 2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
## 3 econom… BOST…      0.51      1.03         43.5          46.8             5   
## 4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
## 5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
## 6 econom… BROW…      0.84      2.31         27.1          54.6             6   
## # … with 9 more variables: PctMinorityFac <dbl>, PctFemaleFac <dbl>,
## #   PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>, AvGREs <dbl>,
## #   TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>
```

- **Rows** contain information about one program at one institution

- **Columns** contain types of information, like average number of publications, average number of citations, and % completion

---

# Example: US graduate programs

Easy to make summaries:

```r
grad %>% count(subject)
## # A tibble: 4 x 2
##   subject        n
##   <chr>      <int>
## 1 astronomy     32
## 2 economics    117
## 3 entomology    27
## 4 psychology   236
```

---

# Example: US graduate programs

Easy to make summaries:

```r
grad %>%
  filter(subject == "economics") %>%
  summarise(mean = mean(NumStud),
            s = sd(NumStud))
## # A tibble: 1 x 2
##    mean     s
##   <dbl> <dbl>
## 1  60.7  39.4
```

---

# Example: US graduate programs

Easy to make a plot:

.pull-left[

```r
grad %>%
  filter(subject == "economics") %>%
  ggplot(aes(x = NumStud, 
             y = MedianTimetoDegree)) +
  geom_point() + 
  theme(aspect.ratio = 1)
```
]

.pull-right[
<img src="lecture_2a_files/figure-html/gra-dplot-out-1.png" width="100%" style="display: block; margin: auto;" />
]

---
class: transition left
# Your Turn: Open Lecture 2A in RStudio Cloud

- Notice the `data/` directory with many datasets! 
- Open `graduate-programs.Rmd`
- Answer these questions:
    - "What is the average number of graduate students per economics program?"
    - "What is the best description of the relationship between number of students and median time to degree?"
- Use the traffic light system if you need a hand.

???

- "The average number of graduate students per economics program is:"
- "about 61" (correct)
- about 39

"What is the best description of the relationship between number of students and median time to degree?"

- "as the number of students increases the median time to degree increases, weakly" (correct)
- as the number of students increases the variability in median time to degree decreases

---
class: refresher
.left-code[
What could this image say about R?

]

.right-plot[
<img src="images/tower-of-babel.jpg" width="100%" style="display: block; margin: auto;" />
]

---
# Terminology of data: Variable

- A quantity, quality, or property that you can measure. 
- For the grad programs, these would be all the column headers.

---
# Terminology of data: Observation

- A set of measurements made under similar conditions
- Contains several values, each associated with a different variable.
- For the grad programs, institution and program together uniquely define the observation.

---
# Terminology of data: Value

- Is the state of a variable when you measure it. 
- The value of a variable typically changes from observation to observation.
- For the grad programs, this is the value in each cell

---
# Tidy tabular form

__Tabular data__ is a set of values, each associated with a variable and an observation. Tabular data is __tidy__ iff (if and only if):

* Each variable is in its own column,
* Each observation is in its own row,
* Each value is in its own `cell`.
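
---
# Tidy tabular form: a quick check

One practical way to test "each observation in its own row" is to count how often each combination of identifying variables appears. The sketch below uses made-up toy data (the `programs` table, its institution names, and its values are all hypothetical) and assumes that subject plus institution should identify a row.

```r
library(dplyr)
library(tibble)

# Toy data (hypothetical institutions and values), laid out in tidy form:
# one row per program at an institution, one column per variable.
programs <- tribble(
  ~subject,    ~inst,    ~num_stud,
  "economics", "Inst A",        60,
  "economics", "Inst B",        45,
  "astronomy", "Inst A",        12
)

# If subject + inst uniquely define an observation,
# no combination should appear more than once.
duplicates <- programs %>%
  count(subject, inst) %>%
  filter(n > 1)

# Zero rows here means every observation sits in its own row
nrow(duplicates)
```

If `duplicates` has any rows, either the data has repeated observations or another identifying variable is missing from the check.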

---
class: transition
# Different examples of data

For each of these data examples, **let's try together to identify the variables and the observations** - some are HARD!

---

# The grad program

Is in **tidy** tabular form.

```
## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8             5   
##  4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
##  5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6             6   
##  7 econom… CALI…      0.99      2.31         56.4          83.3             4   
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9             5.2 
## 10 econom… CLAR…      0.47      0.7          24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>,
## #   AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>
```

---

# Your Turn: Genes experiment 🤔

```
## # A tibble: 3 x 12
##   id    `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 Gene…      2.18     2.20       4.20     2.63       5.06       4.54       5.53
## 2 Gene…      1.46     0.585      1.86     0.515      2.88       1.36       2.96
## 3 Gene…      2.03     0.870      3.28     0.533      4.63       2.18       5.56
## # … with 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>,
## #   `WM-12.R4` <dbl>
```

---
# Melbourne weather 😨

```
## # A tibble: 1,593 x 12
##    X1             X2 X3    X4       X5    X9   X13   X17   X21   X25   X29   X33
##    <chr>       <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ASN00086282  1970 07    TMAX    141   124   113   123   148   149   139   153
##  2 ASN00086282  1970 07    TMIN     80    63    36    57    69    47    84    78
##  3 ASN00086282  1970 07    PRCP      3    30     0     0    36     3     0     0
##  4 ASN00086282  1970 08    TMAX    145   128   150   122   109   112   116   142
##  5 ASN00086282  1970 08    TMIN     50    61    75    67    41    51    48    -7
##  6 ASN00086282  1970 08    PRCP      0    66     0    53    13     3     8     0
##  7 ASN00086282  1970 09    TMAX    168   168   162   162   162   150   184   179
##  8 ASN00086282  1970 09    TMIN     19    29    62    81    81    55    73    97
##  9 ASN00086282  1970 09    PRCP      0     0     0     0     3     5     0    38
## 10 ASN00086282  1970 10    TMAX    189   194   204   267   256   228   237   144
## # … with 1,583 more rows
```

---
# Tuberculosis notifications data taken from [WHO](http://www.who.int/tb/country/data/download/en/) 🤧

```
## # A tibble: 3,202 x 22
##    country  year new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 new_sp_m2534
##    <chr>   <dbl>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl>
##  1 Afghan…  1997         NA          NA           0           10            6
##  2 Afghan…  1998         NA          NA          30          129          128
##  3 Afghan…  1999         NA          NA           8           55           55
##  4 Afghan…  2000         NA          NA          52          228          183
##  5 Afghan…  2001         NA          NA         129          379          349
##  6 Afghan…  2002         NA          NA          90          476          481
##  7 Afghan…  2003         NA          NA         127          511          436
##  8 Afghan…  2004         NA          NA         139          537          568
##  9 Afghan…  2005         NA          NA         151          606          560
## 10 Afghan…  2006         NA          NA         193          837          791
## # … with 3,192 more rows, and 15 more variables: new_sp_m3544 <dbl>,
## #   new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>,
## #   new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
## #   new_sp_f65 <dbl>, new_sp_fu <dbl>
```

---
# French fries

.pull-left[
- 10-week sensory experiment
- 12 individuals assessed the taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?)
- Fries were fried in one of 3 different oils, replicated twice.
]

.pull-right[

]

---
# French fries: Variables? Observations?

```
## # A tibble: 696 x 9
##     time treatment subject   rep potato buttery grassy rancid painty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5
##  2     1         1       3     2   14       0      0      1.1    0  
##  3     1         1      10     1   11       6.4    0      0      0  
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0  
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4  
##  9     1         1      19     1    7       3.2    0      4.9    3.2
## 10     1         1      19     2   13       0      3.1    4.3   10.3
## # … with 686 more rows
```

---
# Rude Recliners data

- Data is collated from this story: [41% Of Fliers Think You're Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/)

- What are the variables?

```
## # A tibble: 3 x 6
##   V1         `V2:Always` `V2:Usually` `V2:About half the… `V2:Once in a wh… `V2:Never`
##   <chr>            <dbl>        <dbl>               <dbl>             <dbl>      <dbl>
## 1 No, not r…         124          145                  82               116         35
## 2 Yes, some…           9           27                  35               129         81
## 3 Yes, very…           3            3                  NA                11         54
```

---

# Messy vs tidy

.pull-left[
Messy data is messy in its own way. You can make unique solutions, but then another data set comes along, and you have to again make a unique solution. 
]

.pull-right[
Tidy data can be thought of as Lego bricks. Once you have this form, you can put it together in many different ways to make different analyses.

<img src="images/lego.png" width="100%" style="display: block; margin: auto;" />
]

---
# Data Tidying verbs

- `pivot_longer`: Specify the **names_to** (identifiers) and the **values_to** (measures) to make longer form data.
- `pivot_wider`: Variables split out in columns
- `separate`: Split one column into many

---

# One more time: `pivot_longer`

```r
pivot_longer(<DATA>,
             cols = <COLS>,
             names_to = <NAMES_TO>,
             values_to = <VALUES_TO>)
```

- **cols**: the columns to select are those whose names are values, not variables.
- **names_to**: name of the new variable that will hold the current column names.
- **values_to**: name of the new variable that will hold the values spread over the cells.

---
# `pivot_longer`: example

.left-code[

```r
table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766
```

]

.right-plot[

```r
table4a %>% 
  pivot_longer(cols = c("1999", "2000"),
               names_to = "year",
               values_to = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766
```

]
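
---
# `pivot_longer`: fixing the type of `year`

In the output above, `year` is a character column because it was built from column names. A minimal sketch of one way to turn it into a number, assuming `table4a` as shipped with `tidyr`; the name `table4a_long` is made up here, and `cols = -country` is just another way of selecting the same two columns.

```r
library(tidyr)
library(dplyr)

# table4a ships with tidyr; select every column except country
table4a_long <- table4a %>%
  pivot_longer(cols = -country,
               names_to = "year",
               values_to = "cases") %>%
  # year was created from column names, so convert it from character
  mutate(year = as.integer(year))

table4a_long
```

With `year` stored as an integer it can be used directly on a plot axis or in arithmetic.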

---
# Tidying genes data

Tell me what to put in each of the following:

- **cols** are the columns that represent values, not variables.
- **names_to** is the name of the new variable whose values are the current column names.
- **values_to** is the name of the new variable whose values are spread over the cells.

---
# Tidy genes data

.left-code[

```r
genes
## # A tibble: 3 x 12
##   id    `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 Gene…      2.18     2.20       4.20     2.63       5.06       4.54       5.53
## 2 Gene…      1.46     0.585      1.86     0.515      2.88       1.36       2.96
## 3 Gene…      2.03     0.870      3.28     0.533      4.63       2.18       5.56
## # … with 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>,
## #   `WM-12.R4` <dbl>
```

]

.right-plot[

```r
genes_long <- genes %>% 
  pivot_longer(cols = -id,
               names_to = "variable",
               values_to = "expr")

genes_long
## # A tibble: 33 x 3
##    id     variable  expr
##    <chr>  <chr>    <dbl>
##  1 Gene 1 WI-6.R1   2.18
##  2 Gene 1 WI-6.R2   2.20
##  3 Gene 1 WI-6.R4   4.20
##  4 Gene 1 WM-6.R1   2.63
##  5 Gene 1 WM-6.R2   5.06
##  6 Gene 1 WI-12.R1  4.54
##  7 Gene 1 WI-12.R2  5.53
##  8 Gene 1 WI-12.R4  4.41
##  9 Gene 1 WM-12.R1  3.85
## 10 Gene 1 WM-12.R2  4.18
## # … with 23 more rows
```
]

---
# Separate columns

.left-code[

```r
genes_long
## # A tibble: 33 x 3
##    id     variable  expr
##    <chr>  <chr>    <dbl>
##  1 Gene 1 WI-6.R1   2.18
##  2 Gene 1 WI-6.R2   2.20
##  3 Gene 1 WI-6.R4   4.20
##  4 Gene 1 WM-6.R1   2.63
##  5 Gene 1 WM-6.R2   5.06
##  6 Gene 1 WI-12.R1  4.54
##  7 Gene 1 WI-12.R2  5.53
##  8 Gene 1 WI-12.R4  4.41
##  9 Gene 1 WM-12.R1  3.85
## 10 Gene 1 WM-12.R2  4.18
## # … with 23 more rows
```

]

.right-plot[

```r
genes_long %>%
  separate(col = variable, 
           into = c("trt", "leftover"),
           sep = "-")
## # A tibble: 33 x 4
##    id     trt   leftover  expr
##    <chr>  <chr> <chr>    <dbl>
##  1 Gene 1 WI    6.R1      2.18
##  2 Gene 1 WI    6.R2      2.20
##  3 Gene 1 WI    6.R4      4.20
##  4 Gene 1 WM    6.R1      2.63
##  5 Gene 1 WM    6.R2      5.06
##  6 Gene 1 WI    12.R1     4.54
##  7 Gene 1 WI    12.R2     5.53
##  8 Gene 1 WI    12.R4     4.41
##  9 Gene 1 WM    12.R1     3.85
## 10 Gene 1 WM    12.R2     4.18
## # … with 23 more rows
```
]

---
# Separate columns

.pull-left[

```r
genes_long_tidy <- genes_long %>%
  separate(variable, 
           into = c("trt", "leftover"), 
           sep = "-") %>%
  separate(leftover, 
           into = c("time", "rep"), 
           sep = "\\.") 
```
]

.pull-right[

```r
genes_long_tidy
## # A tibble: 33 x 5
##    id     trt   time  rep    expr
##    <chr>  <chr> <chr> <chr> <dbl>
##  1 Gene 1 WI    6     R1     2.18
##  2 Gene 1 WI    6     R2     2.20
##  3 Gene 1 WI    6     R4     4.20
##  4 Gene 1 WM    6     R1     2.63
##  5 Gene 1 WM    6     R2     5.06
##  6 Gene 1 WI    12    R1     4.54
##  7 Gene 1 WI    12    R2     5.53
##  8 Gene 1 WI    12    R4     4.41
##  9 Gene 1 WM    12    R1     3.85
## 10 Gene 1 WM    12    R2     4.18
## # … with 23 more rows
```

]
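
---
# Separate columns: in one step

The two `separate` calls can also be folded into `pivot_longer` itself through its `names_sep` argument. This is a sketch rather than the approach used in the lecture: `genes_small` is a made-up name for a one-gene subset typed in from the values shown on the earlier slides, and the regular expression `"[-.]"` (split on a hyphen or a dot) is an assumption about how the column names are built.

```r
library(tibble)
library(tidyr)
library(dplyr)

# One gene and four of its columns, copied from the values on the slides
genes_small <- tribble(
  ~id,      ~`WI-6.R1`, ~`WM-6.R2`, ~`WI-12.R4`, ~`WM-12.R1`,
  "Gene 1",       2.18,       5.06,        4.41,        3.85
)

genes_small_tidy <- genes_small %>%
  pivot_longer(cols = -id,
               # each column name splits into three new variables
               names_to = c("trt", "time", "rep"),
               # split on "-" or "." (both literal inside [ ])
               names_sep = "[-.]",
               values_to = "expr")

genes_small_tidy
```

The result has the same columns as `genes_long_tidy` (`id`, `trt`, `time`, `rep`, `expr`). Doing it in two `separate` steps is easier to follow while learning, which may be why the slides take that route.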

---
# Demo: koala bilby data

Here is a little data to practice `pivot_longer`, `pivot_wider` and `separate` on.

```
## # A tibble: 5 x 5
##   ID    koala_NSW koala_VIC bilby_NSW bilby_VIC
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 grey         23        43        11         8
## 2 cream        56        89        22        17
## 3 white        35        72        13         6
## 4 black        28        44        19        16
## 5 taupe        25        37        21        12
```

---
# Exercise: koala bilby data

- Read over `koala-bilby.Rmd`
- `pivot_longer` the data into long form, naming the two new variables, `label` and `count`
- Separate the labels into two new variables, `animal`, `state`
- `pivot_wider` the long form data into wide form, where the columns are the states. 
- `pivot_wider` the long form data into wide form, where the columns are the animals.
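
---
# Exercise: koala bilby data (one possible approach)

The steps above can be sketched as follows. This is one way through the exercise, not necessarily the one in `koala-bilby.Rmd`: the data is typed in by hand from the demo slide, and the object names (`koala_bilby`, `kb_long`, `kb_by_state`, `kb_by_animal`) are made up for illustration.

```r
library(tibble)
library(tidyr)
library(dplyr)

# Data typed in from the demo slide
koala_bilby <- tribble(
  ~ID,     ~koala_NSW, ~koala_VIC, ~bilby_NSW, ~bilby_VIC,
  "grey",          23,         43,         11,          8,
  "cream",         56,         89,         22,         17,
  "white",         35,         72,         13,          6,
  "black",         28,         44,         19,         16,
  "taupe",         25,         37,         21,         12
)

# Long form, then split the label into animal and state
kb_long <- koala_bilby %>%
  pivot_longer(cols = -ID,
               names_to = "label",
               values_to = "count") %>%
  separate(col = label,
           into = c("animal", "state"),
           sep = "_")

# Wide form with one column per state
kb_by_state <- kb_long %>%
  pivot_wider(names_from = state, values_from = count)

# Wide form with one column per animal
kb_by_animal <- kb_long %>%
  pivot_wider(names_from = animal, values_from = count)
```

Both wide tables come from the same long table, which is the point of the Lego analogy: once the data is tidy, reshaping it for a particular comparison takes a single call.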

---
# Exercise 1: Rude Recliners

- Open `rude-recliners.Rmd`
- This contains data from the article [41% Of Fliers Think You're Rude If You Recline Your Seat](http://fivethirtyeight.com/datalab/airplane-etiquette-recline-seat/). 
- V1 is the response to question: "Is it rude to recline your seat on a plane?"
- V2 is the response to question: "Do you ever recline your seat when you fly?".

---
# Exercise 1: Rude Recliners (15 minutes)

Answer the following questions in the `rude-recliners.Rmd` rmarkdown document.

- 1A) What are the variables and observations in this data?

- 1B) Put the data in tidy long form (using the names `V2` as the key variable, and `count` as the value).

- 1C) Use the `rename` function to make the variable names a little shorter.

---
class: transition left
# Exercise 1: Answers

---
class: transition left
# Your Turn: Turn to the people next to you and ask 2 questions:

- Are you more of a dog or a cat person?
- What languages do you know how to speak?

---
# Exercise 2: Tuberculosis Incidence data (15 minutes)

Open: `tb-incidence.Rmd`

Tidy the TB incidence data, using the Rmd to prompt questions.

---
# Exercise 3: Currency rates (15 minutes)

- Open `currency-rates.Rmd`
- Read in `rates.csv`
- Answer the following questions:

1. What are the variables and observations?
2. `pivot_longer` the five currencies (AUD, GBP, JPY, CNY, CAD) into tidy long form.
3. Make line plots of the currencies, and describe the similarities and differences between them.

---
# Exercise 4: Australian Airport Passengers (optional!)

- Open `oz-airport.Rmd`
- Contains Airport Traffic Data for 1985–86 to 2017–18, from the web site of the [Department of Infrastructure, Regional Development and Cities](https://bitre.gov.au/publications/ongoing/airport_traffic_data.aspx).

- Read the dataset into R, naming it `passengers`
- Tidy the data to produce a data set with these columns
    - airport: all of the airports. 
    - year 
    - type_of_flight: DOMESTIC, INTERNATIONAL
    - bound: IN or OUT

---
class: refresher

# Recap

- Traffic Light System: Green = "good!" ; Red = "Help!"
- R + RStudio
- Functions are  ___
- columns in data frames are accessed with ___ ?

.red[If you have questions, place a red sticky note on your laptop.]

.white[If you are done, place a green sticky on your laptop]

---

# Lab quiz

Time to take the lab quiz.

---
class: informative
# A note on `pivot_wider` and `pivot_longer`, `gather` and `spread`

(Not needed to know for the course, but nice to know)

- Naming things is hard
- There are many ways to do the same thing in R
- You might have come across `spread` and `gather`, the older versions of the `pivot_` functions. They still work, but `pivot_wider` and `pivot_longer` replaced them in version 1.0.0 of the `tidyr` package.
- You can read more about this change here:
  - [tidyverse blog post](https://www.tidyverse.org/blog/2019/09/tidyr-1-0-0/)
  - [tidyr vignette](http://tidyr.tidyverse.org/articles/pivot.html)
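
---
# `gather` next to `pivot_longer`

To make the connection concrete, here is a sketch that reshapes `table4a` (shipped with `tidyr`) both ways. The object names `with_gather` and `with_pivot` are made up for illustration. The two functions return rows in a different default order, so the comparison sorts both results first.

```r
library(tidyr)
library(dplyr)

# Older interface: key/value
with_gather <- table4a %>%
  gather(key = "year", value = "cases", `1999`, `2000`)

# Newer interface: names_to/values_to
with_pivot <- table4a %>%
  pivot_longer(cols = c(`1999`, `2000`),
               names_to = "year",
               values_to = "cases")

# Same rows, different default ordering, so sort before comparing
all.equal(
  as.data.frame(arrange(with_gather, country, year)),
  as.data.frame(arrange(with_pivot, country, year))
)
```

`key` corresponds to `names_to` and `value` corresponds to `values_to`; the newer argument names say more directly where each piece ends up.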

---
# COVID19 references

- [Monash factsheet](https://www.monash.edu/news/novel-corona-virus-fact-sheet/_nocache)
- [Flatten the curve](https://www.flattenthecurve.com/)
- [Johns Hopkins interactive map](https://www.arcgis.com/apps/opsdashboard/index.html#/bda7594740fd40299423467b48e9ecf6)
- [Epidemiology modelling of COVID19](https://alhill.shinyapps.io/COVID19seir/)
- [Simulation of flattening the curve](https://robertasmith.shinyapps.io/covid19_shiny/)
- [COVID19 dashboard](https://ramikrispin.github.io/coronavirus_dashboard/)
- [COVID19 Data](https://ramikrispin.github.io/coronavirus/)
- [Dr. Norman Swan](https://www.youtube.com/watch?v=znJ9RD8gYsQ&feature=share&fbclid=IwAR3JBxaVw13dnlHVwMiFdlLjoyhmy2AroO6gmJj7zNwoa-ROBzZ6f9nzJtI)

---

background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number split-70
count: false

.column.shade_black[.content[

# That's it!

.bottom_abs.width100[

Lecturer: Nicholas Tierney

Department of Econometrics and Business Statistics<br>
<span><i class="fas  fa-envelope faa-float animated "></i></span>  ETC1010.Clayton-x@monash.edu

16th Mar 2020

]

<br />
This work is licensed under a <a rel="license" href="http://creativecommons.org/licenses/by-nc-sa/4.0/">Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License</a>.

]]

???

# Now let's use `pivot_wider` to examine different aspects

# Examine treatments against each other

.left-code[

```r
genes_long_tidy %>%
  pivot_wider(
    id_cols = c(id, rep, time),
    names_from = trt, 
    values_from = expr
    ) %>%
  ggplot(aes(x = WI, 
             y = WM, 
             colour = id)) + 
  geom_point()
```
]

.right-plot[
<img src="lecture_2a_files/figure-html/plot-genes-out-1.png" width="100%" style="display: block; margin: auto;" />
]

Generally, there is some negative association within each gene: WM is low if WI is high.

# Examine replicates against each other

.left-code[

```r
genes_long_tidy %>%
  pivot_wider(id_cols = c(id, trt, time),
              names_from = rep, 
              values_from = expr) %>%
  ggplot(aes(x=R1, y=R4, colour=id)) + 
  geom_point() + coord_equal()
```
]

.right-plot[
<img src="lecture_2a_files/figure-html/shoe-replicates-out-1.png" width="100%" style="display: block; margin: auto;" />
]

Roughly, replicate 4 is like replicate 1, e.g. if one is low, the other is low.

It is a good thing that the replicates are fairly similar.

Week 2, part A

Week of Tidy Data

Lecturer: Nicholas Tierney

Department of Econometrics and Business Statistics

ETC1010.Clayton-x@monash.edu

16th Mar 2020

What is this song?

(Discuss with your neighbour)

Quick Talk about COVID-19

(Borrowed from Dr. Andrew Heiss)

What is all this

New virus in the coronavirus family
Officially named "SARS-CoV-2"
Causes the respiratory disease named COVID-19
Do not call it "Chinese Coronavirus" or "Kung Flu" or other xenophobic names!

Symptoms

Fever and dry cough initially; pneumonia-like
respiratory failure later for vulnerable people
Up to two weeks can pass between exposure and symptoms

From Tim Churches' blog

What can you do?

Wash hands for 20 seconds
Disinfect phone
Don't touch your face
Stay home if you’re sick
Practice social distancing
Limit non-essential travel
Don’t buy masks
Stock up on essentials but don’t hoard

What can we do?

We will get through this
Humor can be an effective way to reduce anxiety in these types of situations (Yovetich et al., 1990)
On that note...

https://www.instagram.com/p/B9FFVnigLEE/?utm_source=ig_embed

Singapore's videos on COVID19

Vietnam's awesome pop track

https://www.youtube.com/watch?v=V9YirNgAzXI

What does this mean for our class?

Stay home if you are feeling unwell
Lectorials are now being recorded
Monash is advising everyone to proceed as normal, unless you are feeling unwell
If you are feeling unwell in any way, do not come to university
I am committed to helping you all succeed and keep learning!
Monash's COVID19 Updates
Monash's COVID19 Fact sheet

Recap

Packages are installed with _ ?
Packages are loaded with _ ?
Why do we care about Reproducibility?
Output + input of rmarkdown

About your instructors

Nick

🎓 Bachelor of Psychological Sciences UQ
🎓 PhD in Statistics at QUT.
Research: missing data, data visualisation, statistical computing
R 📦: naniar, visdat,
#rstats 🎤: Credibly Curious w Saskia Freytag
❤️ outdoors, especially: 🥾, 🏃‍♂️, and 🧗‍♂️.

Steph

🎓 Bachelor of Economics and Bachelor of Commerce from Monash
Studying a Masters of Statistics at QUT, based at Monash.
Loves to read 📖, any and all recommendations are welcome.
Has an R package called taipan, and another called sugarbag.

Sarah

🎓 MPhil student in Applied Mathematics and Statistics at Monash University. Research predicts mosquito behaviour (ask me for mosquito facts!)
Commenced in 2017, moved from Adelaide
Loves figure skating ⛸

Nitika

🎓 Bachelor of Bioinformatics
🎓 Master of Bioinformatics
Current: PhD Student in the Faculty of Medicine Nursing and Health Sciences
Data Officer with Monash Data Fluency
Research: Bioinformatics analysis with RNA seq data
❤️ Travel, Food, Anime, D&D.

Sherry

🎓 Bachelor of Commerce 2018
Honours in Econometrics 2019 with Di Cook
Commenced PhD programme 2020
Created her first ever R package, quickdraw
Loves puzzle games like jigsaws 🧩.

Di

Professor at Monash University in Melbourne Australia, doing research in statistics, data science, visualisation, and statistical computing.
Created the current version of the course
Likes to play all sorts of sports, tennis, soccer, hockey, cricket, and go boogie boarding.

Your Turn: Making the groups

We are going to set up the groups for doing assignment work.

Find your name from the list at this link
Find the other people in the class with the same group as you (feel free to wander around the class!)
Grab your gear and claim a table to work together at
Email the group to work out how to best stay in touch

Your Turn: Ask your team mates these questions:

What is one food you'd never want to taste again?
If you were a comic strip character, who would you be and why?

LASTLY, come up with a name for your team (we have provided a suggested name, but you are free to change it!) and tell this to a tutor, along with the names of members of the team.

Traffic Light System

Red Post-it
I need a hand
Slow down

Green Post-it
I am up to speed
I have completed the thing

Today: Outline

Tidy Data
Terminology of data
Different examples of data
Steps in making data tidy
Lots of examples

A note on difficulty

This is not a programming course - it is a course about data, modelling, and computing.

At the moment, you might be sitting there, feeling a bit confused about where we are, what we are doing, what R is, and how it even works.
That is OK!
The theory of this class will only get you so far
The real learning happens from doing the data analysis - the pressure of a deadline can also help.
I want to take a moment to run through RStudio, what it is, and how it works again. (demo)

Tidy Data

You're ready to sit down with a newly-obtained dataset, excited about how it will open a world of insight and understanding, and then find you can't use it. You'll first have to spend a significant amount of time to restructure the data to even begin to produce a set of basic descriptive statistics or link it to other data you've been using.

--John Spencer (Measure Evaluation)

Tidy Data

"Tidy data" is a term meant to provide a framework for producing data that conform to standards that make data easier to use. Tidy data may still require some cleaning for analysis, but the job will be much easier.

--John Spencer (Measure Evaluation)

Example: US graduate programs

Data from a study on US grad programs.
Originally came in an Excel file containing rankings of many different programs.
Contains information on four programs:
- Astronomy
- Economics
- Entomology, and
- Psychology

Example: US graduate programs

library(tidyverse)
grad <- read_csv(here::here("slides/data/graduate-programs.csv"))
grad
## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8             5   
##  4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
##  5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6             6   
##  7 econom… CALI…      0.99      2.31         56.4          83.3             4   
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9             5.2 
## 10 econom… CLAR…      0.47      0.7          24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>,
## #   AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>

Example: US graduate programs

Good things about the format:

## # A tibble: 6 x 16
##   subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##   <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
## 1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
## 2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
## 3 econom… BOST…      0.51      1.03         43.5          46.8             5   
## 4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
## 5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
## 6 econom… BROW…      0.84      2.31         27.1          54.6             6   
## # … with 9 more variables: PctMinorityFac <dbl>, PctFemaleFac <dbl>,
## #   PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>, AvGREs <dbl>,
## #   TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>

Rows contain information about the institution
Columns contain types of information, like average number of publications, average number of citations, and % completion.

Example: US graduate programs

Easy to make summaries:

grad %>% count(subject)
## # A tibble: 4 x 2
##   subject        n
##   <chr>      <int>
## 1 astronomy     32
## 2 economics    117
## 3 entomology    27
## 4 psychology   236

Example: US graduate programs

Easy to make summaries:

grad %>%
  filter(subject == "economics") %>%
  summarise(mean = mean(NumStud),
            s = sd(NumStud))
## # A tibble: 1 x 2
##    mean     s
##   <dbl> <dbl>
## 1  60.7  39.4

Example: US graduate programs

Easy to make a plot

grad %>%
  filter(subject == "economics") %>%
  ggplot(aes(x = NumStud, 
             y = MedianTimetoDegree)) +
  geom_point() + 
  theme(aspect.ratio = 1)

Your Turn: Open Lecture 2A in RStudio Cloud

Notice the data/ directory with many datasets!
Open graduate-programs.Rmd
Answer these questions:
- "What is the average number of graduate students per economics program?"
- "What is the best description of the relationship between number of students and median time to degree?"

Use the traffic light system if you need a hand.

"The average number of graduate students per economics program is:"
"about 61" (correct)
about 39

"What is the best description of the relationship between number of students and median time to degree?"

"as the number of students increases the median time to degree increases, weakly" (correct)
as the number of students increases the variability in median time to degree decreases

What could this image say about R?

Terminology of data: Variable

A quantity, quality, or property that you can measure.
For the grad programs, these would be all the column headers.

## # A tibble: 6 x 16
##   subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##   <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
## 1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
## 2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
## 3 econom… BOST…      0.51      1.03         43.5          46.8             5   
## 4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
## 5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
## 6 econom… BROW…      0.84      2.31         27.1          54.6             6   
## # … with 9 more variables: PctMinorityFac <dbl>, PctFemaleFac <dbl>,
## #   PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>, AvGREs <dbl>,
## #   TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>

Terminology of data: Observation

A set of measurements made under similar conditions
Contains several values, each associated with a different variable.
For the grad programs, institution and program together uniquely define the observation.

## # A tibble: 6 x 16
##   subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##   <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
## 1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
## 2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
## 3 econom… BOST…      0.51      1.03         43.5          46.8             5   
## 4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
## 5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
## 6 econom… BROW…      0.84      2.31         27.1          54.6             6   
## # … with 9 more variables: PctMinorityFac <dbl>, PctFemaleFac <dbl>,
## #   PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>, AvGREs <dbl>,
## #   TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>

Terminology of data: Value

Is the state of a variable when you measure it.
The value of a variable typically changes from observation to observation.
For the grad programs, this is the value in each cell

## # A tibble: 6 x 16
##   subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##   <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
## 1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
## 2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
## 3 econom… BOST…      0.51      1.03         43.5          46.8             5   
## 4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
## 5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
## 6 econom… BROW…      0.84      2.31         27.1          54.6             6   
## # … with 9 more variables: PctMinorityFac <dbl>, PctFemaleFac <dbl>,
## #   PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>, AvGREs <dbl>,
## #   TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>

Tidy tabular form

Tabular data is a set of values, each associated with a variable and an observation. Tabular data is tidy iff (if and only if):

Each variable in its own column,
Each observation in its own row,
Each value is placed in its own cell.
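
As a rough illustration of these three rules, the sketch below compares two of the example tables that ship with the tidyr package. Using `table1` here is an assumption made for illustration; it is not part of the original slides. `table1` follows the rules, while `table4a` stores the values of a variable (the year) in its column names.

```r
library(tidyr)

# table1: every column is a variable (country, year, cases, population)
# and every row is one country-year observation
names(table1)

# table4a: the column names `1999` and `2000` are values of a "year"
# variable, so this table is not in tidy form
names(table4a)
```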

Different examples of data

For each of these data examples, let's try together to identify the variables and the observations - some are HARD!

The grad program

Is in tidy tabular form.

## # A tibble: 412 x 16
##    subject Inst  AvNumPubs AvNumCits PctFacGrants PctCompletion MedianTimetoDeg…
##    <chr>   <chr>     <dbl>     <dbl>        <dbl>         <dbl>            <dbl>
##  1 econom… ARIZ…      0.9       1.57         31.3          31.7             5.6 
##  2 econom… AUBU…      0.79      0.64         77.6          44.4             3.84
##  3 econom… BOST…      0.51      1.03         43.5          46.8             5   
##  4 econom… BOST…      0.49      2.66         36.9          34.2             5.5 
##  5 econom… BRAN…      0.3       3.03         36.8          48.7             5.29
##  6 econom… BROW…      0.84      2.31         27.1          54.6             6   
##  7 econom… CALI…      0.99      2.31         56.4          83.3             4   
##  8 econom… CARN…      0.43      1.67         35.2          45.6             5.05
##  9 econom… CITY…      0.35      1.06         38.1          27.9             5.2 
## 10 econom… CLAR…      0.47      0.7          24.7          37.7             5.17
## # … with 402 more rows, and 9 more variables: PctMinorityFac <dbl>,
## #   PctFemaleFac <dbl>, PctFemaleStud <dbl>, PctIntlStud <dbl>, AvNumPhDs <dbl>,
## #   AvGREs <dbl>, TotFac <dbl>, PctAsstProf <dbl>, NumStud <dbl>

Your Turn: Genes experiment 🤔

## # A tibble: 3 x 12
##   id    `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 Gene…      2.18     2.20       4.20     2.63       5.06       4.54       5.53
## 2 Gene…      1.46     0.585      1.86     0.515      2.88       1.36       2.96
## 3 Gene…      2.03     0.870      3.28     0.533      4.63       2.18       5.56
## # … with 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>,
## #   `WM-12.R4` <dbl>

Melbourne weather 😨

## # A tibble: 1,593 x 12
##    X1             X2 X3    X4       X5    X9   X13   X17   X21   X25   X29   X33
##    <chr>       <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1 ASN00086282  1970 07    TMAX    141   124   113   123   148   149   139   153
##  2 ASN00086282  1970 07    TMIN     80    63    36    57    69    47    84    78
##  3 ASN00086282  1970 07    PRCP      3    30     0     0    36     3     0     0
##  4 ASN00086282  1970 08    TMAX    145   128   150   122   109   112   116   142
##  5 ASN00086282  1970 08    TMIN     50    61    75    67    41    51    48    -7
##  6 ASN00086282  1970 08    PRCP      0    66     0    53    13     3     8     0
##  7 ASN00086282  1970 09    TMAX    168   168   162   162   162   150   184   179
##  8 ASN00086282  1970 09    TMIN     19    29    62    81    81    55    73    97
##  9 ASN00086282  1970 09    PRCP      0     0     0     0     3     5     0    38
## 10 ASN00086282  1970 10    TMAX    189   194   204   267   256   228   237   144
## # … with 1,583 more rows

Tuberculosis notifications data taken from WHO 🤧

## # A tibble: 3,202 x 22
##    country  year new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 new_sp_m2534
##    <chr>   <dbl>      <dbl>       <dbl>       <dbl>        <dbl>        <dbl>
##  1 Afghan…  1997         NA          NA           0           10            6
##  2 Afghan…  1998         NA          NA          30          129          128
##  3 Afghan…  1999         NA          NA           8           55           55
##  4 Afghan…  2000         NA          NA          52          228          183
##  5 Afghan…  2001         NA          NA         129          379          349
##  6 Afghan…  2002         NA          NA          90          476          481
##  7 Afghan…  2003         NA          NA         127          511          436
##  8 Afghan…  2004         NA          NA         139          537          568
##  9 Afghan…  2005         NA          NA         151          606          560
## 10 Afghan…  2006         NA          NA         193          837          791
## # … with 3,192 more rows, and 15 more variables: new_sp_m3544 <dbl>,
## #   new_sp_m4554 <dbl>, new_sp_m5564 <dbl>, new_sp_m65 <dbl>, new_sp_mu <dbl>,
## #   new_sp_f04 <dbl>, new_sp_f514 <dbl>, new_sp_f014 <dbl>, new_sp_f1524 <dbl>,
## #   new_sp_f2534 <dbl>, new_sp_f3544 <dbl>, new_sp_f4554 <dbl>, new_sp_f5564 <dbl>,
## #   new_sp_f65 <dbl>, new_sp_fu <dbl>

French fries

10 week sensory experiment
12 individuals assessed taste of french fries on several scales (how potato-y, buttery, grassy, rancid, paint-y do they taste?)
fried in one of 3 different oils, replicated twice.

French fries: Variables? Observations?

## # A tibble: 696 x 9
##     time treatment subject   rep potato buttery grassy rancid painty
##    <dbl>     <dbl>   <dbl> <dbl>  <dbl>   <dbl>  <dbl>  <dbl>  <dbl>
##  1     1         1       3     1    2.9     0      0      0      5.5
##  2     1         1       3     2   14       0      0      1.1    0  
##  3     1         1      10     1   11       6.4    0      0      0  
##  4     1         1      10     2    9.9     5.9    2.9    2.2    0  
##  5     1         1      15     1    1.2     0.1    0      1.1    5.1
##  6     1         1      15     2    8.8     3      3.6    1.5    2.3
##  7     1         1      16     1    9       2.6    0.4    0.1    0.2
##  8     1         1      16     2    8.2     4.4    0.3    1.4    4  
##  9     1         1      19     1    7       3.2    0      4.9    3.2
## 10     1         1      19     2   13       0      3.1    4.3   10.3
## # … with 686 more rows

Rude Recliners data

data is collated from this story: 41% Of Fliers Think You're Rude If You Recline Your Seat
What are the variables?

## # A tibble: 3 x 6
##   V1         `V2:Always` `V2:Usually` `V2:About half the… `V2:Once in a wh… `V2:Never`
##   <chr>            <dbl>        <dbl>               <dbl>             <dbl>      <dbl>
## 1 No, not r…         124          145                  82               116         35
## 2 Yes, some…           9           27                  35               129         81
## 3 Yes, very…           3            3                  NA                11         54

Messy vs tidy

Messy data is messy in its own way. You can make unique solutions, but then another data set comes along, and you have to again make a unique solution.

Tidy data can be thought of as Lego bricks. Once you have this form, you can put it together in so many different ways, to make different analyses.

Data Tidying verbs

pivot_longer: Specify the names_to (identifiers) and the values_to (measures) to make longer form data.
pivot_wider: Variables split out in columns
separate: Split one column into many
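
A minimal sketch of the three verbs in action, assuming the example tables `table4a` and `table3` that ship with tidyr (using `table3` is an assumption for illustration, and the object names `cases_long`, `cases_wide` and `rates_split` are hypothetical):

```r
library(tidyr)

# pivot_longer: the year column names become values of a new "year" variable
cases_long <- pivot_longer(table4a,
                           cols = c("1999", "2000"),
                           names_to = "year",
                           values_to = "cases")

# pivot_wider: reverses the step, giving one column per year again
cases_wide <- pivot_wider(cases_long,
                          names_from = year,
                          values_from = cases)

# separate: table3 stores "cases/population" in a single column called rate,
# so split it into two columns at the "/"
rates_split <- separate(table3,
                        col = rate,
                        into = c("cases", "population"),
                        sep = "/")
```

Note that `separate` returns the new columns as character vectors unless asked to convert them.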
one more time: `pivot_longer`

pivot_longer(<DATA>,
             <COLS>,
             <NAMES_TO>,
             <VALUES_TO>)

cols: the columns to select are those that represent values, not variables.
names_to: the name of the new variable that will hold the current column names.
values_to: the name of the new variable that will hold the values currently spread over the cells.

pivot_longer: example

table4a
## # A tibble: 3 x 3
##   country     `1999` `2000`
## * <chr>        <int>  <int>
## 1 Afghanistan    745   2666
## 2 Brazil       37737  80488
## 3 China       212258 213766

table4a %>% 
  pivot_longer(cols = c("1999", "2000"),
               names_to = "year",
               values_to = "cases")
## # A tibble: 6 x 3
##   country     year   cases
##   <chr>       <chr>  <int>
## 1 Afghanistan 1999     745
## 2 Afghanistan 2000    2666
## 3 Brazil      1999   37737
## 4 Brazil      2000   80488
## 5 China       1999  212258
## 6 China       2000  213766

Tidying genes data

Tell me what to put in the following?

cols are the columns that represent values, not variables.
names_to is the name of the new variable whose values are the current column names.
values_to is the name of the new variable whose values are spread over the cells.

## # A tibble: 3 x 12
##   id    `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 Gene…      2.18     2.20       4.20     2.63       5.06       4.54       5.53
## 2 Gene…      1.46     0.585      1.86     0.515      2.88       1.36       2.96
## 3 Gene…      2.03     0.870      3.28     0.533      4.63       2.18       5.56
## # … with 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>,
## #   `WM-12.R4` <dbl>

Tidy genes data

genes
## # A tibble: 3 x 12
##   id    `WI-6.R1` `WI-6.R2` `WI-6.R4` `WM-6.R1` `WM-6.R2` `WI-12.R1` `WI-12.R2`
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>      <dbl>
## 1 Gene…      2.18     2.20       4.20     2.63       5.06       4.54       5.53
## 2 Gene…      1.46     0.585      1.86     0.515      2.88       1.36       2.96
## 3 Gene…      2.03     0.870      3.28     0.533      4.63       2.18       5.56
## # … with 4 more variables: `WI-12.R4` <dbl>, `WM-12.R1` <dbl>, `WM-12.R2` <dbl>,
## #   `WM-12.R4` <dbl>

genes_long <- genes %>% 
  pivot_longer(cols = -id,
               names_to = "variable",
               values_to = "expr")
genes_long
## # A tibble: 33 x 3
##    id     variable  expr
##    <chr>  <chr>    <dbl>
##  1 Gene 1 WI-6.R1   2.18
##  2 Gene 1 WI-6.R2   2.20
##  3 Gene 1 WI-6.R4   4.20
##  4 Gene 1 WM-6.R1   2.63
##  5 Gene 1 WM-6.R2   5.06
##  6 Gene 1 WI-12.R1  4.54
##  7 Gene 1 WI-12.R2  5.53
##  8 Gene 1 WI-12.R4  4.41
##  9 Gene 1 WM-12.R1  3.85
## 10 Gene 1 WM-12.R2  4.18
## # … with 23 more rows

Separate columns

genes_long
## # A tibble: 33 x 3
##    id     variable  expr
##    <chr>  <chr>    <dbl>
##  1 Gene 1 WI-6.R1   2.18
##  2 Gene 1 WI-6.R2   2.20
##  3 Gene 1 WI-6.R4   4.20
##  4 Gene 1 WM-6.R1   2.63
##  5 Gene 1 WM-6.R2   5.06
##  6 Gene 1 WI-12.R1  4.54
##  7 Gene 1 WI-12.R2  5.53
##  8 Gene 1 WI-12.R4  4.41
##  9 Gene 1 WM-12.R1  3.85
## 10 Gene 1 WM-12.R2  4.18
## # … with 23 more rows

genes_long %>%
  separate(col = variable, 
           into = c("trt", "leftover"),
           sep = "-")
## # A tibble: 33 x 4
##    id     trt   leftover  expr
##    <chr>  <chr> <chr>    <dbl>
##  1 Gene 1 WI    6.R1      2.18
##  2 Gene 1 WI    6.R2      2.20
##  3 Gene 1 WI    6.R4      4.20
##  4 Gene 1 WM    6.R1      2.63
##  5 Gene 1 WM    6.R2      5.06
##  6 Gene 1 WI    12.R1     4.54
##  7 Gene 1 WI    12.R2     5.53
##  8 Gene 1 WI    12.R4     4.41
##  9 Gene 1 WM    12.R1     3.85
## 10 Gene 1 WM    12.R2     4.18
## # … with 23 more rows

Separate columns

genes_long_tidy <- genes_long %>%
  separate(variable, 
           into = c("trt", "leftover"), 
           sep = "-") %>%
  separate(leftover, 
           into = c("time", "rep"), 
           sep = "\\.")

genes_long_tidy
## # A tibble: 33 x 5
##    id     trt   time  rep    expr
##    <chr>  <chr> <chr> <chr> <dbl>
##  1 Gene 1 WI    6     R1     2.18
##  2 Gene 1 WI    6     R2     2.20
##  3 Gene 1 WI    6     R4     4.20
##  4 Gene 1 WM    6     R1     2.63
##  5 Gene 1 WM    6     R2     5.06
##  6 Gene 1 WI    12    R1     4.54
##  7 Gene 1 WI    12    R2     5.53
##  8 Gene 1 WI    12    R4     4.41
##  9 Gene 1 WM    12    R1     3.85
## 10 Gene 1 WM    12    R2     4.18
## # … with 23 more rows
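
As an aside, the two `separate` calls can also be folded into the pivot step itself. This is an alternative to the approach shown on the slide, not the slide's own method. The sketch assumes a small hand-typed subset of the genes table (`genes_small` is a hypothetical name; the four values are copied from the Gene 1 rows printed above) and uses the `names_sep` argument of `pivot_longer` with a regular expression that splits at either `-` or `.`.

```r
library(tidyr)
library(tibble)

# Hypothetical hand-typed subset: Gene 1 only, four of its columns
genes_small <- tibble(
  id = "Gene 1",
  `WI-6.R1` = 2.18,
  `WI-6.R2` = 2.20,
  `WM-6.R1` = 2.63,
  `WI-12.R1` = 4.54
)

# Split each column name into treatment, time and replicate while pivoting
genes_small_tidy <- pivot_longer(
  genes_small,
  cols = -id,
  names_to = c("trt", "time", "rep"),
  names_sep = "[-.]",   # regular expression: split at "-" or "."
  values_to = "expr"
)

genes_small_tidy
```

The step-by-step version on the slide is easier to debug, because each intermediate table can be inspected; the one-step version is shorter once the pattern in the column names is understood.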

Demo: koala bilby data

Here is a little data to practice pivot_longer, pivot_wider and separate on.

## # A tibble: 5 x 5
##   ID    koala_NSW koala_VIC bilby_NSW bilby_VIC
##   <chr>     <dbl>     <dbl>     <dbl>     <dbl>
## 1 grey         23        43        11         8
## 2 cream        56        89        22        17
## 3 white        35        72        13         6
## 4 black        28        44        19        16
## 5 taupe        25        37        21        12

Exercise: koala bilby data

Read over koala-bilby.Rmd
pivot_longer the data into long form, naming the two new variables, label and count
Separate the labels into two new variables, animal, state
pivot_wider the long form data into wide form, where the columns are the states. 
pivot_wider the long form data into wide form, where the columns are the animals. 
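
One possible way to work through the steps above is sketched below. It assumes the data is typed in by hand from the table printed on the previous slide rather than read from the file used in koala-bilby.Rmd, and the object names (`koala_bilby`, `kb_long`, `kb_tidy`, `kb_by_state`, `kb_by_animal`) are hypothetical.

```r
library(tidyr)
library(tibble)

# Hand-typed copy of the koala bilby table shown earlier
koala_bilby <- tibble(
  ID = c("grey", "cream", "white", "black", "taupe"),
  koala_NSW = c(23, 56, 35, 28, 25),
  koala_VIC = c(43, 89, 72, 44, 37),
  bilby_NSW = c(11, 22, 13, 19, 21),
  bilby_VIC = c(8, 17, 6, 16, 12)
)

# Long form, with the old column names in `label` and the cells in `count`
kb_long <- pivot_longer(koala_bilby,
                        cols = -ID,
                        names_to = "label",
                        values_to = "count")

# Split "koala_NSW" style labels into animal and state
kb_tidy <- separate(kb_long,
                    col = label,
                    into = c("animal", "state"),
                    sep = "_")

# Wide form with one column per state
kb_by_state <- pivot_wider(kb_tidy,
                           id_cols = c(ID, animal),
                           names_from = state,
                           values_from = count)

# Wide form with one column per animal
kb_by_animal <- pivot_wider(kb_tidy,
                            id_cols = c(ID, state),
                            names_from = animal,
                            values_from = count)
```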
Exercise 1: Rude Recliners

Open rude-recliners.Rmd
This contains data from the article 41% Of Fliers Think You're Rude If You Recline Your Seat.
V1 is the response to question: "Is it rude to recline your seat on a plane?"
V2 is the response to question: "Do you ever recline your seat when you fly?".

## # A tibble: 3 x 6
##   V1         `V2:Always` `V2:Usually` `V2:About half the… `V2:Once in a wh… `V2:Never`
##   <chr>            <dbl>        <dbl>               <dbl>             <dbl>      <dbl>
## 1 No, not r…         124          145                  82               116         35
## 2 Yes, some…           9           27                  35               129         81
## 3 Yes, very…           3            3                  NA                11         54

Exercise 1: Rude Recliners (15 minutes)

Answer the following questions in the rude-recliners.Rmd rmarkdown document.

1A) What are the variables and observations in this data?
1B) Put the data in tidy long form (using the names V2 as the key variable, and count as the value).
1C) Use the rename function to make the variable names a little shorter.

Exercise 1: Answers

Your Turn: Turn to the people next to you and ask 2 questions:

Are you more of a dog or a cat person?
What languages do you know how to speak?

Exercise 2: Tuberculosis Incidence data (15 minutes)

Open: tb-incidence.Rmd

Tidy the TB incidence data, using the Rmd to prompt questions.

Exercise 3: Currency rates (15 minutes)

Open currency-rates.Rmd
Read in rates.csv
Answer the following questions:
What are the variables and observations?
pivot_longer the five currencies, AUD, GBP, JPY, CNY, CAD, to make it into tidy long form.
Make line plots of the currencies, describe the similarities and differences between the currencies.

Exercise 4: Australian Airport Passengers (optional!)

Open oz-airport.Rmd
Contains data from the Department of Infrastructure, Regional Development and Cities web site on Airport Traffic Data 1985–86 to 2017–18.
Read the dataset into R, naming it passengers
Tidy the data, to produce a data set with these columns
- airport: all of the airports.
- year
- type_of_flight: DOMESTIC, INTERNATIONAL
- bound: IN or OUT

Recap

Traffic Light System: Green = "good!" ; Red = "Help!"
R + Rstudio
Functions are _
Columns in data frames are accessed with _ ?

If you have questions, place a red sticky note on your laptop.

If you are done, place a green sticky on your laptop

Lab quiz

Time to take the lab quiz.

A note on `pivot_wider` and `pivot_longer`, `gather` and `spread`

(Not needed to know for the course, but nice to know)

Naming things is hard
There are many ways to do the same thing in R
You might have come across pivot_ functions as spread or gather. These are still valid, but have been improved upon in the latest version of the tidyr package.
You can read more about this change here:
- tidyverse blog post
- tidyr vignette
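
To make the correspondence concrete, here is a small sketch, assuming tidyr's built-in `table4a`, that does the same reshaping with the older and the newer functions (the object names are hypothetical). The two long results hold the same rows, though `gather` stacks them column by column while `pivot_longer` works row by row, so the row order differs.

```r
library(tidyr)

# Older interface: gather() makes the data longer
old_long <- gather(table4a, key = "year", value = "cases", `1999`, `2000`)

# Newer interface: pivot_longer() does the same job
new_long <- pivot_longer(table4a,
                         cols = c(`1999`, `2000`),
                         names_to = "year",
                         values_to = "cases")

# Older interface: spread() makes the data wider again
old_wide <- spread(old_long, key = year, value = cases)

# Newer interface: pivot_wider()
new_wide <- pivot_wider(new_long, names_from = year, values_from = cases)
```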

66/67

COVID19 references

67/67

That's it!

Lecturer: Nicholas Tierney

Department of Econometrics and Business Statistics
ETC1010.Clayton-x@monash.edu

16th Mar 2020

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

67/67

Now let's use `pivot_wider` to examine different aspects

Examine treatments against each other

genes_long_tidy %>%
  pivot_wider(
    id_cols = c(id, rep, time),
    names_from = trt, 
    values_from = expr
    ) %>%
  ggplot(aes(x = WI, 
             y = WM, 
             colour = id)) + 
  geom_point()

Generally, there is some negative association within each gene: WM is low if WI is high.

Examine replicates against each other

genes_long_tidy %>%
  pivot_wider(id_cols = c(id, trt, time),
              names_from = rep, 
              values_from = expr) %>%
  ggplot(aes(x = R1, y = R4, colour = id)) + 
  geom_point() + coord_equal()

Roughly, replicate 4 is like replicate 1, e.g. if one is low, the other is low.

It's a good thing that the replicates are fairly similar.
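
To see what these pivot_wider calls do to the shape of the data, here is a small sketch on an invented table that borrows the column names used above (id, trt, time, rep, expr). The gene label, time value and expression values are made up, and `genes_toy` is a hypothetical stand-in, not the lecture's genes_long_tidy.

```r
library(tidyr)
library(dplyr)

# Invented long-form data: one gene, two treatments, two replicates
genes_toy <- tibble(
  id   = c("gene_1", "gene_1", "gene_1", "gene_1"),
  trt  = c("WI", "WM", "WI", "WM"),
  time = c(6, 6, 6, 6),
  rep  = c("R1", "R1", "R2", "R2"),
  expr = c(2.1, 1.3, 2.4, 1.1)
)

# Spread the treatments into their own columns so that WI and WM
# sit side by side for each id, replicate and time
genes_wide <- genes_toy %>%
  pivot_wider(
    id_cols = c(id, rep, time),
    names_from = trt,
    values_from = expr
  )
```

`genes_wide` has one row per replicate, with a WI and a WM column that can be mapped to the x and y axes of a scatterplot.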

ETC1010: Introduction to Data Analysis

Week 2, part A

Week of Tidy Data
