ETC1010: Introduction to Data Analysis

Week 1, part B


Week of introduction

Lecturer: Nicholas Tierney

Department of Econometrics and Business Statistics

ETC1010.Clayton-x@monash.edu

11th Mar 2020



From Jessica Ward (@JKRWard) of R Ladies Newcastle (UK) - @RLadiesNCL https://twitter.com/RLadiesNCL/status/1138812826917724160


R essentials: A short list (for now)

  • Functions are (most often) verbs, followed by what they will be applied to in parentheses:
do_this(to_this)
  • Columns (variables) in data frames are accessed with $:
dataframe$var_name
  • Packages are installed with install.packages(), and loaded with library(), once per session:
install.packages("package_name")
library(package_name)
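
A minimal sketch of the three essentials above, using the built-in airquality dataset (the same one used later in these slides). The choice of the Temp column and of the stats package is illustrative only and is not part of the original list.

```r
# A function is a verb applied to whatever is inside the parentheses
n_rows <- nrow(airquality)

# `$` pulls a single column (variable) out of a data frame
temps <- airquality$Temp
mean_temp <- mean(temps)

# library() loads an installed package; stats ships with R, so no
# install.packages() call is needed for this illustration
library(stats)

n_rows
mean_temp
```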

Today: Outline

  • Why we care about Reproducibility
  • R + markdown = Rmarkdown
  • Controlling output and input of rmarkdown
  • Exercises on creating rmarkdown reports on the humble platypus
  • Form up assignment groups
  • Quiz
  • Assignment 1 released next week.

We are in a tight spot with reproducibility

Only 6 out of 53 landmark results could be reproduced

-- Amgen, 2014*

An estimated 75% - 90% of preclinical results cannot be reproduced

-- Begley, 2015*

Estimated annual cost of irreproducibility for biomedical industry = 28 Billion USD

-- Freedman, 2015*

* Heard via Garrett Grolemund's great talk


So what can we do about it?


Reproducibility checklist

Near-term goals:

  • Are the tables and figures reproducible from the code and data?
  • Does the code actually do what you think it does?
  • In addition to what was done, is it clear why it was done? (e.g., how were parameter settings chosen?)

Long-term goals:

  • Can the code be used for other data?
  • Can you extend the code to do other things?

Literate programming is a partial solution

  • Literate programming shines some light on this dark area of science.

  • An idea from Donald Knuth where you combine your text with your code output to create a document.

  • A blend of your literature (text), and your programming (code), to create something you can read from top to bottom.


So

Imagine a report:

Introduction, methods, results, discussion, and conclusion,

All the bits of code that make each section.

With rmarkdown, you can see all the pieces of your data analysis all together.

Each time you knit, the analysis is run from the beginning.


Markdown as a new player to legibility

In 2004, John Gruber of Daring Fireball created Markdown, a simple way to write text that renders into an HTML webpage.

Markdown as a new player to legibility

- bullet list
- bullet list
- bullet list
  • bullet list
  • bullet list
  • bullet list

Markdown as a new player to legibility

1. numbered list
2. numbered list
3. numbered list
__bold__, **bold**,
_italic_, *italic*
> quote of something profound
  1. numbered list
  2. numbered list
  3. numbered list

bold, bold,

italic, italic

quote of something profound


Markdown as a new player to legibility

With very little markup, we can create rich text that actually resembles the text we want to see.

Learn to use markdown: spend five minutes working through markdowntutorial.com

This is the end of part 1 of lecture 1b

Start of part 2 of lecture 1b


Rmarkdown helps address the reproducibility problem

  • Q: How do we combine markdown + R code into a "literate programming environment"?

  • A: Rmarkdown

Rmarkdown...

Provides an environment where you can write your complete analysis, and marries your text and code together into a rich document.

You write your code as code chunks, put your text around that, and then hey presto, you have a document you can reproduce.


Reminder: You've already used rmarkdown!


How will we use R Markdown?

  • Every assignment and project is an R Markdown document
  • You'll always have a template R Markdown document to start with
  • The amount of scaffolding in the template will decrease over the semester
  • These lecture notes are created using R Markdown (!)

The anatomy of an rmarkdown document

There are three parts to an rmarkdown document.

  • Metadata (YAML)
  • Text (markdown formatting)
  • Code (code formatting)

DEMO


Metadata: YAML (YAML Ain't Markup Language)

  • The metadata of the document tells you how it is formed - what the title is, what date to put, and other control information.

  • If you're familiar with LaTeX, this is similar to how you specify document type, styles, fonts, options, etc. in the front matter / preamble.


Metadata: YAML

  • Rmarkdown documents use YAML to provide the metadata. It looks like this:
---
title: "An example document"
author: "Nicholas Tierney"
output: html_document
---

It starts and ends with three dashes (---), and has fields such as title, author, and output.


Text

Text is written in markdown, as we discussed in the earlier section.

It provides a simple way to mark up text:

1. numbered list
2. numbered list
3. numbered list
  1. numbered list
  2. numbered list
  3. numbered list

Code

We refer to code in an rmarkdown document in two ways:

  1. Code chunks, and
  2. Inline code.

Code: Code chunks

Code chunks are marked by three backticks and curly braces with r inside them:

```{r chunk-name}
# a code chunk
```
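
To see the three parts (metadata, text, code) side by side, the sketch below writes a minimal R Markdown file from R. The body text, the chunk name summarise-airquality, and the temporary file location are made up for illustration; in practice you would type these lines straight into an .Rmd file in RStudio.

```r
# Lines of a minimal .Rmd file: metadata, then text, then a code chunk
rmd_lines <- c(
  "---",                                   # YAML metadata starts
  "title: \"An example document\"",
  "output: html_document",
  "---",                                   # YAML metadata ends
  "",
  "Some **markdown** text describing the analysis.",
  "",
  "```{r summarise-airquality}",           # named code chunk starts
  "summary(airquality)",
  "```"                                    # code chunk ends
)

# Write the lines to a temporary file (hypothetical location)
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(rmd_lines, rmd_file)

# Print the file back to check what was written
cat(readLines(rmd_file), sep = "\n")
```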

Code: backtick?

A backtick (`) is a special character you might not have seen before. It typically shares a key with the tilde (~); on USA / Australian keyboards, that key is under the escape key.

Code: Inline code

Sometimes you want to run the code inside a sentence. This is called running the code "inline".

You might want to run the code inline to name the number of variables or rows in a dataset in a sentence like:

There are XXX observations in the airquality dataset, and XXX variables.

Code: Inline code

You can call code "inline" like so:

There are `r nrow(airquality)` observations in the airquality dataset,
and `r ncol(airquality)` variables.

Which gives you the following sentence:

There are 153 observations in the airquality dataset, and 6 variables.
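
Inline code is ordinary R that is evaluated and dropped into the sentence when you knit. As a rough check outside of a document, the same sentence can be built by hand in the console; paste0() here only stands in for what knitting does and is not part of the rmarkdown syntax.

```r
# The two expressions used inline in the sentence above
n_obs <- nrow(airquality)
n_vars <- ncol(airquality)

# Assemble the sentence the way the knitted document would read
sentence <- paste0(
  "There are ", n_obs, " observations in the airquality dataset, and ",
  n_vars, " variables."
)
print(sentence)
```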

Code: Inline code

  • If your data changes upstream

  • You don't need to work out where you mentioned your data

  • You just update the document. 🎉


Your Turn: Put it together

Go to rstudio.cloud and go to "ida-exercise-1b"

  • open the document "01-oz-atlas.Rmd"
  • knit the document
  • Change the data section at the top to be from a different state instead of "New South Wales"
  • knit the document again
  • How do the text and figures in the document change?

End of part 2 of Lecture 1B

Make sure you finish the exercise on rstudio.cloud


Start of part 3 of Lecture 1B


Code: Chunk names

Straight after the ```{r you can use a text string to name the chunk:

```{r read-crime-data}
library(readr) # provides read_csv()
crime <- read_csv("data/crime-data.csv")
```

Code: Chunk names

Naming code chunks has three advantages:

  1. Navigate to specific chunks using the drop-down code navigator in the bottom-left of the script editor.
  2. Graphics produced by chunks now have useful names.
  3. You can set up networks of cached chunks to avoid re-performing expensive computations on every run.

Code: Chunk names

Every chunk should ideally have a name.

Naming things is hard, but follow these rules and you'll be fine:

  1. One word that describes the action (e.g., "read")
  2. One word that describes the thing inside the code (e.g., "gapminder")
  3. Separate words with "-" (e.g., read-gapminder)

Code: Chunk options

You can control how the code is run and displayed by changing the code chunk options, which follow the chunk name.

```{r read-gapminder, eval = FALSE, echo = TRUE}
gap <- read_csv("gapminder.csv")
```

What do you think this does?


Code: Chunk options

The code chunk options you need to know about right now are:

  • cache: TRUE / FALSE. Do you want to save the output of the chunk so it doesn't have to run next time?
  • eval: TRUE / FALSE. Do you want to evaluate the code?
  • echo: TRUE / FALSE. Do you want to print the code?
  • include: TRUE / FALSE. Do you want to include the chunk (code and output) in the final output document? Setting to FALSE means nothing is put into the output document, but the code is still run.

You can read more about the options at the official documentation: https://yihui.name/knitr/options/#code-evaluation
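
One way to see what eval and echo do is to knit a tiny document held in a character vector. This is a sketch that assumes the knitr package is installed; the chunk names show-only and run-only and the arithmetic inside them are made up for illustration.

```r
# Two chunks with opposite settings (chunk names are hypothetical)
src <- c(
  "```{r show-only, eval = FALSE, echo = TRUE}",
  "x <- 1 + 1",
  "```",
  "",
  "```{r run-only, eval = TRUE, echo = FALSE}",
  "2 + 3",
  "```"
)

# Given text instead of a file, knit() returns the knitted markdown
out <- knitr::knit(text = src, quiet = TRUE)
cat(out)
```

In the printed result, the first chunk's code appears but is never run, while the second chunk's code is hidden and only its result is shown.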


Your turn

  • Go to rstudio.cloud, open the document 01-oz-atlas.Rmd, and change it so that the code output is hidden, but the graphics are shown. (Hint: Google "rstudio rmarkdown cheatsheet" for some tips!)
  • Re-knit the document.
  • Take a look at the R Markdown Gallery.

End of Part 3 of Lecture 1B


Start of Part 4 of Lecture 1B


Global options: Set and forget

You can set the default chunk behaviour once at the top of the .Rmd file using a chunk like:

knitr::opts_chunk$set(
  echo = FALSE,
  cache = TRUE
)

Then you will only need to add chunk options to the occasional chunk that you'd like to behave differently.
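
A small sketch of a global default with one chunk overriding it, assuming the knitr package is installed. The chunk names and the arithmetic are made up for illustration, and cache = TRUE is left out here only to avoid writing cache files to disk.

```r
# A tiny document: a setup chunk sets the default, one chunk overrides it
src <- c(
  "```{r setup, include = FALSE}",
  "knitr::opts_chunk$set(echo = FALSE)",
  "```",
  "",
  "```{r hidden-code}",
  "1 + 1",
  "```",
  "",
  "```{r shown-code, echo = TRUE}",
  "2 + 2",
  "```"
)

# Knit the text and print the resulting markdown
out <- knitr::knit(text = src, quiet = TRUE)
cat(out)
```

The hidden-code chunk follows the global default, so only its result appears; the shown-code chunk sets echo = TRUE itself, so its code is printed as well.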


Your turn

  • Go to your 01-oz-atlas.Rmd document on rstudio.cloud and change the global settings at the top of the rmarkdown document to echo = FALSE, and cache = TRUE
knitr::opts_chunk$set(
  echo = FALSE,
  cache = TRUE
)
  • Update the other code chunks by removing the code chunk options.

End of part 4 of lecture 1b


Start of part 5 of lecture 1b


DEMO

The many different outputs of rmarkdown


Your turn: Different types of documents

  1. Change the output of your current R Markdown file to produce a Word document. Now try to produce a PDF - this may not work! That's OK, we don't need it right now.
  2. Create a new document that will produce a slide show: File > New R Markdown > Presentation
  3. Create a flexdashboard document - see this option in the File > New R Markdown > From template list.
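
Besides editing the output field in the YAML, the format can also be chosen when rendering from the console. A minimal sketch, assuming the rmarkdown package and pandoc are installed; the document, its title, and the temporary file are made up for illustration.

```r
# Hypothetical minimal document written to a temporary file
rmd_file <- tempfile(fileext = ".Rmd")
writeLines(c(
  "---",
  "title: \"Output format demo\"",
  "output: html_document",
  "---",
  "",
  "Some text."
), rmd_file)

# Render to Word instead of the format named in the YAML;
# only attempted if rmarkdown and pandoc are both available
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  out_file <- rmarkdown::render(
    rmd_file,
    output_format = "word_document",
    quiet = TRUE
  )
  print(basename(out_file))
}
```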

Recap:

  • There is a Reproducibility Crisis
  • rmarkdown = YAML + text + code
  • rmarkdown has many different output types
  • Platypus are interesting!
  • Assignment will be announced next week

Learning more:

  • Keep the R Markdown cheat sheet and the Markdown Quick Reference (Help -> Markdown Quick Reference) handy; we'll refer to them often as the course progresses

Your Turn

  • Go to rstudio.cloud and open oz-atlas-final.Rmd
  • Follow prompts and questions to learn more about the Australian native platypus - (copy content over from previous doc)
  • Take the Ed lab quiz for today.



That's it!



This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

  • Reproducibility: Why we care
  • Rmarkdown
    • YAML
    • Code
    • text
  • markdown (online quiz)
  • rmarkdown - edit the existing one on platypus!
  • code chunks
  • code chunk names
  • chunk options
  • exercise on this
  • setting different chunk options globally
  • exercise extending the platypus report
  • make assignment groups
  • release assignment
  • quiz

Should be able to answer the questions:

  • How should I start an rmarkdown document?
  • What do I put in the YAML metadata?
  • How do I create a code chunk?
  • What sort of options do I need to worry about for my code?
  • What is the value in a reproducible report?
  • What is markdown?
  • Can I combine my software and my writing?

