

ETC1010: Introduction to Data Analysis

Week 5, part B


Week of introduction

Lecturer: Nicholas Tierney

Department of Econometrics and Business Statistics

ETC1010.Clayton-x@monash.edu

April 2020


1/72

While the song is playing...

Draw a mental model / concept map of last lecture's content on Missing Data.

2/72
3/72

Overview

  • Different file formats
    • Audio / binary
  • Web data
    • responsible scraping
    • scraping
    • JSON
4/72

Recap on some tricky topics

  • pipes %>% ("then")
  • assignment <- ("gets")
5/72

The pipe operator: %>%

Little bunny Foo Foo
Went hopping through the forest
Scooping up the field mice
And bopping them on the head

6/72

Approach: Intermediate steps

foo_foo_1 <- hop(foo_foo, through = forest)
foo_foo_2 <- scoop(foo_foo_1, up = field_mice)
foo_foo_3 <- bop(foo_foo_2, on = head)
  • Main downside: forces you to name each intermediate element.
  • Sometimes these steps form natural names. If this is the case - go ahead.
  • But often there are no natural names.
  • Adding number suffixes to make the names unique leads to problems.
  • Code is cluttered with unimportant names
  • Suffix has to be carefully incremented on each line.
  • I've done this!
  • 99% of the time I miss a number somewhere, and there goes my evening ... debugging my code.
8/72

Another Approach: Overwrite the original

foo_foo <- hop(foo_foo, through = forest)
foo_foo <- scoop(foo_foo, up = field_mice)
foo_foo <- bop(foo_foo, on = head)
  • Overwrite originals instead of creating intermediate objects
  • Less typing (and less thinking). Less likely to make mistakes?
  • Painful debugging: need to re-run the code from the top.
  • Repetition of the object name (foo_foo is written 6 times!) obscures what changes.
9/72

(Yet) Another approach: function composition

bop(
  scoop(
    hop(foo_foo, through = forest),
    up = field_mice
  ),
  on = head
)
  • You need to read inside-out, and right-to-left.
  • Arguments are spread far apart
  • Harder to read
10/72

Pipe %>% can help!

f(x)

g(f(x))

h(g(f(x)))

x %>% f()

x %>% f() %>% g()

x %>% f() %>% g() %>% h()
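
The equivalence between the nested and piped forms can be checked with real functions. This is a minimal sketch, assuming the magrittr package is installed; the vector x and the choice of sqrt(), mean() and round() are illustrative and not from the lecture.

```r
library(magrittr)  # provides the %>% pipe

x <- c(1, 4, 9, 16)  # made-up input

# Nested form, h(g(f(x))): read inside-out
nested <- round(mean(sqrt(x)), 1)

# Piped form, x %>% f() %>% g() %>% h(): read left-to-right
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both forms call the same functions in the same order; only the reading direction changes.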

11/72

Solution: Use the pipe - %>%

foo_foo %>%
  hop(through = forest) %>%
  scoop(up = field_mice) %>%
  bop(on = head)
  • Focuses on verbs, not nouns.
  • Can be read as a series of actions, applied one function after another.

Foo Foo hops, then scoops, then bops.
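
The Foo Foo code on these slides is pseudocode: hop(), scoop() and bop() are not real functions. A minimal runnable sketch follows, assuming hypothetical stand-ins that simply record each action in a list; the helper act() and the list structure are inventions for illustration.

```r
library(magrittr)  # provides the %>% pipe

# Hypothetical helper: append a description of an action to the bunny's log
act <- function(bunny, what) {
  bunny$log <- c(bunny$log, what)
  bunny
}

# Hypothetical stand-ins for the verbs in the rhyme
hop   <- function(bunny, through) act(bunny, paste("hopped through the", through))
scoop <- function(bunny, up)      act(bunny, paste("scooped up the", up))
bop   <- function(bunny, on)      act(bunny, paste("bopped them on the", on))

foo_foo <- list(name = "Foo Foo", log = character())

result <- foo_foo %>%
  hop(through = "forest") %>%
  scoop(up = "field mice") %>%
  bop(on = "head")

result$log  # one entry per verb, in the order the pipe applied them
```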

12/72

Assignment <-

"gets"

13/72

Assignment

We can perform calculations in R:

1 + 1
read_csv("data.csv")
14/72

Assignment

But what if we want to use that information later?

1 + 1
read_csv("data.csv")
15/72

Assignment

We can assign these things to an object using <-

This reads as "gets".

x <- 1 + 1
my_data <- read_csv("data.csv")
  • x 'gets' 1+1
  • my_data 'gets' the output of read_csv...
16/72

Assignment

Then we can use those things in other calculations

x <- 1 + 1
my_data <- read_csv("data.csv")
x * x
my_data %>%
  select(age, height, weight) %>%
  mutate(bmi = weight / height^2)
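
Since data.csv is not provided here, the same idea can be tried on a small hand-made table. This is a sketch, assuming dplyr is installed; the values are made up, the extra name column is hypothetical, and height is assumed to be in metres and weight in kilograms.

```r
library(dplyr)

# Made-up stand-in for the contents of data.csv
my_data <- tibble(
  name   = c("a", "b"),
  age    = c(30, 45),
  height = c(1.70, 1.80),
  weight = c(65, 81)
)

# with_bmi "gets" the result of the pipeline, so it can be reused later
with_bmi <- my_data %>%
  select(age, height, weight) %>%
  mutate(bmi = weight / height^2)

with_bmi
```

Without the assignment the result would only be printed; with it, with_bmi can feed into further calculations.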
17/72

Take 3 minutes to discuss these two concepts with your breakout room

  • What are pipes (%>%)?
  • What is assignment (<-)?
18/72

The many shapes and sizes of data

19/72

Data as an audio file

## # A tibble: 50,001 x 3
##        t  left right
##    <int> <int> <int>
##  1     1    28    29
##  2     2    27    28
##  3     3    26    24
##  4     4    24    27
##  5     5    22    18
##  6     6    15    19
##  7     7    15    13
##  8     8    12    13
##  9     9    15    16
## 10    10    18    16
## # … with 49,991 more rows
20/72

Plotting audio data?

21/72

Compare left and right channels

22/72

Oh, same sound is on both channels! A tad drab.

Compute statistics

df_wavs %>%
  filter(channel == "left") %>%
  group_by(word) %>%
  summarise(
    m = mean(value),
    s = sd(value),
    mx = max(value),
    mn = min(value)
  ) %>%
  as_table()
word      m         s    mx      mn
data  0.004  1602.577  8393  -15386
word  0.009  1506.626  6601  -11026
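
The audio data itself is not included in these notes, so the summary can be tried on a tiny made-up table with the same column names (word, channel, value). This is a sketch, assuming dplyr is installed; the values are invented and the final as_table() display step from the slide is left out.

```r
library(dplyr)

# Made-up stand-in for df_wavs: two words, two channels, a few samples each
df_wavs <- tibble(
  word    = rep(c("data", "word"), each = 4),
  channel = rep(c("left", "right"), times = 4),
  value   = c(10, 12, -20, -18, 5, 7, -15, -13)
)

stats <- df_wavs %>%
  filter(channel == "left") %>%  # keep one channel only
  group_by(word) %>%             # one summary row per word
  summarise(
    m  = mean(value),
    s  = sd(value),
    mx = max(value),
    mn = min(value)
  )

stats
```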
23/72

Di's music

## artist type lvar lave lmax
## Dancing Queen Abba Rock 17600755.6 -90.0068667 29921
## Knowing Me Abba Rock 9543020.9 -75.7667187 27626
## Take a Chance Abba Rock 9049481.5 -98.0629244 26372
## Mamma Mia Abba Rock 7557437.3 -90.4710616 28898
## Lay All You Abba Rock 6282285.6 -88.9526309 27940
## Super Trouper Abba Rock 4665866.7 -69.0208450 25531
## I Have A Dream Abba Rock 3369670.3 -71.6828788 14699
## The Winner Abba Rock 1135862.0 -67.8190467 8928
## Money Abba Rock 6146942.6 -76.2807500 22962
## SOS Abba Rock 3482881.8 -74.1300038 15517
## V1 Vivaldi Classical 3677296.2 66.6530970 24229
## V2 Vivaldi Classical 771491.8 21.6560941 6936
## V3 Vivaldi Classical 5227573.2 88.6465556 17721
## V4 Vivaldi Classical 334719.1 13.8318847 4123
## V5 Vivaldi Classical 836904.9 34.5677377 17306
## V6 Vivaldi Classical 13936436.7 216.2317586 30355
## V7 Vivaldi Classical 3636324.3 9.8366363 21450
## V8 Vivaldi Classical 295397.4 2.8143227 2985
## V9 Vivaldi Classical 4335879.0 10.9015767 22271
## V10 Vivaldi Classical 472630.0 3.8890862 8194
## M1 Mozart Classical 2819795.0 -5.8667602 14939
## M2 Mozart Classical 2836957.9 -5.6074580 13382
## M3 Mozart Classical 9089372.1 -5.9719205 25265
## M4 Mozart Classical 4056229.7 -6.2272904 21328
## M5 Mozart Classical 1568925.6 -6.1993790 15839
## M6 Mozart Classical 7758409.1 -5.7700183 22496
## All in a Days Work Eels Rock 40275677.4 -10.6893416 32759
## Saturday Morning Eels Rock 129472199.3 50.2115773 32759
## The Good Old Days Eels Rock 18838849.0 -0.0222764 30386
## Love of the Loveless Eels Rock 43201194.3 1.8926897 32759
## Girl Eels Rock 88547131.0 0.3358761 32744
## Agony Eels Rock 16285811.4 -0.1405876 30106
## Rock Hard Times Eels Rock 54651552.8 1.9848514 32759
## Restraining Eels Rock 12322434.4 1.0259790 23221
## Lone Wolf Eels Rock 63878899.8 3.3592505 32759
## Wrong About Bobby Eels Rock 43668186.9 -2.0125667 32759
## Love Me Do Beatles Rock 28806811.2 -5.7452189 24159
## I Want to Hold Your Hand Beatles Rock 61257693.6 -6.0340682 28502
## Cant Buy Me Love Beatles Rock 76729438.0 -5.9583971 30102
## I Feel Fine Beatles Rock 52497242.7 -5.7314591 29911
## Ticket to Ride Beatles Rock 68104547.0 -6.1449114 30415
## Help Beatles Rock 52569372.5 -5.7166301 32318
## Yesterday Beatles Rock 23080907.6 -6.0298337 28169
## Yellow Submarine Beatles Rock 39908667.5 -6.2616915 29061
## Eleanor Rigby Beatles Rock 18819753.2 -6.1265193 21680
## Penny Lane Beatles Rock 58614798.7 -5.9971242 31131
## B1 Beethoven Classical 8368952.9 -0.9538330 26645
## B2 Beethoven Classical 293608.3 -0.1247094 4554
## B3 Beethoven Classical 8051764.6 -0.3316964 24194
## B4 Beethoven Classical 23493873.6 -0.9411538 32766
## B5 Beethoven Classical 1640232.8 1.3899979 20877
## B6 Beethoven Classical 343973.1 -2.4748955 9225
## B7 Beethoven Classical 3644784.2 -1.0426907 24633
## B8 Beethoven Classical 15030950.3 -1.4394652 26066
## The Memory of Trees Enya New wave 1135493.1 -10.6183398 9994
## Anywhere Is Enya New wave 12230252.2 -17.8372700 24968
## Pax Deorum Enya New wave 1723627.9 -6.8327065 13227
## Waterloo Abba Rock 24898675.9 -93.9961871 29830
## V11 Vivaldi Classical 1879989.2 12.7213373 8601
## V12 Vivaldi Classical 737349.6 5.7190022 7089
## V13 Vivaldi Classical 2865979.9 21.4467629 17282
## Hey Jude Beatles Rock 8651854.1 -6.1322408 18509
## lfener lfreq
## Dancing Queen 105.92095 59.57379
## Knowing Me 102.83616 58.48031
## Take a Chance 102.32488 124.59397
## Mamma Mia 101.61648 48.76513
## Lay All You 100.30076 74.02039
## Super Trouper 100.24848 81.40140
## I Have A Dream 104.59686 305.18689
## The Winner 104.34921 277.66056
## Money 102.24066 165.15799
## SOS 104.36243 146.73700
## V1 99.25243 329.53792
## V2 104.36737 843.83240
## V3 104.61255 165.76781
## V4 104.35005 293.99972
## V5 88.82821 198.38305
## V6 104.82354 198.46716
## V7 100.52727 877.77243
## V8 108.28717 58.41722
## V9 101.32881 176.53441
## V10 98.83731 526.04942
## M1 102.25358 342.26017
## M2 102.04646 511.85517
## M3 102.23796 429.27618
## M4 97.59887 343.75319
## M5 99.47580 288.44819
## M6 101.66942 459.24182
## All in a Days Work 106.07617 65.48281
## Saturday Morning 114.00229 41.40515
## The Good Old Days 105.45611 165.24210
## Love of the Loveless 108.37835 174.64185
## Girl 112.00916 392.28702
## Agony 103.94171 312.06322
## Rock Hard Times 108.87503 312.37864
## Restraining 103.29906 185.59771
## Lone Wolf 109.53291 66.13469
## Wrong About Bobby 108.35559 98.83404
## Love Me Do 109.95808 126.50757
## I Want to Hold Your Hand 111.81813 294.67263
## Cant Buy Me Love 112.71792 100.20089
## I Feel Fine 110.66333 110.39972
## Ticket to Ride 111.59957 107.30853
## Help 110.34618 137.10594
## Yesterday 107.30541 173.50631
## Yellow Submarine 109.47442 91.49508
## Eleanor Rigby 108.65894 164.92667
## Penny Lane 111.18901 92.46240
## B1 101.74095 354.12025
## B2 103.26160 316.03761
## B3 101.20132 397.18666
## B4 106.18220 529.26679
## B5 94.59029 445.00551
## B6 93.38874 331.68283
## B7 97.86394 181.01349
## B8 106.39731 249.08280
## The Memory of Trees 102.16132 155.06430
## Anywhere Is 105.75748 79.31957
## Pax Deorum 101.86845 49.01748
## Waterloo 107.73299 146.04306
## V11 105.81750 58.83780
## V12 102.92123 175.94562
## V13 102.11314 61.44533
## Hey Jude 83.88195 219.53773
24/72

Plot Di's music

25/72

Plot Di's Music

Abba is just different from everyone else!

26/72

Question time:

  • "How does the data appear different from the statistics in the time series?"
  • "What format is the data in an audio file?"
  • "How is Abba different from the other music clips?"
27/72

Why look at audio data?

  • Data comes in many shapes and sizes
  • Audio data can be transformed ("rectangled") into a data.frame
  • Another type of data is data on the web.
  • Extracting data from websites is called "web scraping".
28/72

Scraping the web: what? why?

  • An increasing amount of data is available on the web.
  • These data are provided in an unstructured format: you can always copy and paste, but it's time-consuming and prone to errors.
  • Web scraping is the process of extracting this information automatically and transforming it into a structured dataset.
29/72

Scraping the web: what? why?

  1. Screen scraping: extract data from the source code of a website, with an HTML parser (easy) or regular expression matching (less easy).
  2. Web APIs (application programming interfaces): the website offers a set of structured HTTP requests that return JSON or XML files.
  • Why R? It includes all the tools necessary for web scraping, you are already familiar with it, and you can analyse the data directly. Python, Perl, and Java are also capable tools.
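
For the Web API route, the JSON that comes back still has to be turned into a rectangle. This is a minimal sketch, assuming the jsonlite package is installed; the JSON text is a hand-written stand-in for an API response, not output from a real service.

```r
library(jsonlite)

# Hand-written stand-in for the body of an API response
json_text <- '[
  {"title": "The Godfather", "year": 1972},
  {"title": "Parasite", "year": 2019}
]'

# An array of records with the same fields becomes a data frame
movies <- fromJSON(json_text)

movies
```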
30/72

Web Scraping with rvest and polite

31/72

Hypertext Markup Language

Most of the data on the web is still largely available as HTML - while it is structured (hierarchical / tree based) it often is not available in a form useful for analysis (flat / tidy).

<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>
32/72

What if we want to extract parts of this text out?

<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>
  • read_html(): read HTML in (like read_csv and co!)

  • html_nodes(): select specified nodes from the HTML document using CSS selectors.

33/72

Let's read it in with read_html

example <- read_html(here::here("slides/data/example.html"))
example
## {html_document}
## <html>
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body>\n <p align="center">Hello world!</p>\n </body>
  • We have two parts - head and body - which makes sense:
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>
34/72

Now let's get the title

example %>%
  html_nodes("title")
## {xml_nodeset (1)}
## [1] <title>This is a title</title>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>
35/72

Now let's get the paragraph text

example %>%
  html_nodes("p")
## {xml_nodeset (1)}
## [1] <p align="center">Hello world!</p>
<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>
36/72

Rough summary

  • read_html - read in an HTML file
  • html_nodes - select the parts of the HTML file we want to look at
    • This requires knowing about the website structure
    • But it turns out websites are much, much more complicated than our little example file
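
The two steps in this summary can be tried without the example file, because read_html() also accepts HTML given as a string. This is a sketch, assuming rvest is installed; the variable names are illustrative.

```r
library(rvest)

# The example document from the slides, as a string instead of a file
html_string <- '<html>
<head>
<title>This is a title</title>
</head>
<body>
<p align="center">Hello world!</p>
</body>
</html>'

example <- read_html(html_string)

# Select a node, then pull out its text
page_title <- example %>%
  html_nodes("title") %>%
  html_text()

paragraph <- example %>%
  html_nodes("p") %>%
  html_text()

# Attributes of a node can be read with html_attr()
alignment <- example %>%
  html_nodes("p") %>%
  html_attr("align")

page_title
paragraph
alignment
```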
37/72

rvest + polite:

Simplify processing and manipulating HTML data

  • bow() - introduce yourself to the host and check whether the path can be scraped
  • scrape() - scrape website data (with nice defaults)
  • html_nodes() - select specified nodes from the HTML document using CSS selectors.
  • html_table() - parse an HTML table into a data frame.
  • html_text() - extract the text content between a pair of tags.
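
Of the functions listed, html_table() is the one that goes straight from HTML to a data frame. This is a small offline sketch, assuming rvest is installed; the table is hand-written for illustration and is not taken from a real page.

```r
library(rvest)

# Hand-written HTML table standing in for one found on a website
page <- read_html('<table>
  <tr><th>title</th><th>year</th></tr>
  <tr><td>The Godfather</td><td>1972</td></tr>
  <tr><td>Parasite</td><td>2019</td></tr>
</table>')

# html_table() on a set of nodes returns a list, one data frame per table
tables <- page %>%
  html_nodes("table") %>%
  html_table()

movies <- tables[[1]]
movies
```

The header row is taken from the th cells, and the year column is converted to a number on the way in.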
38/72

SelectorGadget: css selectors

  • SelectorGadget is a tool that helps identify the HTML elements of interest.
  • It does this by constructing a CSS selector, which can be used to subset the HTML document.
39/72

SelectorGadget: css selectors

Selector           Example        Description
element            p              Select all <p> elements
element element    div p          Select all <p> elements inside a <div> element
element>element    div > p        Select all <p> elements with <div> as a parent
.class             .title         Select all elements with class="title"
#id                #name          Select all elements with id="name"
[attribute]        [class]        Select all elements with a class attribute
[attribute=value]  [class=title]  Select all elements with class="title"
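
The selectors in this table can be tried on a small document before pointing them at a real site. This is a sketch, assuming rvest is installed; the HTML snippet and the helper count_matches() are made up for illustration.

```r
library(rvest)

# Made-up document: one <p> directly in a <div>, one nested deeper, one outside
page <- read_html(
  '<div><p class="title">First</p><ul><li><p>Second</p></li></ul></div><p id="name">Third</p>'
)

# Hypothetical helper: how many nodes does a selector match?
count_matches <- function(selector) length(html_nodes(page, selector))

count_matches("p")              # 3: every <p>
count_matches("div p")          # 2: any <p> somewhere inside a <div>
count_matches("div > p")        # 1: only a <p> whose parent is the <div>
count_matches(".title")         # 1
count_matches("[class=title]")  # 1

page %>% html_nodes("#name") %>% html_text()
```

The difference between "div p" and "div > p" is the one that most often surprises: the first matches at any depth, the second only direct children.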
40/72

SelectorGadget

  • SelectorGadget: Open source tool that eases CSS selector generation and discovery
  • Install the Chrome Extension
  • A box will open in the bottom right of the website. Click on a page element that you would like your selector to match (it will turn green). SelectorGadget will then generate a minimal CSS selector for that element, and will highlight (yellow) everything that is matched by the selector.
  • Now click on a highlighted element to remove it from the selector (red), or click on an unhighlighted element to add it to the selector. Through this process of selection and rejection, SelectorGadget helps you come up with the appropriate CSS selector for your needs.
41/72

Top 250 movies on IMDB

Take a look at the source code and look for the table tag:
http://www.imdb.com/chart/top

42/72

First check to make sure you're allowed!

# install.packages("polite")
library(polite)
bow("http://www.imdb.com")
## <polite session> http://www.imdb.com
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 26 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
bow("http://www.facebook.com")
## <polite session> http://www.facebook.com
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 438 rules are defined for 20 bots
## Crawl delay: 5 sec
## The path is not scrapable for this user-agent
43/72

Join in

  • Go to rstudio.cloud ida-exercise-5a
  • If you want to use R / Rstudio on your laptop:
# install.packages("usethis")
library(usethis)
use_course("https://ida.numbat.space/exercises/5b/ida-exercise-5b.zip")
44/72

Demo

Let's go to http://www.imdb.com/chart/top

45/72

Bow and scrape

imdb_session <- bow("http://www.imdb.com/chart/top")
imdb_session
## <polite session> http://www.imdb.com/chart/top
## User-agent: polite R package - https://github.com/dmi3kno/polite
## robots.txt: 26 rules are defined for 1 bots
## Crawl delay: 5 sec
## The path is scrapable for this user-agent
imdb_data <- scrape(imdb_session)
imdb_data
## {html_document}
## <html xmlns:og="http://ogp.me/ns#" xmlns:fb="http://www.facebook.com/2008/fbml">
## [1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
## [2] <body id="styleguide-v2" class="fixed">\n <img height="1" widt ...
46/72

Select and format pieces: titles - html_nodes()

library(rvest)
imdb_data %>%
  html_nodes(".titleColumn a")
## {xml_nodeset (250)}
## [1] <a href="/title/tt0111161/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [2] <a href="/title/tt0068646/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [3] <a href="/title/tt0071562/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [4] <a href="/title/tt0468569/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [5] <a href="/title/tt0050083/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [6] <a href="/title/tt0108052/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [7] <a href="/title/tt0167260/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [8] <a href="/title/tt0110912/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [9] <a href="/title/tt0060196/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [10] <a href="/title/tt0120737/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [11] <a href="/title/tt0137523/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [12] <a href="/title/tt0109830/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [13] <a href="/title/tt1375666/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [14] <a href="/title/tt0080684/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [15] <a href="/title/tt0167261/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [16] <a href="/title/tt0133093/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [17] <a href="/title/tt0099685/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [18] <a href="/title/tt0073486/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [19] <a href="/title/tt0047478/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## [20] <a href="/title/tt0114369/?pf_rd_m=A2FGELUUNOQJNL&amp;pf_rd_p=e31d89dd-3 ...
## ...
47/72

Select and format pieces: titles - html_text()

imdb_data %>%
  html_nodes(".titleColumn a") %>%
  html_text()
## [1] "The Shawshank Redemption"
## [2] "The Godfather"
## [3] "The Godfather: Part II"
## [4] "The Dark Knight"
## [5] "12 Angry Men"
## [6] "Schindler's List"
## [7] "The Lord of the Rings: The Return of the King"
## [8] "Pulp Fiction"
## [9] "The Good, the Bad and the Ugly"
## [10] "The Lord of the Rings: The Fellowship of the Ring"
## [11] "Fight Club"
## [12] "Forrest Gump"
## [13] "Inception"
## [14] "Star Wars: Episode V - The Empire Strikes Back"
## [15] "The Lord of the Rings: The Two Towers"
## [16] "The Matrix"
## [17] "Goodfellas"
## [18] "One Flew Over the Cuckoo's Nest"
## [19] "Seven Samurai"
## [20] "Se7en"
## [21] "City of God"
## [22] "Life Is Beautiful"
## [23] "The Silence of the Lambs"
## [24] "It's a Wonderful Life"
## [25] "Star Wars"
## [26] "Parasite"
## [27] "Saving Private Ryan"
## [28] "Spirited Away"
## [29] "The Green Mile"
## [30] "Interstellar"
## [31] "Leon: The Professional"
## [32] "The Usual Suspects"
## [33] "Seppuku"
## [34] "The Lion King"
## [35] "American History X"
## [36] "The Pianist"
## [37] "Terminator 2: Judgment Day"
## [38] "Back to the Future"
## [39] "Modern Times"
## [40] "Psycho"
## [41] "Gladiator"
## [42] "City Lights"
## [43] "The Intouchables"
## [44] "The Departed"
## [45] "Whiplash"
## [46] "The Prestige"
## [47] "Once Upon a Time in the West"
## [48] "Grave of the Fireflies"
## [49] "Casablanca"
## [50] "Joker"
## [51] "Cinema Paradiso"
## [52] "Rear Window"
## [53] "Alien"
## [54] "Apocalypse Now"
## [55] "Memento"
## [56] "Raiders of the Lost Ark"
## [57] "The Great Dictator"
## [58] "The Lives of Others"
## [59] "Django Unchained"
## [60] "Paths of Glory"
## [61] "The Shining"
## [62] "Avengers: Infinity War"
## [63] "WALL·E"
## [64] "Sunset Boulevard"
## [65] "Spider-Man: Into the Spider-Verse"
## [66] "Princess Mononoke"
## [67] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb"
## [68] "Oldboy"
## [69] "Witness for the Prosecution"
## [70] "Avengers: Endgame"
## [71] "The Dark Knight Rises"
## [72] "Once Upon a Time in America"
## [73] "Aliens"
## [74] "Your Name."
## [75] "Coco"
## [76] "American Beauty"
## [77] "Braveheart"
## [78] "1917"
## [79] "Das Boot"
## [80] "3 Idiots"
## [81] "Toy Story"
## [82] "Tengoku to jigoku"
## [83] "Taare Zameen Par"
## [84] "Star Wars: Episode VI - Return of the Jedi"
## [85] "Amadeus"
## [86] "Reservoir Dogs"
## [87] "Inglourious Basterds"
## [88] "Good Will Hunting"
## [89] "2001: A Space Odyssey"
## [90] "Requiem for a Dream"
## [91] "Vertigo"
## [92] "M - Eine Stadt sucht einen Mörder"
## [93] "Dangal"
## [94] "Eternal Sunshine of the Spotless Mind"
## [95] "Citizen Kane"
## [96] "The Hunt"
## [97] "Capharnaüm"
## [98] "Full Metal Jacket"
## [99] "North by Northwest"
## [100] "A Clockwork Orange"
## [101] "Snatch"
## [102] "The Kid"
## [103] "Bicycle Thieves"
## [104] "Singin' in the Rain"
## [105] "Scarface"
## [106] "Taxi Driver"
## [107] "Amelie"
## [108] "Lawrence of Arabia"
## [109] "The Sting"
## [110] "Toy Story 3"
## [111] "Metropolis"
## [112] "For a Few Dollars More"
## [113] "Ikiru"
## [114] "Jodaeiye Nader az Simin"
## [115] "Double Indemnity"
## [116] "The Apartment"
## [117] "To Kill a Mockingbird"
## [118] "Incendies"
## [119] "Indiana Jones and the Last Crusade"
## [120] "Up"
## [121] "L.A. Confidential"
## [122] "Monty Python and the Holy Grail"
## [123] "Heat"
## [124] "Rashomon"
## [125] "Die Hard"
## [126] "Yojimbo"
## [127] "Batman Begins"
## [128] "Green Book"
## [129] "Downfall"
## [130] "Unforgiven"
## [131] "Idi i smotri"
## [132] "Bacheha-Ye aseman"
## [133] "Some Like It Hot"
## [134] "Howl's Moving Castle"
## [135] "Ran"
## [136] "The Great Escape"
## [137] "All About Eve"
## [138] "A Beautiful Mind"
## [139] "Casino"
## [140] "Pan's Labyrinth"
## [141] "My Neighbor Totoro"
## [142] "El secreto de sus ojos"
## [143] "Raging Bull"
## [144] "Lock, Stock and Two Smoking Barrels"
## [145] "The Wolf of Wall Street"
## [146] "The Treasure of the Sierra Madre"
## [147] "Judgment at Nuremberg"
## [148] "There Will Be Blood"
## [149] "Babam ve Oglum"
## [150] "Three Billboards Outside Ebbing, Missouri"
## [151] "The Gold Rush"
## [152] "Chinatown"
## [153] "Dial M for Murder"
## [154] "V for Vendetta"
## [155] "Det sjunde inseglet"
## [156] "Inside Out"
## [157] "No Country for Old Men"
## [158] "Warrior"
## [159] "Shutter Island"
## [160] "Trainspotting"
## [161] "The Elephant Man"
## [162] "The Sixth Sense"
## [163] "The Thing"
## [164] "Room"
## [165] "Gone with the Wind"
## [166] "Jurassic Park"
## [167] "Blade Runner"
## [168] "The Bridge on the River Kwai"
## [169] "Smultronstället"
## [170] "Finding Nemo"
## [171] "The Third Man"
## [172] "On the Waterfront"
## [173] "Stalker"
## [174] "Fargo"
## [175] "Kill Bill: Vol. 1"
## [176] "The Truman Show"
## [177] "Gran Torino"
## [178] "Tôkyô monogatari"
## [179] "The Deer Hunter"
## [180] "Memories of Murder"
## [181] "Relatos salvajes"
## [182] "Eskiya"
## [183] "Andhadhun"
## [184] "Klaus"
## [185] "The Big Lebowski"
## [186] "Mary and Max"
## [187] "In the Name of the Father"
## [188] "Gone Girl"
## [189] "Hacksaw Ridge"
## [190] "The Grand Budapest Hotel"
## [191] "Ford v Ferrari"
## [192] "Persona"
## [193] "Mr. Smith Goes to Washington"
## [194] "How to Train Your Dragon"
## [195] "Before Sunrise"
## [196] "Catch Me If You Can"
## [197] "The General"
## [198] "Sherlock Jr."
## [199] "Prisoners"
## [200] "12 Years a Slave"
## [201] "Cool Hand Luke"
## [202] "Mad Max: Fury Road"
## [203] "Network"
## [204] "Stand by Me"
## [205] "Le salaire de la peur"
## [206] "Barry Lyndon"
## [207] "Into the Wild"
## [208] "Million Dollar Baby"
## [209] "Monty Python's Life of Brian"
## [210] "Platoon"
## [211] "Hachi: A Dog's Tale"
## [212] "Ben-Hur"
## [213] "Rush"
## [214] "Logan"
## [215] "La passion de Jeanne d'Arc"
## [216] "Andrei Rublev"
## [217] "Harry Potter and the Deathly Hallows: Part 2"
## [218] "Dead Poets Society"
## [219] "Les quatre cents coups"
## [220] "Rang De Basanti"
## [221] "Hotel Rwanda"
## [222] "Amores perros"
## [223] "Kaze no tani no Naushika"
## [224] "Spotlight"
## [225] "Ah-ga-ssi"
## [226] "Rebecca"
## [227] "Rocky"
## [228] "Portrait of a Lady on Fire"
## [229] "Monsters, Inc."
## [230] "La haine"
## [231] "It Happened One Night"
## [232] "Faa yeung nin wa"
## [233] "Gangs of Wasseypur"
## [234] "Before Sunset"
## [235] "The Princess Bride"
## [236] "The Help"
## [237] "Ace in the Hole"
## [238] "Paris, Texas"
## [239] "The Invisible Guest"
## [240] "The Red Shoes"
## [241] "Drishyam"
## [242] "The Terminator"
## [243] "Lagaan: Once Upon a Time in India"
## [244] "Butch Cassidy and the Sundance Kid"
## [245] "Akira"
## [246] "Aladdin"
## [247] "PK"
## [248] "Kis Uykusu"
## [249] "Fanny och Alexander"
## [250] "Throne of Blood"
48/72

Select and format pieces: save it

titles <- imdb_data %>%
  html_nodes(".titleColumn a") %>%
  html_text()
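
The same selection can be rehearsed offline on a fragment shaped like the IMDB table. This is a sketch, assuming rvest is installed; the fragment is hand-written to mimic the .titleColumn structure shown in the output above, and the name fake_imdb is hypothetical. The live page's markup may differ or change.

```r
library(rvest)

# Hand-written stand-in for two rows of the scraped page
fake_imdb <- read_html('<table>
  <tr><td class="titleColumn">1. <a href="/title/tt0111161/">The Shawshank Redemption</a>
    <span class="secondaryInfo">(1994)</span></td></tr>
  <tr><td class="titleColumn">2. <a href="/title/tt0068646/">The Godfather</a>
    <span class="secondaryInfo">(1972)</span></td></tr>
</table>')

# Same pipeline as on the slide, applied to the stand-in page
titles <- fake_imdb %>%
  html_nodes(".titleColumn a") %>%
  html_text()

titles
```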
49/72

Select and format pieces: years - nodes

imdb_data %>%
  html_nodes(".secondaryInfo")
## {xml_nodeset (250)}
## [1] <span class="secondaryInfo">(1994)</span>
## [2] <span class="secondaryInfo">(1972)</span>
## [3] <span class="secondaryInfo">(1974)</span>
## [4] <span class="secondaryInfo">(2008)</span>
## [5] <span class="secondaryInfo">(1957)</span>
## [6] <span class="secondaryInfo">(1993)</span>
## [7] <span class="secondaryInfo">(2003)</span>
## [8] <span class="secondaryInfo">(1994)</span>
## [9] <span class="secondaryInfo">(1966)</span>
## [10] <span class="secondaryInfo">(2001)</span>
## [11] <span class="secondaryInfo">(1999)</span>
## [12] <span class="secondaryInfo">(1994)</span>
## [13] <span class="secondaryInfo">(2010)</span>
## [14] <span class="secondaryInfo">(1980)</span>
## [15] <span class="secondaryInfo">(2002)</span>
## [16] <span class="secondaryInfo">(1999)</span>
## [17] <span class="secondaryInfo">(1990)</span>
## [18] <span class="secondaryInfo">(1975)</span>
## [19] <span class="secondaryInfo">(1954)</span>
## [20] <span class="secondaryInfo">(1995)</span>
## ...
50/72

Select and format pieces: years - text

imdb_data %>%
  html_nodes(".secondaryInfo") %>%
  html_text()
## [1] "(1994)" "(1972)" "(1974)" "(2008)" "(1957)" "(1993)" "(2003)" "(1994)"
## [9] "(1966)" "(2001)" "(1999)" "(1994)" "(2010)" "(1980)" "(2002)" "(1999)"
## [17] "(1990)" "(1975)" "(1954)" "(1995)" "(2002)" "(1997)" "(1991)" "(1946)"
## [25] "(1977)" "(2019)" "(1998)" "(2001)" "(1999)" "(2014)" "(1994)" "(1995)"
## [33] "(1962)" "(1994)" "(1998)" "(2002)" "(1991)" "(1985)" "(1936)" "(1960)"
## [41] "(2000)" "(1931)" "(2011)" "(2006)" "(2014)" "(2006)" "(1968)" "(1988)"
## [49] "(1942)" "(2019)" "(1988)" "(1954)" "(1979)" "(1979)" "(2000)" "(1981)"
## [57] "(1940)" "(2006)" "(2012)" "(1957)" "(1980)" "(2018)" "(2008)" "(1950)"
## [65] "(2018)" "(1997)" "(1964)" "(2003)" "(1957)" "(2019)" "(2012)" "(1984)"
## [73] "(1986)" "(2016)" "(2017)" "(1999)" "(1995)" "(2019)" "(1981)" "(2009)"
## [81] "(1995)" "(1963)" "(2007)" "(1983)" "(1984)" "(1992)" "(2009)" "(1997)"
## [89] "(1968)" "(2000)" "(1958)" "(1931)" "(2016)" "(2004)" "(1941)" "(2012)"
## [97] "(2018)" "(1987)" "(1959)" "(1971)" "(2000)" "(1921)" "(1948)" "(1952)"
## [105] "(1983)" "(1976)" "(2001)" "(1962)" "(1973)" "(2010)" "(1927)" "(1965)"
## [113] "(1952)" "(2011)" "(1944)" "(1960)" "(1962)" "(2010)" "(1989)" "(2009)"
## [121] "(1997)" "(1975)" "(1995)" "(1950)" "(1988)" "(1961)" "(2005)" "(2018)"
## [129] "(2004)" "(1992)" "(1985)" "(1997)" "(1959)" "(2004)" "(1985)" "(1963)"
## [137] "(1950)" "(2001)" "(1995)" "(2006)" "(1988)" "(2009)" "(1980)" "(1998)"
## [145] "(2013)" "(1948)" "(1961)" "(2007)" "(2005)" "(2017)" "(1925)" "(1974)"
## [153] "(1954)" "(2005)" "(1957)" "(2015)" "(2007)" "(2011)" "(2010)" "(1996)"
## [161] "(1980)" "(1999)" "(1982)" "(2015)" "(1939)" "(1993)" "(1982)" "(1957)"
## [169] "(1957)" "(2003)" "(1949)" "(1954)" "(1979)" "(1996)" "(2003)" "(1998)"
## [177] "(2008)" "(1953)" "(1978)" "(2003)" "(2014)" "(1996)" "(2018)" "(2019)"
## [185] "(1998)" "(2009)" "(1993)" "(2014)" "(2016)" "(2014)" "(2019)" "(1966)"
## [193] "(1939)" "(2010)" "(1995)" "(2002)" "(1926)" "(1924)" "(2013)" "(2013)"
## [201] "(1967)" "(2015)" "(1976)" "(1986)" "(1953)" "(1975)" "(2007)" "(2004)"
## [209] "(1979)" "(1986)" "(2009)" "(1959)" "(2013)" "(2017)" "(1928)" "(1966)"
## [217] "(2011)" "(1989)" "(1959)" "(2006)" "(2004)" "(2000)" "(1984)" "(2015)"
## [225] "(2016)" "(1940)" "(1976)" "(2019)" "(2001)" "(1995)" "(1934)" "(2000)"
## [233] "(2012)" "(2004)" "(1987)" "(2011)" "(1951)" "(1984)" "(2016)" "(1948)"
## [241] "(2015)" "(1984)" "(2001)" "(1969)" "(1988)" "(1992)" "(2014)" "(2014)"
## [249] "(1982)" "(1957)"
51/72

Select and format pieces: years - remove-brackets

imdb_data %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)") %>% # remove )
  as.numeric()
## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994 2010 1980 2002
## [16] 1999 1990 1975 1954 1995 2002 1997 1991 1946 1977 2019 1998 2001 1999 2014
## [31] 1994 1995 1962 1994 1998 2002 1991 1985 1936 1960 2000 1931 2011 2006 2014
## [46] 2006 1968 1988 1942 2019 1988 1954 1979 1979 2000 1981 1940 2006 2012 1957
## [61] 1980 2018 2008 1950 2018 1997 1964 2003 1957 2019 2012 1984 1986 2016 2017
## [76] 1999 1995 2019 1981 2009 1995 1963 2007 1983 1984 1992 2009 1997 1968 2000
## [91] 1958 1931 2016 2004 1941 2012 2018 1987 1959 1971 2000 1921 1948 1952 1983
## [106] 1976 2001 1962 1973 2010 1927 1965 1952 2011 1944 1960 1962 2010 1989 2009
## [121] 1997 1975 1995 1950 1988 1961 2005 2018 2004 1992 1985 1997 1959 2004 1985
## [136] 1963 1950 2001 1995 2006 1988 2009 1980 1998 2013 1948 1961 2007 2005 2017
## [151] 1925 1974 1954 2005 1957 2015 2007 2011 2010 1996 1980 1999 1982 2015 1939
## [166] 1993 1982 1957 1957 2003 1949 1954 1979 1996 2003 1998 2008 1953 1978 2003
## [181] 2014 1996 2018 2019 1998 2009 1993 2014 2016 2014 2019 1966 1939 2010 1995
## [196] 2002 1926 1924 2013 2013 1967 2015 1976 1986 1953 1975 2007 2004 1979 1986
## [211] 2009 1959 2013 2017 1928 1966 2011 1989 1959 2006 2004 2000 1984 2015 2016
## [226] 1940 1976 2019 2001 1995 1934 2000 2012 2004 1987 2011 1951 1984 2016 1948
## [241] 2015 1984 2001 1969 1988 1992 2014 2014 1982 1957
52/72

Select and format pieces: years - parse_number()

imdb_data %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  parse_number()
## [1] 1994 1972 1974 2008 1957 1993 2003 1994 1966 2001 1999 1994 2010 1980 2002
## [16] 1999 1990 1975 1954 1995 2002 1997 1991 1946 1977 2019 1998 2001 1999 2014
## [31] 1994 1995 1962 1994 1998 2002 1991 1985 1936 1960 2000 1931 2011 2006 2014
## [46] 2006 1968 1988 1942 2019 1988 1954 1979 1979 2000 1981 1940 2006 2012 1957
## [61] 1980 2018 2008 1950 2018 1997 1964 2003 1957 2019 2012 1984 1986 2016 2017
## [76] 1999 1995 2019 1981 2009 1995 1963 2007 1983 1984 1992 2009 1997 1968 2000
## [91] 1958 1931 2016 2004 1941 2012 2018 1987 1959 1971 2000 1921 1948 1952 1983
## [106] 1976 2001 1962 1973 2010 1927 1965 1952 2011 1944 1960 1962 2010 1989 2009
## [121] 1997 1975 1995 1950 1988 1961 2005 2018 2004 1992 1985 1997 1959 2004 1985
## [136] 1963 1950 2001 1995 2006 1988 2009 1980 1998 2013 1948 1961 2007 2005 2017
## [151] 1925 1974 1954 2005 1957 2015 2007 2011 2010 1996 1980 1999 1982 2015 1939
## [166] 1993 1982 1957 1957 2003 1949 1954 1979 1996 2003 1998 2008 1953 1978 2003
## [181] 2014 1996 2018 2019 1998 2009 1993 2014 2016 2014 2019 1966 1939 2010 1995
## [196] 2002 1926 1924 2013 2013 1967 2015 1976 1986 1953 1975 2007 2004 1979 1986
## [211] 2009 1959 2013 2017 1928 1966 2011 1989 1959 2006 2004 2000 1984 2015 2016
## [226] 1940 1976 2019 2001 1995 1934 2000 2012 2004 1987 2011 1951 1984 2016 1948
## [241] 2015 1984 2001 1969 1988 1992 2014 2014 1982 1957
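To see what `parse_number()` does without scraping anything, here is a small sketch, assuming the readr package is installed. The input strings are made up, apart from the ratings sentence, which is copied from a `title` attribute in the scores output. Note that `parse_number()` keeps only the first number it finds in each string.

```r
library(readr)

# Made-up strings in the same shape as the scraped year text.
raw_years <- c("(1994)", "(1972)", "(1974)")

# Non-numeric characters around the number are dropped.
years_parsed <- parse_number(raw_years)
years_parsed

# Only the first number in the string is returned, here the score.
first_only <- parse_number("9.2 based on 2,223,821 user ratings")
first_only
```

This is why `parse_number()` works well for the year strings, which contain a single number, but would need more care for text that holds several numbers.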

Select and format pieces: years - save

years <- imdb_data %>%
  html_nodes(".secondaryInfo") %>%
  html_text() %>%
  str_remove("\\(") %>% # remove (
  str_remove("\\)") %>% # remove )
  as.numeric()

Select and format pieces: scores - nodes

imdb_data %>%
  html_nodes(".imdbRating strong")
## {xml_nodeset (250)}
## [1] <strong title="9.2 based on 2,223,821 user ratings">9.2</strong>
## [2] <strong title="9.1 based on 1,532,457 user ratings">9.1</strong>
## [3] <strong title="9.0 based on 1,072,718 user ratings">9.0</strong>
## [4] <strong title="9.0 based on 2,198,206 user ratings">9.0</strong>
## [5] <strong title="8.9 based on 649,936 user ratings">8.9</strong>
## [6] <strong title="8.9 based on 1,158,141 user ratings">8.9</strong>
## [7] <strong title="8.9 based on 1,575,650 user ratings">8.9</strong>
## [8] <strong title="8.8 based on 1,744,462 user ratings">8.8</strong>
## [9] <strong title="8.8 based on 657,898 user ratings">8.8</strong>
## [10] <strong title="8.8 based on 1,588,379 user ratings">8.8</strong>
## [11] <strong title="8.8 based on 1,772,374 user ratings">8.8</strong>
## [12] <strong title="8.8 based on 1,715,479 user ratings">8.8</strong>
## [13] <strong title="8.7 based on 1,950,012 user ratings">8.7</strong>
## [14] <strong title="8.7 based on 1,111,785 user ratings">8.7</strong>
## [15] <strong title="8.7 based on 1,423,084 user ratings">8.7</strong>
## [16] <strong title="8.6 based on 1,597,852 user ratings">8.6</strong>
## [17] <strong title="8.6 based on 967,824 user ratings">8.6</strong>
## [18] <strong title="8.6 based on 874,402 user ratings">8.6</strong>
## [19] <strong title="8.6 based on 300,677 user ratings">8.6</strong>
## [20] <strong title="8.6 based on 1,369,370 user ratings">8.6</strong>
## ...
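Each node above also carries a `title` attribute that holds the number of user ratings. The slides do not extract it, so the following is only a sketch of one way to do so, assuming the rvest, stringr and readr packages are installed. `html_attr()` is an rvest function not used elsewhere in these slides, and `snippet` is a tiny hand-built page containing two of the nodes shown above.

```r
library(rvest)
library(stringr)
library(readr)

# Two nodes copied from the output above, wrapped in a minimal table.
snippet <- read_html('
  <table>
    <tr><td class="imdbRating"><strong title="9.2 based on 2,223,821 user ratings">9.2</strong></td></tr>
    <tr><td class="imdbRating"><strong title="9.1 based on 1,532,457 user ratings">9.1</strong></td></tr>
  </table>')

votes <- snippet %>%
  html_nodes(".imdbRating strong") %>%
  html_attr("title") %>%                         # text of the title attribute
  str_extract("[\\d,]+(?= user ratings)") %>%    # e.g. "2,223,821"
  parse_number()                                 # drop the commas, convert to numeric

votes
```

The regular expression picks out the run of digits and commas that sits directly before " user ratings", so the score at the start of the attribute is skipped.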

Select and format pieces: scores - text

imdb_data %>%
  html_nodes(".imdbRating strong") %>%
  html_text()
## [1] "9.2" "9.1" "9.0" "9.0" "8.9" "8.9" "8.9" "8.8" "8.8" "8.8" "8.8" "8.8"
## [13] "8.7" "8.7" "8.7" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6" "8.6"
## [25] "8.6" "8.6" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5"
## [37] "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.5" "8.4"
## [49] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4"
## [61] "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.4" "8.3" "8.3" "8.3"
## [73] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
## [85] "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3" "8.3"
## [97] "8.3" "8.3" "8.3" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [109] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [121] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [133] "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2" "8.2"
## [145] "8.2" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
## [157] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
## [169] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
## [181] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
## [193] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1"
## [205] "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.1" "8.0" "8.0" "8.0"
## [217] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0"
## [229] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0"
## [241] "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0" "8.0"

Select and format pieces: scores - as-numeric

imdb_data %>%
  html_nodes(".imdbRating strong") %>%
  html_text() %>%
  as.numeric()
## [1] 9.2 9.1 9.0 9.0 8.9 8.9 8.9 8.8 8.8 8.8 8.8 8.8 8.7 8.7 8.7 8.6 8.6 8.6
## [19] 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.6 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5
## [37] 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.5 8.4 8.4 8.4 8.4 8.4 8.4 8.4
## [55] 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.4 8.3 8.3 8.3
## [73] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3
## [91] 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.3 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [109] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [127] 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2 8.2
## [145] 8.2 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [163] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [181] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1
## [199] 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.1 8.0 8.0 8.0
## [217] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0
## [235] 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0 8.0

Select and format pieces: scores - save

scores <- imdb_data %>%
  html_nodes(".imdbRating strong") %>%
  html_text() %>%
  as.numeric()

Select and format pieces: put it all together

imdb_top_250 <- tibble(title = titles,
                       year = years,
                       score = scores)
imdb_top_250
## # A tibble: 250 x 3
##    title                                              year score
##    <chr>                                             <dbl> <dbl>
##  1 The Shawshank Redemption                           1994   9.2
##  2 The Godfather                                      1972   9.1
##  3 The Godfather: Part II                             1974   9
##  4 The Dark Knight                                    2008   9
##  5 12 Angry Men                                       1957   8.9
##  6 Schindler's List                                   1993   8.9
##  7 The Lord of the Rings: The Return of the King      2003   8.9
##  8 Pulp Fiction                                       1994   8.8
##  9 The Good, the Bad and the Ugly                     1966   8.8
## 10 The Lord of the Rings: The Fellowship of the Ring  2001   8.8
## # … with 240 more rows

Aside: Yet another approach - pull the table with html_table()

  • requires notation we haven't used yet (e.g., what is [[]])
  • requires substantial text cleaning
  • If there is time we can cover this at the end of class
imdb_table <- html_table(imdb_data)
glimpse(imdb_table[[1]])
## Rows: 250
## Columns: 5
## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
## $ `Rank & Title` <chr> "1.\n The Shawshank Redemption\n (1994)", …
## $ `IMDb Rating` <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8,…
## $ `Your Rating` <chr> "12345678910\n \n \n \n …
## $ `` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
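The two points in the list above can be sketched on a single value. `html_table()` returns a list of tables, and `[[1]]` pulls the first element out of that list. The `Rank & Title` column then needs text cleaning. This is a minimal sketch, assuming stringr is installed; `tables` is a hand-built stand-in for the scraped list, and the regular expression is one possible pattern rather than the approach used in class.

```r
library(stringr)

# Stand-in for the list returned by html_table(): one table with one row.
# The value is copied from the glimpse() output (whitespace shortened).
tables <- list(
  data.frame(`Rank & Title` = "1.\n The Shawshank Redemption\n (1994)",
             check.names = FALSE)
)

# [[1]] extracts the first element of the list, here a data frame.
first_table <- tables[[1]]
rank_title <- first_table$`Rank & Title`

# Collapse newlines and repeated spaces into single spaces.
squished <- str_squish(rank_title)

# Capture rank, title and year: digits, a dot, the title, then "(yyyy)".
pieces <- str_match(squished, "^(\\d+)\\. (.*) \\((\\d{4})\\)$")

rank  <- as.integer(pieces[, 2])
title <- pieces[, 3]
year  <- as.integer(pieces[, 4])

rank
title
year
```

`str_match()` returns a matrix whose first column is the full match and whose remaining columns are the captured groups, which is why the pieces start at column 2.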

Clean up / enhance

May or may not be a lot of work depending on how messy the data are

  • See if you like what you got:
glimpse(imdb_top_250)
## Rows: 250
## Columns: 3
## $ title <chr> "The Shawshank Redemption", "The Godfather", "The Godfather: Pa…
## $ year <dbl> 1994, 1972, 1974, 2008, 1957, 1993, 2003, 1994, 1966, 2001, 199…
## $ score <dbl> 9.2, 9.1, 9.0, 9.0, 8.9, 8.9, 8.9, 8.8, 8.8, 8.8, 8.8, 8.8, 8.7…

Clean up / enhance

  • Add a variable for rank
imdb_top_250 %>%
  mutate(
    rank = row_number() # number the rows 1, 2, ..., n in their current order
  )
## # A tibble: 250 x 4
##    title                                              year score  rank
##    <chr>                                             <dbl> <dbl> <int>
##  1 The Shawshank Redemption                           1994   9.2     1
##  2 The Godfather                                      1972   9.1     2
##  3 The Godfather: Part II                             1974   9       3
##  4 The Dark Knight                                    2008   9       4
##  5 12 Angry Men                                       1957   8.9     5
##  6 Schindler's List                                   1993   8.9     6
##  7 The Lord of the Rings: The Return of the King      2003   8.9     7
##  8 Pulp Fiction                                       1994   8.8     8
##  9 The Good, the Bad and the Ugly                     1966   8.8     9
## 10 The Lord of the Rings: The Fellowship of the Ring  2001   8.8    10
## # … with 240 more rows

Your Turn: Which 1995 movies made the list?

imdb_top_250 %>%
  filter(year == 1995)
## # A tibble: 8 x 3
##   title               year score
##   <chr>              <dbl> <dbl>
## 1 Se7en               1995   8.6
## 2 The Usual Suspects  1995   8.5
## 3 Braveheart          1995   8.3
## 4 Toy Story           1995   8.3
## 5 Heat                1995   8.2
## 6 Casino              1995   8.2
## 7 Before Sunrise      1995   8.1
## 8 La haine            1995   8

Your turn: Which years have the most movies on the list?

imdb_top_250 %>%
  group_by(year) %>%
  summarise(total = n()) %>%
  arrange(desc(total)) %>%
  head(5)
## # A tibble: 5 x 2
##    year total
##   <dbl> <int>
## 1  1995     8
## 2  1957     7
## 3  2014     7
## 4  2019     7
## 5  2000     6
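The `group_by()`, `summarise()`, `arrange()` pattern above can also be written with dplyr's `count()`. This is a minimal sketch, assuming dplyr is installed; `toy_movies` is made-up data with placeholder titles, used so the code runs without the scraped table.

```r
library(dplyr)

# Made-up data: four placeholder films across three years.
toy_movies <- tibble(
  title = c("Film A", "Film B", "Film C", "Film D"),
  year  = c(1995, 1995, 1957, 2000)
)

# count() groups by year, counts the rows, and (with sort = TRUE)
# puts the most frequent years first. `name` sets the count column name.
year_totals <- toy_movies %>%
  count(year, sort = TRUE, name = "total")

year_totals
```

On the real data, `imdb_top_250 %>% count(year, sort = TRUE, name = "total") %>% head(5)` would be the equivalent call.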

Your Turn: Visualize top 250 yearly mean score over time

imdb_top_250 %>%
  group_by(year) %>%
  summarise(avg_score = mean(score)) %>%
  ggplot(aes(y = avg_score, x = year)) +
  geom_point() +
  geom_smooth(method = "lm") +
  xlab("year")

Other common formats: JSON

  • JSON stands for JavaScript Object Notation.
  • It is a language-independent data format that has largely supplanted Extensible Markup Language (XML) for exchanging data on the web.
  • Data are sometimes stored as JSON, which requires special unpacking.

Unpacking JSON: Example JSON from jsonlite

library(jsonlite)
json_mario <- '[
  {
    "Name": "Mario",
    "Age": 32,
    "Occupation": "Plumber"
  },
  {
    "Name": "Peach",
    "Age": 21,
    "Occupation": "Princess"
  },
  {},
  {
    "Name": "Bowser",
    "Occupation": "Koopa"
  }
]'
mydf <- fromJSON(json_mario)
mydf
##     Name Age Occupation
## 1  Mario  32    Plumber
## 2  Peach  21   Princess
## 3   <NA>  NA       <NA>
## 4 Bowser  NA      Koopa
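JSON from the web is often nested, which is where the "special unpacking" comes in. The sketch below is an illustration only, assuming jsonlite is installed; the `Home` field and its values are made up to extend the example above. With `flatten = TRUE`, `fromJSON()` spreads a nested object into ordinary columns.

```r
library(jsonlite)

# Made-up nested JSON: each character has a "Home" object inside it.
json_nested <- '[
  {"Name": "Mario", "Home": {"World": "Mushroom Kingdom", "Castle": false}},
  {"Name": "Peach", "Home": {"World": "Mushroom Kingdom", "Castle": true}}
]'

# Default: "Home" becomes a data frame stored inside a single column.
nested <- fromJSON(json_nested)
names(nested)

# flatten = TRUE: nested fields become columns named "Home.World", "Home.Castle".
flat <- fromJSON(json_nested, flatten = TRUE)
names(flat)
flat
```

The flattened version is usually easier to use with dplyr verbs such as `filter()` and `mutate()`.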

Potential challenges with web scraping

  • Unreliable formatting at the source
  • Data broken into many pages
  • Data arriving in multiple Excel file formats
  • ... We will come back to this when we learn about functions next week.

Compare the display of information at Gumtree Melbourne to the IMDb Top 250 list. What challenges can you foresee in scraping a list of the available apartments?
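For data broken into many pages, one common pattern is to build the page URLs, scrape each page with the same function, and stack the results. The following is a rough sketch in base R, not a working scraper: the base URL and its `page` parameter are hypothetical, and `scrape_page()` is a stand-in that returns a small data frame instead of calling `read_html()`, so the code runs without a network connection.

```r
# Hypothetical site: the base URL and the "page" query parameter are made up.
base_url <- "https://example.com/listings?page="
page_urls <- paste0(base_url, 1:3)

# Stand-in for a real scraper (e.g. read_html() plus html_nodes()).
# It only records the URL so the sketch runs offline.
scrape_page <- function(url) {
  data.frame(url = url, n_chars = nchar(url), stringsAsFactors = FALSE)
}

# Scrape each page in turn, pausing between requests to be polite to the server.
results <- lapply(page_urls, function(url) {
  Sys.sleep(0.1)
  scrape_page(url)
})

# Stack the per-page data frames into one.
all_pages <- do.call(rbind, results)
all_pages
```

In a real scraper the pause would typically be longer, and `scrape_page()` would contain the selecting and formatting steps shown earlier for a single page.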


Further exploring

People write R packages to access online data! Check out:
