

ETC1010: Introduction to Data Analysis

Week 10, part B


Classification Trees

Lecturer: Professor Di Cook, Nicholas Tierney & Stuart Lee

Department of Econometrics and Business Statistics

nicholas.tierney@monash.edu

May 2020


1/28

Recap

  • Decision Tree
2/28

Admin

  • Project
    • Use of data
    • Don't use Kaggle datasets
    • Talk to us about your data in class and at consults
  • Practical exam
    • Opens next Wednesday at 12pm, closing 12pm Thursday
3/28

What is a decision tree?

Tree-based models consist of one or more nested if-then statements for the predictors that partition the data. Within these partitions, a model is used to predict the outcome.

Source: Egor Dezhic
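To make the if-then idea concrete, here is a tiny hand-written sketch in R. The predictor names and thresholds are purely illustrative (they echo the exam-score example used later in these slides); this is not output from a fitted model.

```r
# A fitted tree is just nested if-then rules on the predictors.
# Names and thresholds are illustrative only.
predict_by_hand <- function(exam1, exam2) {
  if (exam1 >= 56.7) {
    "pass"
  } else if (exam2 >= 52.5) {
    "pass"
  } else {
    "fail"
  }
}

predict_by_hand(exam1 = 40, exam2 = 80)  # "pass"
predict_by_hand(exam1 = 40, exam2 = 30)  # "fail"
```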

4/28

Regression Tree

5/28

Regression Tree

6/28

Regression tree

  • What if we want to predict whether something belongs to a particular group? Say, predicting whether someone passes a course based on two exam scores.
  • Moving from continuous to categorical response.

7/28

Regression? Classification?

  • Regression trees give the predicted response for an observation by using the mean response of the observations that belong to the same terminal node:
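A minimal sketch of that idea, with made-up data (not from the lecture): the prediction for a node is simply the mean of the training responses that fall in it.

```r
# Toy illustration: regression tree predictions are node means.
y    <- c(2, 3, 10, 11, 12)          # response values
node <- c("A", "A", "B", "B", "B")   # terminal node each observation falls in

node_means <- tapply(y, node, mean)
node_means
#>    A    B
#>  2.5 11.0

# A new observation that lands in node "B" is predicted as 11.
```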

8/28

Classification

A classification tree predicts each observation as belonging to the most commonly occurring class among the observations in its region.

However, when we interpret a classification tree, we are often interested not only in the class prediction (what is most common), but also the proportion of correct classifications.

9/28

Building a classification tree

  • We take a similar approach to building a classification tree as we did for regression trees
  • We use the same "recursive binary splitting" approach
  • But we don't use the residual sum of squares

$$\text{SST} = \sum_{i} (y_i - \bar{y})^2$$

Since the response is now a category, we need some other way to describe how good a split is.

We need something else!

10/28

Classification tree

  • We can use the "classification error".
  • Where we count up the number of mis-classified things, and choose the split that has the lowest number of mis-classified things.
  • We can represent this in an equation as the fraction of observations in a region which don't belong to the most common class.

$$E = 1 - \max_{k}(\hat{p}_{mk})$$

Here, $\hat{p}_{mk}$ refers to the proportion of observations in the mth region that are from the kth class.
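A small sketch of this calculation for a single region, using made-up class proportions:

```r
# Classification error for one region: E = 1 - max_k(p_hat_mk).
# The proportions below are made up for illustration.
class_error <- function(p_hat) 1 - max(p_hat)

class_error(c(pass = 0.9, fail = 0.1))   # 0.1 -- a fairly pure region
class_error(c(pass = 0.5, fail = 0.5))   # 0.5 -- as impure as possible for 2 classes
class_error(c(pass = 1.0, fail = 0.0))   # 0   -- a perfectly pure region
```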

11/28

Understanding classification

Another way to think about this is to understand when E is zero, and when E is large.

$$E = 1 - \max_{k}(\hat{p}_{mk})$$

E is zero when $\max_{k}(\hat{p}_{mk})$ is 1, which happens when all of the observations in a region are of the same class.

12/28

Classification trees

  • A classification tree is used to predict a categorical response and a regression tree is used to predict a quantitative response
  • Use recursive binary splitting to grow a classification tree. That is, sequentially break the data into two subsets, typically using a single variable each time.
  • The predicted value for a new observation, x0, will be the most commonly occurring class of observations in the sub-region in which x0 falls.
13/28

Predicting pass or fail ?

Consider the dataset Exam where two exam scores are given for each student, and a class Label represents whether they passed or failed the course.

## Exam1 Exam2 Label
## 1 34.62366 78.02469 0
## 2 30.28671 43.89500 0
## 3 35.84741 72.90220 0
## 4 60.18260 86.30855 1

14/28

Your turn:

Open "10b-exercise-intro.Rmd" and let's decide a point to split the data.

15/28

Calculate the number of misclassifications

Along all splits for Exam1, classifying according to the majority class for the left and right splits.

Red dots are "fails", blue dots are "passes", and crosses indicate misclassifications. Source: John Ormerod, U.Syd

16/28

Calculate the number of misclassifications

Along all splits for Exam2, classifying according to the majority class for the top and bottom splits.

Red dots are "fails", blue dots are "passes", and crosses indicate misclassifications. Source: John Ormerod, U.Syd

17/28

Combining the results from Exam1 and Exam2 splits

  • The minimum number of misclassifications from using all possible splits of Exam1 was 19 when the value of Exam1 was 56.7
  • The minimum number of misclassifications from using all possible splits of Exam2 was 23 when the value of Exam2 was 52.5

So we split on the best of these, i.e., split the data on Exam1 at 56.7.
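Below is a sketch of how such a search could be done by hand, assuming the Exam data frame shown earlier (columns Exam1 and Label); every candidate split is scored by the number of misclassifications when each side is labelled with its majority class.

```r
# Sketch: score every candidate split of one variable by counting
# misclassifications, labelling each side with its majority class.
# Assumes a data frame `Exam` with columns Exam1 and Label (0 = fail, 1 = pass).
misclass_for_split <- function(x, y, cut) {
  n_wrong <- function(side) {
    if (length(side) == 0) return(0)
    majority <- names(which.max(table(side)))
    sum(side != majority)
  }
  n_wrong(y[x < cut]) + n_wrong(y[x >= cut])
}

cuts   <- sort(unique(Exam$Exam1))
errors <- sapply(cuts, misclass_for_split, x = Exam$Exam1, y = Exam$Label)
cuts[which.min(errors)]   # the Exam1 value giving the fewest misclassifications
```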

18/28

Split criteria - purity/impurity metrics

It turns out that classification error is not sufficiently sensitive for tree-growing.

In practice two other measures are preferable, as they are more sensitive:

  • The Gini Index and
  • Information Entropy.

They are both quite similar numerically.

Small values mean that a node contains mostly observations of a single class, referred to as node purity.
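For reference, the standard definitions (not shown on the slide) for region m with class proportions $\hat{p}_{mk}$ are the Gini index $$G = \sum_{k} \hat{p}_{mk}(1 - \hat{p}_{mk})$$ and the entropy $$D = - \sum_{k} \hat{p}_{mk} \log(\hat{p}_{mk}).$$ A small sketch comparing them on made-up proportions:

```r
# Impurity measures for one region, given its class proportions.
gini    <- function(p) sum(p * (1 - p))
entropy <- function(p) -sum(ifelse(p > 0, p * log(p), 0))  # treat 0 * log(0) as 0

p_mixed <- c(0.5, 0.5)   # half-and-half region
p_pure  <- c(1, 0)       # perfectly pure region

gini(p_mixed);    gini(p_pure)     # 0.5  and 0
entropy(p_mixed); entropy(p_pure)  # 0.69 and 0
```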

19/28

Example - predicting heart disease

Y: presence of heart disease (Yes/No)

X: heart and lung function measurements

## [1] "Age" "Sex" "ChestPain" "RestBP" "Chol" "Fbs"
## [7] "RestECG" "MaxHR" "ExAng" "Oldpeak" "Slope" "Ca"
## [13] "Thal" "AHD"

20/28

Deeper trees

Trees can be built deeper by:

  • decreasing the value of the complexity parameter cp, which sets the minimum improvement in impurity required to continue splitting
  • reducing the minsplit and minbucket parameters, which control the minimum number of observations in a node for a split to be attempted and the minimum number of observations allowed in a terminal node, respectively (see the rpart sketch below).
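A sketch of how these arguments might be passed to rpart via rpart.control; the parameter values are illustrative, and Heart is the same assumed data frame name as before.

```r
library(rpart)

# Smaller cp, minsplit and minbucket let the tree grow deeper.
# Values here are illustrative only; `Heart` is an assumed data frame name.
deep_tree <- rpart(
  AHD ~ ., data = Heart, method = "class",
  control = rpart.control(cp = 0.001, minsplit = 5, minbucket = 2)
)
```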

21/28

Tabulate true vs predicted to make a confusion table.

                      true
                      C1 (positive)    C2 (negative)
predicted   C1        a                b
            C2        c                d
  • Accuracy: (a+d)/(a+b+c+d)
  • Error: (b+c)/(a+b+c+d)
  • Sensitivity: a/(a+c) (true positive, recall)
  • Specificity: d/(b+d) (true negative)
  • Balanced accuracy: (sensitivity+specificity)/2
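A small sketch computing these measures from counts a, b, c and d (the numbers are made up):

```r
# Made-up confusion counts: rows are predicted, columns are true.
a <- 40; b <- 5; c <- 8; d <- 47

accuracy    <- (a + d) / (a + b + c + d)
error       <- (b + c) / (a + b + c + d)
sensitivity <- a / (a + c)   # true positive rate (recall)
specificity <- d / (b + d)   # true negative rate
balanced    <- (sensitivity + specificity) / 2

c(accuracy = accuracy, error = error, sensitivity = sensitivity,
  specificity = specificity, balanced_accuracy = balanced)
```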
22/28

Confusion and error

## Reference
## Prediction No Yes
## No 75 5
## Yes 11 58
## Accuracy
## 0.8926174
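Output of this form can be produced with caret::confusionMatrix. A sketch, where pred and truth are assumed names for factors of predicted and observed classes:

```r
library(caret)

# `pred` and `truth` are assumed factors with the same levels (e.g. "No"/"Yes").
confusionMatrix(data = pred, reference = truth)
```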
23/28

Example - Crabs

Physical measurements on WA crabs, males and females.

Data source: Campbell, N. A. & Mahon, R. J. (1974)

24/28

Example - Crabs

25/28

Comparing models

Classification tree

Linear discriminant classifier
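A hedged sketch of fitting both classifiers. It assumes the crabs data are the Leptograpsus crabs shipped as MASS::crabs (consistent with the Campbell & Mahon 1974 source) and that sex is the class being predicted from the physical measurements.

```r
library(MASS)
library(rpart)

# Assumes MASS::crabs (Campbell & Mahon, 1974); classify sex from measurements.
crabs_tree <- rpart(sex ~ FL + RW + CL + CW + BD, data = crabs, method = "class")
crabs_lda  <- lda(sex ~ FL + RW + CL + CW + BD, data = crabs)

crabs_tree
crabs_lda
```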

26/28

Strengths and Weaknesses

Strengths:

  • The decision rules provided by trees are very easy to explain and follow: a simple classification model.
  • Trees can handle a mix of predictor types, categorical and quantitative.
  • Trees can operate efficiently when there are missing values in the predictors.

Weaknesses:

  • The algorithm is greedy; a better final solution might be obtained by taking a second-best split earlier.
  • When the separation between classes lies in linear combinations of variables, trees struggle to provide a good classification.
27/28

👩‍💻 Made by a human with a computer

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

28/28

