<!-- background-color: #006DAE -->
<!-- class: middle center hide-slide-number -->

<div class="shade_black" style="width:60%;right:0;bottom:0;padding:10px;border: dashed 4px white;margin: auto;">
<i class="fas fa-exclamation-circle"></i> These slides are best viewed in Chrome and occasionally need to be refreshed if elements did not load properly. See <a href=/>here for PDF <i class="fas fa-file-pdf"></i></a>.
</div>

<br>

.white[Press the **right arrow** to progress to the next slide!]

---
background-image: url(images/bg1.jpg)
background-size: cover
class: hide-slide-number split-70 title-slide
count: false

.column.shade_black[.content[
<br>
# .monash-blue.outline-text[ETC1010: Introduction to Data Analysis]

<h2 class="monash-blue2 outline-text" style="font-size: 30pt!important;">Week 10, part A</h2>

<br>

<h2 style="font-weight:900!important;">Regression and Decision Trees</h2>

.bottom_abs.width100[

Lecturer: *Nicholas Tierney & Stuart Lee*

Department of Econometrics and Business Statistics
<i class="fas fa-envelope faa-float animated "></i>
nicholas.tierney@monash.edu

May 2020

<br>

]
]]

<div class="column transition monash-m-new delay-1s" style="clip-path:url(#swipe__clip-path);">
<div class="background-image" style="background-image:url('images/large.png');background-position: center;background-size:cover;margin-left:3px;">
<svg class="clip-svg absolute">
<defs>
<clipPath id="swipe__clip-path" clipPathUnits="objectBoundingBox">
<polygon points="0.5745 0, 0.5 0.33, 0.42 0, 0 0, 0 1, 0.27 1, 0.27 0.59, 0.37 1, 0.634 1, 0.736 0.59, 0.736 1, 1 1, 1 0, 0.5745 0" />
</clipPath>
</defs>
</svg>
</div>
</div>

---

# Recap: networks

---

# Overview

- What is a regression tree?
- How is it computed?
- Deciding when it's a good fit: MSE
- Comparison with linear models
- Using multiple variables
- Next class:
    + How does a classification tree differ from a regression tree?

---

# Example

<img src="lecture_10a_files/figure-html/simulated-data-1.png" width="100%" style="display: block; margin: auto;" />

---

# Let's predict Y using a linear model

```r
df_lm <- lm(y ~ x, df)
```

<img src="lecture_10a_files/figure-html/lm-1.png" width="100%" style="display: block; margin: auto;" />

---

# Assessing model fit

- Look at the residuals
- Look at the mean square error

---

# Looking at the residuals: this is bad!

<img src="lecture_10a_files/figure-html/resid-1.png" width="100%" style="display: block; margin: auto;" />

It basically looks like the data!

---

# Looking at the mean square error (MSE)

This is another way to assess a model: it is the average squared error of the model's predictions.

$$ MSE(y) = \frac{\sum_{i = 1}^{N} (y_i - \hat{y}_i)^2}{N} $$

In R code:

```r
library(broom) # for augment()
mse <- function(model){
  mod_aug <- augment(model)
  mod_aug %>%
    mutate(res_2 = .resid^2) %>%
    summarise(mse = mean(res_2)) %>%
    pull(mse)
}
mse(df_lm)
## [1] 3.216767
```

---

# Let's use a different model: "rpart"

```r
library(rpart)
# df_lm <- lm(y~x, data=df) - similar to lm! But rpart.
df_rp <- rpart(y~x, data=df)
```

--

<img src="lecture_10a_files/figure-html/unnamed-chunk-2-1.png" width="100%" style="display: block; margin: auto;" />

---

# Comparing lm vs rpart: Predictions

<img src="lecture_10a_files/figure-html/compare-1.png" width="100%" style="display: block; margin: auto;" />

---

# Comparing lm vs rpart: MSE

```r
# linear model
mse(df_lm)
## [1] 3.216767
# rpart model
mse(df_rp)
## [1] 0.4517498
```

--

The rpart model's MSE is much lower!

--

We can look at the residuals plotted against the values of x to get an idea of the fit.

---

# Comparing lm vs rpart: residuals

<img src="lecture_10a_files/figure-html/unnamed-chunk-4-1.png" width="100%" style="display: block; margin: auto;" />

---

# Comparing lm vs rpart: lm output

```
## 
## Call:
## lm(formula = y ~ x, data = df)
## 
## Coefficients:
## (Intercept)            x  
##      0.8806      -2.2165
```

---

# Comparing lm vs rpart: rpart output

```
## n= 100 
## 
## node), split, n, deviance, yval
##       * denotes terminal node
## 
##  1) root 100 359.245100  0.8081071  
##    2) x>=0.2775916 24  16.840100 -1.4822830  
##      4) x>=0.3817438 12   3.832238 -2.0814410 *
##      5) x< 0.3817438 12   4.392090 -0.8831252 *
##    3) x< 0.2775916 76 176.745400  1.5313880  
##      6) x< 0.1426085 61  41.562800  0.9365995  
##       12) x>=-0.3999242 50  24.519860  0.7035330  
##         24) x< 0.05905847 41  11.729940  0.4807175  
##           48) x>=-0.1455513 25   5.653876  0.2281914 *
##           49) x< -0.1455513 16   1.990829  0.8752895 *
##         25) x>=0.05905847  9   1.481498  1.7185820 *
##       13) x< -0.3999242 11   1.981477  1.9959930 *
##    7) x>=0.1426085 15  25.842970  3.9501960 *
```

---

# So what is going on?

- A linear model asks "What line fits through these points, to minimise the error?"
- A decision tree model asks "How can I best break the data into segments, to minimise some error?"
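The "break the data into segments" idea can be sketched in a few lines of base R. This is an illustration only, not the lecture's code: the tiny `x`/`y` vectors and the split point `x = 3.5` are invented, but it shows how predicting each segment's mean shrinks the squared error compared with predicting one overall mean:

```r
# Toy data: y is high for small x, and low for large x
x <- 1:6
y <- c(10, 12, 11, 2, 3, 4)

# No split: predict the overall mean everywhere
ss_total <- sum((y - mean(y))^2)

# One split at x = 3.5: predict each segment's mean
left  <- y[x <= 3.5]
right <- y[x >  3.5]
ss_split <- sum((left - mean(left))^2) +
  sum((right - mean(right))^2)

ss_total  # 100
ss_split  # 4
```

A single well-chosen split drops the squared error from 100 to 4, and that reduction is exactly what the tree's splitting criterion measures.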
---

# A linear model: draws the line of best fit

<img src="lecture_10a_files/figure-html/unnamed-chunk-5-1.png" width="100%" style="display: block; margin: auto;" />

---

# A regression tree: segments the data to reduce mean error

<img src="lecture_10a_files/figure-html/unnamed-chunk-6-1.png" width="100%" style="display: block; margin: auto;" />

---

# Regression trees

- Regression trees recursively partition the data, and use the average response value of each partition as the model estimate
- It is a computationally intensive technique that examines all possible partitions, choosing the BEST partition by optimising some criterion
- For regression, with a quantitative response variable, the criterion to maximise is the ANOVA criterion:

`$$SS_T-(SS_L+SS_R)$$`

where `\(SS_T = \sum (y_i-\bar{y})^2\)`, and `\(SS_L, SS_R\)` are the equivalent values for the two subsets created by partitioning.

---

# Break down: What is `\(SS_T = \sum (y_i-\bar{y})^2\)` ?

<img src="lecture_10a_files/figure-html/unnamed-chunk-7-1.png" width="100%" style="display: block; margin: auto;" />

---

# Break down: What is `\(SS_T = \sum (y_i-\bar{y})^2\)` ?

<img src="lecture_10a_files/figure-html/what-is-error-1.png" width="100%" style="display: block; margin: auto;" />

---

# `\(SS_L\)` `\(SS_R\)`?

Choose a point, compare the left and right

<img src="lecture_10a_files/figure-html/ssl-ssr-1-1.png" width="100%" style="display: block; margin: auto;" />

---

# `\(SS_L\)` `\(SS_R\)`?

Choose a point, compare the left and right

<img src="lecture_10a_files/figure-html/ssl-ssr-2-1.png" width="100%" style="display: block; margin: auto;" />

---

# `\(SS_L\)` `\(SS_R\)`?

Choose a point, compare the left and right

<img src="lecture_10a_files/figure-html/ssl-ssr-3-1.png" width="100%" style="display: block; margin: auto;" />

---

# `\(SS_L\)` `\(SS_R\)`?
Choose a point, compare the left and right

<img src="lecture_10a_files/figure-html/ssl-ssr-4-1.png" width="100%" style="display: block; margin: auto;" />

---

# `\(SS_L\)` `\(SS_R\)`?

Choose a point, compare the left and right

<img src="lecture_10a_files/figure-html/fun-ssl-ssr-5-1.png" width="100%" style="display: block; margin: auto;" />

---

# Across all values of x?

<img src="lecture_10a_files/figure-html/unnamed-chunk-9-1.png" width="100%" style="display: block; margin: auto;" />

---

# And if we repeated this within each split

This is how the data is split:

<img src="lecture_10a_files/figure-html/show-split-1.png" width="100%" style="display: block; margin: auto;" />

---

# We can represent these splits in a tree format:

```r
library(rpart.plot)
rpart.plot(df_rp)
```

<img src="lecture_10a_files/figure-html/unnamed-chunk-10-1.png" width="100%" style="display: block; margin: auto;" />

---

# The model predictions with the splits

<img src="lecture_10a_files/figure-html/model-pred-1.png" width="100%" style="display: block; margin: auto;" />

---

# Your turn: compute a regression tree

Using the small data set, manually compute a regression tree model for the data. Sketch the model.

```r
d <- tibble(x=c(1, 2, 3, 4, 5),
            y=c(10, 12, 5, 4, 3))
d
ggplot(d, aes(x=x, y=y)) + geom_???()
```

---

# Stopping rules

- It's an algorithm, and it has to know when to stop.
- Why did it stop at 7 terminal nodes?
- Stopping rules are needed, else the algorithm will keep fitting until every observation is in its own group.
- Control parameters set stopping points:
    + **minsplit**: the minimum number of points in a node that the algorithm is allowed to split
    + **minbucket**: the minimum number of points in a terminal node
- We can also look at the change in value of `\(SS_T-(SS_L+SS_R)\)` at each split, and if the change is too *small*, stop.

---

# You can change the options to fit a different model

On re-fit, the model will change. Here we reduce the `minsplit` parameter.
```r
df_rp_m10 <- rpart(y~x, data=df,
*                  control = rpart.control(minsplit = 2))
```

---

# This yields a (slightly) more complex model.

<img src="lecture_10a_files/figure-html/unnamed-chunk-14-1.png" width="100%" style="display: block; margin: auto;" />

---

# Beyond one variable

- So far we have only considered cases with one explanatory variable:

```r
rpart(y ~ x)
```

- When given multiple variables, a decision tree will **only use variables that provide the best splits**
- This means that we can identify variables that are important for predicting an outcome.
- This is called "variable importance"

---

# Variable importance

- After calculating all the potential splits, each variable is given an importance value that is typically based on the number of times it was used in splitting, and its order in the splits
- The earlier the split, the more important the variable.
- These "importance values" are usually scaled to sum to 100
- But the numbers themselves are arbitrary
- Let's explore this in the next exercise, "10-exercise-lab-1.Rmd"

---
class: transition

# Wrapping up

---

# Strengths

- There are no parametric assumptions underlying partitioning methods
- Can handle data of unusual shapes and sizes
- Can identify unusual groups of data
- Provides a tree-based graphic that is fun to interpret
- Has an efficient heuristic for handling missing values
- The method can be influenced by outliers, but it tends to isolate them in their own partition

---

# Weaknesses

- Doesn't handle linear data very well
- Can require tuning parameters to get a good model fit
- There is no simple formula for the model as a result, and no inference about populations available
- E.g., you can't say things like: "For every one unit increase in weight, we expect height to increase by XX amount".

---

# Next class:

- Classification trees