layout: true <div class="my-footer"><a href='https://jthomasmock.github.io/model-penguins/'>https://jthomasmock.github.io/model-penguins/</a></div> --- class: title-slide, left, top # <span style='color:white;'>Intro to `tidymodels`</span> ## <span style='color:white;'>Tom Mock: [`@thomas_mock`](https://twitter.com/thomas_mock)</span> ### 2021-12-16 <span style='color:white; font-size:110%;'>Slides: [jthomasmock.github.io/model-penguins/](https://jthomasmock.github.io/model-penguins/#1)</span> <span style='color:white; font-size:110%;'>Content: [github.com/jthomasmock/model-penguins](https://github.com/jthomasmock/model-penguins)</span> <span style='color:white;'>Slides released under</span> [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/)
] <div style = "position: absolute;top: 100px;right: 50px;"><img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tidymodels.svg" alt="The tidymodels hex logo" height="450"></img></div> --- ### Focus for Today 60 Minutes Binary classification: - Logistic Regression - Random Forest --- ### Level-Setting As much as I'd love to learn and teach *all* of Machine Learning/Statistics in 60 min... -- It's just not possible! -- ### Goals for today * Make you comfortable with the **syntax** and **packages** via `tidymodels` unified interface * So when you're learning or modeling on your own, you get to focus on the **stats** rather than re-learning different APIs over and over... -- Along the way, we'll cover minimal examples and then some more quick best practices where `tidymodels` makes it easier to do more things! --- # `tidymodels` `tidymodels` is a collection of packages for modeling and machine learning using `tidyverse` principles. ## Packages * `rsample`: efficient data splitting and resampling * `parsnip`: tidy, unified interface to models * `recipes`: tidy interface to data pre-processing tools for feature engineering * `workflows`: bundles your pre-processing, modeling, and post-processing * `tune`: helps optimize the hyperparameters and pre-processing steps * `yardstick`: measures the performance metrics * `dials`: creates and manages tuning parameters/grids * `broom`: converts common R statistical objects into predictable formats * [`broom` available methods](https://broom.tidymodels.org/articles/available-methods.html) --- class:inverse, center, middle # Tidy Machine Learning w/ `tidymodels` <img src="data:image/png;base64,#https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tidymodels.svg" width="25%" /> --- ### Core ideas for Today A workflow for `tidy` machine learning * Split the data * Pre-Process and Choose a Model * Combine into a Workflow * Generate Predictions and Assess Model Metrics --- ### Classification Showing two examples today, comparing their outcomes, and then giving you the chance to explore on your own! 
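Not shown on the slides but assumed throughout: a short setup chunk along these lines attaches the packages used in the examples (a sketch; `library(tidymodels)` attaches the component packages listed above).

```r
# Minimal setup sketch (an assumption, not part of the original deck)
library(tidymodels)      # attaches rsample, parsnip, recipes, workflows, tune, yardstick, dials, broom, ...
library(palmerpenguins)  # the penguins data
library(vip)             # variable importance plots used later
# the ranger package also needs to be installed for the random forest engine
```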
--- ### The Dataset Palmer penguins dataset - a modern replacement for R's iris dataset -- #### The goal: Binary classification of sex for the various penguin species -- ```r glimpse(palmerpenguins::penguins) ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … ## $ sex <fct> male, female, female, NA, female, male, female, male… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007… ``` --- ### Penguins ```r penguins %>% filter(!is.na(sex)) %>% ggplot(aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) + geom_point(alpha = 0.5) + facet_wrap(~species) + scale_color_manual(values = c("purple","cyan4")) + theme_minimal() ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-4-1.png" width="576" /> --- ### Core ideas for Today A workflow for `tidy` machine learning * Split the data * Pre-Process and Choose a Model * Combine into a Workflow * Generate Predictions and Assess Model Metrics --- ### Split ```r split_data <- initial_split(data, 0.75) train_data <- training(split_data) test_data <- testing(split_data) ``` --- ### Pre-Process & choose a model ```r model_recipe <- recipe(pred ~ predictors, data = train_data) ``` ```r # Choose a model and an engine lr_mod <- logistic_reg(mode = "classification") %>% set_engine("glm") ``` --- ### Combine into a workflow ```r # Combine the model and recipe to the workflow lr_wflow <- workflow() %>% add_recipe(model_recipe) %>% add_model(lr_mod) # Fit/train the model model_fit <- lr_wflow %>% fit(data = train_data) ``` --- ### Predict and get metrics ```r # Get predictions, then add the true outcome and the class probabilities pred_lr <- predict(model_fit, test_data) %>% bind_cols(select(test_data, pred)) %>% bind_cols(predict(model_fit, test_data, type = "prob")) # Check metrics pred_lr %>% metrics(truth = pred, .pred_class) ``` --- .small[ #### **Split** ```r # Split penguin_split <- initial_split(penguins_df, strata = sex) # Split into test/train penguin_train <- training(penguin_split) penguin_test <- testing(penguin_split) ``` #### **Pre-Process & Choose a model** ```r penguin_rec <- recipe(sex ~ ., data = penguins_df) # Choose a model and an engine glm_spec <- logistic_reg(mode = "classification") %>% set_engine("glm") ``` #### **Combine into a workflow** ```r # Combine the model and recipe to the workflow penguin_wf <- workflow() %>% add_recipe(penguin_rec) # Fit/train the model penguin_fit_lr <- penguin_wf %>% add_model(glm_spec) %>% fit(data = penguin_train) ``` #### **Predict and get metrics** ```r # Get predictions penguin_pred_lr <- predict(penguin_fit_lr, penguin_test) %>% bind_cols(penguin_test %>% select(sex)) %>% # Add back a "truth" column for what the actual sex was bind_cols(predict(penguin_fit_lr, penguin_test, type = "prob")) # Get probabilities for the class for each observation # Check metrics penguin_pred_lr %>% # get Area under Curve roc_auc(truth = sex, .pred_female) ``` ] --- class: inverse, center, middle # `rsample` ![](data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/rsample/master/man/figures/logo.png) --- ### `rsample` Now that we've shown the overview, we'll start
with `tidymodels` proper. `rsample` at a minimum does your train/test split, but also takes care of things like bootstrapping, stratification, v-fold cross validation, validation splits, rolling origin, etc. --- ## Data cleaning ```r penguins_df <- penguins %>% filter(!is.na(sex)) %>% select(-year, -island) glimpse(penguins_df) ``` ``` ## Rows: 333 ## Columns: 6 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6… ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2… ## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18… ## $ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800… ## $ sex <fct> male, female, female, female, male, female, male, fe… ``` --- ## Data Splitting w/ `rsample` Do the initial split and stratify by sex to make sure there are equal ratios of male vs female in `test` and `train`. ```r set.seed(123) penguin_split <- initial_split(penguins_df, strata = sex) penguin_split ``` ``` ## <Analysis/Assess/Total> ## <249/84/333> ``` ```r # separate the training data penguin_train <- training(penguin_split) # separate the testing data penguin_test <- testing(penguin_split) ``` --- ### Test vs Train Split into `penguin_train` and `penguin_test` and then confirm the ratios. ```r penguin_train %>% count(sex) %>% mutate(ratio = n/sum(n)) ``` ``` ## # A tibble: 2 × 3 ## sex n ratio ## <fct> <int> <dbl> ## 1 female 123 0.494 ## 2 male 126 0.506 ``` ```r penguin_test %>% count(sex) %>% mutate(ratio = n/sum(n)) ``` ``` ## # A tibble: 2 × 3 ## sex n ratio ## <fct> <int> <dbl> ## 1 female 42 0.5 ## 2 male 42 0.5 ``` --- ## Bootstraps We can also do our bootstraps very quickly! > A bootstrap sample of the training set is a sample that is the same size as the training set but is drawn with replacement ```r set.seed(123) penguin_boot <- bootstraps(penguin_train) penguin_boot ``` ``` ## # Bootstrap sampling ## # A tibble: 25 × 2 ## splits id ## <list> <chr> ## 1 <split [249/93]> Bootstrap01 ## 2 <split [249/91]> Bootstrap02 ## 3 <split [249/90]> Bootstrap03 ## 4 <split [249/91]> Bootstrap04 ## 5 <split [249/85]> Bootstrap05 ## 6 <split [249/87]> Bootstrap06 ## 7 <split [249/94]> Bootstrap07 ## 8 <split [249/88]> Bootstrap08 ## 9 <split [249/95]> Bootstrap09 ## 10 <split [249/89]> Bootstrap10 ## # … with 15 more rows ``` --- class: inverse, center, middle # Model `recipes` ![](data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/recipes/master/man/figures/logo.png) --- ## Add recipe steps with `recipes` `recipe` steps are changes we make to the dataset, including things like centering, dummy encoding, updating columns to ID-only roles, or even custom feature engineering. --- ### In `recipes` vs `dplyr`/`tidyr` * In `tidyverse`, you can do reshaping or basic cleaning * In `recipes` it's best to do statistical transformations or other steps intended for modeling -- Recommended preprocessing per model being fit in ["Tidy Modeling with R" appendix](https://www.tmwr.org/pre-proc-table.html) - **dummy**: Do qualitative predictors require a numeric encoding (e.g. via dummy variables or other methods)? - **zv**: Should columns with a single unique value be removed? - **impute**: If some predictors are missing, should they be estimated via imputation? - **decorrelate**: If there are correlated predictors, should this correlation be mitigated? This might mean filtering out predictors, using principal component analysis, or a model-based technique. - **normalize**: Should predictors be centered and scaled? - **transform**: Is it helpful to transform predictors to be more symmetric?
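As a sketch (not part of the original deck), a recipe touching several of these recommendations could look like the following; which steps you actually need depends on the model you plan to fit, and the object name `penguin_rec_extra` is just for illustration.

```r
# Example recipe steps matching the recommendations above (illustrative only;
# none of these are required for the glm/ranger models used in this deck)
penguin_rec_extra <- recipe(sex ~ ., data = penguin_train) %>%
  step_impute_median(all_numeric_predictors()) %>% # impute: fill in missing numeric values
  step_dummy(all_nominal_predictors()) %>%         # dummy: numeric encoding of qualitative predictors
  step_zv(all_predictors()) %>%                    # zv: drop zero-variance columns
  step_corr(all_numeric_predictors()) %>%          # decorrelate: filter highly correlated predictors
  step_normalize(all_numeric_predictors())         # normalize: center and scale
```

Step order matters (e.g. impute and encode before you normalize); see the `recipes` documentation for the recommended ordering.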
--- ### `usemodels` Relatively early in its package life cycle, but helps with boilerplate ```r # can use your real data to create the repeated portions usemodels::use_ranger(sex ~ ., penguin_train) ``` ``` ## ranger_recipe <- ## recipe(formula = sex ~ ., data = penguin_train) ## ## ranger_spec <- ## rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% ## set_mode("classification") %>% ## set_engine("ranger") ## ## ranger_workflow <- ## workflow() %>% ## add_recipe(ranger_recipe) %>% ## add_model(ranger_spec) ## ## set.seed(12151) ## ranger_tune <- ## tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points")) ``` --- class:inverse, center, middle # `parsnip` <img src="data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/parsnip/master/man/figures/logo.png" width="25%" /> --- ## Choose a model and start your engines! `parsnip` supplies a general modeling interface to the wide world of R models! ```r # Note that mode = "classification" is the default here anyway! glm_spec <- logistic_reg() %>% set_engine("glm") # different implementations of same "type" of algorithm ``` --- ## Recipes Again, `recipes` are used for pre-processing and feature engineering, but at a minimum they also define the model formula of outcome and predictors (i.e. `sex ~ species + bill_length_mm`) ```r penguin_rec <- recipe(sex ~ ., data = penguins_df) ``` --- class:inverse, center, middle # `workflows` <img src="data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/workflows/master/man/figures/logo.png" width="25%" /> --- ### Combine into a `workflow` We can now combine the model and recipe into a `workflow` - this allows us to define exactly what model and data are going into our `fit`/train call. ```r penguin_wf <- workflow() %>% add_recipe(penguin_rec) ``` ### What is a `workflow`? A workflow is an object that can bundle together your pre-processing, modeling, and post-processing requests. If you have a `recipe` + `parsnip` model, these can be combined into a workflow. The advantages are: * You don’t have to keep track of separate objects in your workspace. * The recipe prepping and model fitting can be executed using a single call to `fit()`. * If you have custom tuning parameter settings, these can be defined using a simpler interface when combined with `tune`. --- ### Steps so far - Build a recipe for any pre-processing - Choose and build a model - Combine them into a `workflow` --- ## Fit/train the model with `parsnip` Now we can move forward with fitting/training the model - this is really a one-liner. ```r penguin_fit_lr <- penguin_wf %>% add_model(glm_spec) %>% fit(data = penguin_train) # fit the model against the training data ```
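An optional check that isn't in the original slides: since `broom` is part of `tidymodels`, you can pull the underlying `parsnip` fit out of the trained workflow and tidy it to inspect the glm coefficients (a sketch).

```r
# Optional peek at the fitted logistic regression (illustrative, not on the original slides)
penguin_fit_lr %>%
  pull_workflow_fit() %>% # extract the parsnip model fit from the workflow
  tidy()                  # coefficient estimates, standard errors, p-values
```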
--- ## Predict outcomes with `parsnip` After the model has been trained, we can generate predictions and compare them against the held-out testing data. ```r penguin_pred_lr <- predict(penguin_fit_lr, penguin_test) %>% # Add back a "truth" column for what the actual sex was bind_cols(penguin_test %>% select(sex)) %>% # Get probabilities for the class for each observation bind_cols(predict(penguin_fit_lr, penguin_test, type = "prob")) ``` ``` ## # A tibble: 84 × 4 ## .pred_class sex .pred_female .pred_male ## <fct> <fct> <dbl> <dbl> ## 1 female female 0.597 0.403 ## 2 female female 0.928 0.0724 ## 3 female female 0.647 0.353 ## 4 male female 0.219 0.781 ## 5 male male 0.0132 0.987 ## 6 female female 0.970 0.0298 ## 7 male male 0.0000232 1.00 ## 8 female female 0.872 0.128 ## 9 female female 0.998 0.00250 ## 10 male male 0.00000253 1.00 ## # … with 74 more rows ``` --- ## Predict outcomes with `parsnip` The previous `predict() %>% bind_cols() %>% bind_cols()` code is equivalent to the below: ```r penguin_pred_lr_last <- last_fit( glm_spec, recipe(sex ~ ., data = penguins_df), split = penguin_split) penguin_pred_lr_last %>% pluck(".predictions", 1) ``` ``` ## # A tibble: 84 × 6 ## .pred_female .pred_male .row .pred_class sex .config ## <dbl> <dbl> <int> <fct> <fct> <chr> ## 1 0.597 0.403 2 female female Preprocessor1_Model1 ## 2 0.928 0.0724 3 female female Preprocessor1_Model1 ## 3 0.647 0.353 4 female female Preprocessor1_Model1 ## 4 0.219 0.781 18 male female Preprocessor1_Model1 ## 5 0.0132 0.987 25 male male Preprocessor1_Model1 ## 6 0.970 0.0298 28 female female Preprocessor1_Model1 ## 7 0.0000232 1.00 31 male male Preprocessor1_Model1 ## 8 0.872 0.128 34 female female Preprocessor1_Model1 ## 9 0.998 0.00250 38 female female Preprocessor1_Model1 ## 10 0.00000253 1.00 39 male male Preprocessor1_Model1 ## # … with 74 more rows ``` --- class: inverse, middle, center # Assessing Accuracy with `yardstick` ![](data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/yardstick/master/man/figures/logo.png) --- ### Check outcomes with `yardstick` For confirming how well the model predicts, we can use `yardstick` to plot ROC curves, get AUC and collect general metrics. .small[ .pull-left[ ```r penguin_pred_lr %>% # get Area under Curve roc_auc(truth = sex, .pred_female) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.938 ``` ```r penguin_pred_lr %>% # collect and report metrics metrics(truth = sex, .pred_class) ``` ``` ## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.857 ## 2 kap binary 0.714 ``` ] ] -- .pull-right[ .small[ ```r penguin_pred_lr %>% # calculate ROC curve roc_curve(truth = sex, .pred_female) %>% autoplot() ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-28-1.png" width="360" /> ] ] --- ### Note on Checking Outcomes You *could* use `last_fit()`: > This function is intended to be used after fitting a variety of models and the final tuning parameters (if any) have been finalized. The next step would be to fit using the entire training set and verify performance using the test data.
```r lr_last_fit <- last_fit(glm_spec, recipe(sex ~ ., data = penguins_df), split = penguin_split) collect_metrics(lr_last_fit) ``` ``` ## # A tibble: 2 × 4 ## .metric .estimator .estimate .config ## <chr> <chr> <dbl> <chr> ## 1 accuracy binary 0.857 Preprocessor1_Model1 ## 2 roc_auc binary 0.938 Preprocessor1_Model1 ``` --- .small[ #### **Split** ```r # Split penguin_split <- initial_split(penguins_df, strata = sex) # Split into test/train penguin_train <- training(penguin_split) penguin_test <- testing(penguin_split) ``` #### **Pre-Process & Choose a model** ```r penguin_rec <- recipe(sex ~ ., data = penguins_df) # Choose a model and an engine glm_spec <- logistic_reg(mode = "classification") %>% set_engine("glm") ``` #### **Combine into a workflow** ```r # Combine the model and recipe to the workflow penguin_wf <- workflow() %>% add_recipe(penguin_rec) # Fit/train the model penguin_fit_lr <- penguin_wf %>% add_model(glm_spec) %>% fit(data = penguin_train) ``` #### **Predict and get metrics** ```r # Get predictions penguin_pred_lr <- predict(penguin_fit_lr, penguin_test) %>% bind_cols(penguin_test %>% select(sex)) %>% # Add back a "truth" column for what the actual sex was bind_cols(predict(penguin_fit_lr, penguin_test, type = "prob")) # Get probabilities for the class for each observation # Check metrics penguin_pred_lr %>% # get Area under Curve roc_auc( truth = sex, .pred_female ) ``` ] --- ## Change the model How about a Random Forest model? Just change the model and re-run! ```r rf_mod <- rand_forest(trees = 100) %>% set_engine("ranger", importance = "impurity", # variable importance num.threads = 4) %>% # Parallelize set_mode("classification") rf_wflow <- workflow() %>% add_recipe(penguin_rec) %>% # Same recipe * add_model(rf_mod) # New model *penguin_fit_rf <- rf_wflow %>% # New workflow fit(data = penguin_train) # Fit the Random Forest # Get predictions and check metrics penguin_pred_rf <- predict(penguin_fit_rf, penguin_test) %>% bind_cols(penguin_test %>% select(sex)) %>% bind_cols(predict(penguin_fit_rf, penguin_test, type = "prob")) ``` --- ### Feature Importance .small[ ```r penguin_fit_rf %>% pull_workflow_fit() %>% vip() ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-35-1.png" width="720" /> ] --- ### Quick Model Comparison The random forest model is basically comparable to a logistic regression for this toy example. ```r penguin_pred_lr %>% # Logistic Regression predictions metrics(truth = sex, .pred_class) ``` ``` ## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.857 ## 2 kap binary 0.714 ``` ```r penguin_pred_rf %>% # Random Forest predictions metrics(truth = sex, .pred_class) ``` ``` ## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.881 ## 2 kap binary 0.762 ``` --- ### Quick Model Comparison ```r penguin_pred_lr %>% # Logistic Regression predictions conf_mat(truth = sex, .pred_class) ``` ``` ## Truth ## Prediction female male ## female 37 7 ## male 5 35 ``` ```r penguin_pred_rf %>% # Random Forest predictions conf_mat(truth = sex, .pred_class) ``` ``` ## Truth ## Prediction female male ## female 37 5 ## male 5 37 ``` --- ### Comparing Models Together .pull-left[ ```r roc_rf <- penguin_pred_rf %>% roc_curve(truth = sex, .pred_female) %>% mutate(model = "Random Forest") roc_lr <- penguin_pred_lr %>% roc_curve(truth = sex, .pred_female) %>% mutate(model = "Logistic Regression") full_plot <- bind_rows(roc_rf, roc_lr) %>% # Note that autoplot() works here!
ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) + geom_path(lwd = 1, alpha = 0.5) + geom_abline(lty = 3) + scale_color_manual( values = c("#374785", "#E98074") ) + theme_minimal() + theme(legend.position = "top", legend.title = element_blank()) ``` ] .pull-right[ ```r full_plot ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-38-1.png" width="432" /> ] --- ### Quick Re-Cap A workflow for `tidy` modeling * Split the data * Pre-Process and Choose a Model * Combine into a Workflow * Generate Predictions and Assess Model Metrics So the unified interface hopefully makes the idea of learning and applying many algorithms easier. -- `tidymodels` *really* shines when you start to go further or apply best practices like: * Resampling, Cross Validation, Bootstrapping * Model Tuning and Model Optimization * Grid Search, Iterative Search --- class: inverse, middle, center ## A Deeper Dive on Best Practices --- ### Comparing Models Previously we've just compared two models by seeing how accurate they were on our `testing` data, but.... > The test set as the data that *should* be used to conduct a proper evaluation of model performance on the **final model(s)**. This begs the question of, “How can we tell what is best if we don’t measure performance until the test set?” -- > However, we usually need to understand the effectiveness of the model *before using the test set*. - [*Tidy Modeling with R*](https://www.tmwr.org/resampling.html#resampling) --- ### Bootstrap Resampling > Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation. Most resampling methods are iterative, meaning that this process is repeated multiple times. > A bootstrap sample of the training set is a sample that is the same size as the training set but is drawn with replacement -- [Get Started w/ Resampling](https://www.tidymodels.org/start/resampling/) and test drive on [RStudio Cloud](https://rstudio.cloud/project/1479888). [*Tidy Modeling with R*](https://www.tmwr.org/resampling.html#resampling-methods) --- ### Bootstrap resampling ```r set.seed(123) penguin_boot <- bootstraps(penguin_train) penguin_boot ``` ``` ## # Bootstrap sampling ## # A tibble: 25 × 2 ## splits id ## <list> <chr> ## 1 <split [249/93]> Bootstrap01 ## 2 <split [249/91]> Bootstrap02 ## 3 <split [249/90]> Bootstrap03 ## 4 <split [249/91]> Bootstrap04 ## 5 <split [249/85]> Bootstrap05 ## 6 <split [249/87]> Bootstrap06 ## 7 <split [249/94]> Bootstrap07 ## 8 <split [249/88]> Bootstrap08 ## 9 <split [249/95]> Bootstrap09 ## 10 <split [249/89]> Bootstrap10 ## # … with 15 more rows ``` --- ### Recipes ```r glm_spec <- logistic_reg() %>% set_engine("glm") ``` ```r rf_spec <- rand_forest() %>% set_mode("classification") %>% set_engine("ranger") ``` --- ### Estimate Performance w/ Resampling NOTE: Fitting the model multiple times can take a while with larger models or more folds/repeats! I recommend running this as a background job in RStudio, so you don't lock up your session for the duration. 
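One way to do that (a sketch that assumes the `rstudioapi` package and an interactive RStudio session; not part of the original deck) is to put the `fit_resamples()` code in its own script and launch it as a job:

```r
# "fit-resamples.R" is a hypothetical script containing the resampling code below
rstudioapi::jobRunScript(
  path       = "fit-resamples.R",
  workingDir = ".",
  importEnv  = TRUE,         # give the job access to penguin_wf, penguin_boot, etc.
  exportEnv  = "R_GlobalEnv" # copy the job's results back into your session
)
```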
```r glm_rs <- penguin_wf %>% add_model(glm_spec) %>% fit_resamples(resamples = penguin_boot, control = control_resamples(save_pred = TRUE)) glm_rs ``` ``` ## # Resampling results ## # Bootstrap sampling ## # A tibble: 25 × 5 ## splits id .metrics .notes .predictions ## <list> <chr> <list> <list> <list> ## 1 <split [249/93]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [93 ×… ## 2 <split [249/91]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [91 ×… ## 3 <split [249/90]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [90 ×… ## 4 <split [249/91]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [91 ×… ## 5 <split [249/85]> Bootstrap05 <tibble [2 × 4]> <tibble [1 × 1]> <tibble [85 ×… ## 6 <split [249/87]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [87 ×… ## 7 <split [249/94]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [94 ×… ## 8 <split [249/88]> Bootstrap08 <tibble [2 × 4]> <tibble [1 × 1]> <tibble [88 ×… ## 9 <split [249/95]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [95 ×… ## 10 <split [249/89]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [89 ×… ## # … with 15 more rows ``` --- ### Estimate Performance Random Forest as well! ```r rf_rs <- penguin_wf %>% add_model(rf_spec) %>% fit_resamples(resamples = penguin_boot, control = control_resamples(save_pred = TRUE)) rf_rs ``` ``` ## # Resampling results ## # Bootstrap sampling ## # A tibble: 25 × 5 ## splits id .metrics .notes .predictions ## <list> <chr> <list> <list> <list> ## 1 <split [249/93]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [93 ×… ## 2 <split [249/91]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [91 ×… ## 3 <split [249/90]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [90 ×… ## 4 <split [249/91]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [91 ×… ## 5 <split [249/85]> Bootstrap05 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [85 ×… ## 6 <split [249/87]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [87 ×… ## 7 <split [249/94]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [94 ×… ## 8 <split [249/88]> Bootstrap08 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [88 ×… ## 9 <split [249/95]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [95 ×… ## 10 <split [249/89]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [89 ×… ## # … with 15 more rows ``` --- ### What just happened??? We just fit a model for each resample, evaluated it against a within resample assessment set, and stored it into a single `tibble`! 
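As a rough mental model (a sketch of the idea, not `tune`'s actual internals; the function name is illustrative), each row of that tibble corresponds to something like:

```r
# What happens for ONE resample, conceptually
fit_one_resample <- function(split, wf) {
  fitted <- fit(wf, data = analysis(split))        # train on the analysis set
  preds <- predict(fitted, assessment(split), type = "prob") %>%
    bind_cols(predict(fitted, assessment(split))) %>%
    bind_cols(assessment(split) %>% select(sex))   # add the truth column
  roc_auc(preds, truth = sex, .pred_female)        # score on the assessment set
}

# e.g. fit_one_resample(penguin_boot$splits[[1]], penguin_wf %>% add_model(rf_spec))
```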
.small[ .pull-left[ ```r rf_rs %>% # grab specific columns and resamples pluck(".metrics", 1) ``` ``` ## # A tibble: 2 × 4 ## .metric .estimator .estimate .config ## <chr> <chr> <dbl> <chr> ## 1 accuracy binary 0.925 Preprocessor1_Model1 ## 2 roc_auc binary 0.974 Preprocessor1_Model1 ``` ] ] .small[ .pull-right[ ```r rf_rs %>% # grab specific columns and resamples pluck(".predictions", 10) ``` ``` ## # A tibble: 89 × 6 ## .pred_female .pred_male .row .pred_class sex .config ## <dbl> <dbl> <int> <fct> <fct> <chr> ## 1 0.929 0.0708 1 female female Preprocessor1_Model1 ## 2 0.918 0.0822 2 female female Preprocessor1_Model1 ## 3 0.953 0.0472 3 female female Preprocessor1_Model1 ## 4 0.707 0.293 9 female female Preprocessor1_Model1 ## 5 0.997 0.00262 14 female female Preprocessor1_Model1 ## 6 0.941 0.0591 18 female female Preprocessor1_Model1 ## 7 0.959 0.0410 19 female female Preprocessor1_Model1 ## 8 0.987 0.0127 20 female female Preprocessor1_Model1 ## 9 0.999 0.000622 21 female female Preprocessor1_Model1 ## 10 0.946 0.0537 24 female female Preprocessor1_Model1 ## # … with 79 more rows ``` ] ] --- ### What else can you do? .small[ .pull-left[ ```r # Summarize all metrics rf_rs %>% collect_metrics(summarize = TRUE) ``` ``` ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.913 25 0.00589 Preprocessor1_Model1 ## 2 roc_auc binary 0.976 25 0.00208 Preprocessor1_Model1 ``` ] ] .pull-right[ ```r rf_rs %>% # combine ALL predictions collect_predictions() ``` .small[ ``` ## # A tibble: 2,270 × 7 ## id .pred_female .pred_male .row .pred_class sex .config ## <chr> <dbl> <dbl> <int> <fct> <fct> <chr> ## 1 Bootstrap01 0.765 0.235 2 female female Preprocessor1_M… ## 2 Bootstrap01 0.966 0.0343 3 female female Preprocessor1_M… ## 3 Bootstrap01 0.684 0.316 8 female female Preprocessor1_M… ## 4 Bootstrap01 0.712 0.288 12 female female Preprocessor1_M… ## 5 Bootstrap01 0.985 0.0150 15 female female Preprocessor1_M… ## 6 Bootstrap01 0.877 0.123 18 female female Preprocessor1_M… ## 7 Bootstrap01 0.987 0.0128 19 female female Preprocessor1_M… ## 8 Bootstrap01 0.926 0.0740 27 female female Preprocessor1_M… ## 9 Bootstrap01 0.957 0.0430 28 female female Preprocessor1_M… ## 10 Bootstrap01 0.561 0.439 29 female female Preprocessor1_M… ## # … with 2,260 more rows ``` ] ] --- ### Collect metrics First, let's show our naive model's metrics, computed against our test data. -- .small[ .pull-left[ ```r set.seed(20201024) ``` ```r # Naive Model on Testing Data rf_compare_df <- bind_rows( accuracy( penguin_pred_rf, truth = sex, .pred_class ), roc_auc( penguin_pred_rf, truth = sex, .pred_female ) ) ``` And then what our resampled metrics look like, which still leaves our test data unseen.
```r combo_plot <- rf_rs %>% collect_metrics(summarize = FALSE) %>% ggplot(aes(x = .metric, y = .estimate)) + geom_jitter(width = 0.2) + geom_boxplot(width = 0.3, alpha = 0.5) + geom_point( data = rf_compare_df, color = "red", size = 5) ``` ] ] -- .tiny[ .pull-right[ <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-52-1.png" width="432" /> ] ] --- ### Estimate Performance w/ Bootstrap resamples ```r assess_res <- collect_predictions(rf_rs) assess_res ``` ``` ## # A tibble: 2,270 × 7 ## id .pred_female .pred_male .row .pred_class sex .config ## <chr> <dbl> <dbl> <int> <fct> <fct> <chr> ## 1 Bootstrap01 0.765 0.235 2 female female Preprocessor1_M… ## 2 Bootstrap01 0.966 0.0343 3 female female Preprocessor1_M… ## 3 Bootstrap01 0.684 0.316 8 female female Preprocessor1_M… ## 4 Bootstrap01 0.712 0.288 12 female female Preprocessor1_M… ## 5 Bootstrap01 0.985 0.0150 15 female female Preprocessor1_M… ## 6 Bootstrap01 0.877 0.123 18 female female Preprocessor1_M… ## 7 Bootstrap01 0.987 0.0128 19 female female Preprocessor1_M… ## 8 Bootstrap01 0.926 0.0740 27 female female Preprocessor1_M… ## 9 Bootstrap01 0.957 0.0430 28 female female Preprocessor1_M… ## 10 Bootstrap01 0.561 0.439 29 female female Preprocessor1_M… ## # … with 2,260 more rows ``` --- class: inverse, middle, center ## Model Tuning with `tune`<br><img src='https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tune.svg' height="150"></img> --- ### `tune` We never adjusted our model! We just used naive models and evaluated their performance. Now, their performance was pretty decent (~90% accuracy), but could we get better? -- [Get Started with Tuning](https://www.tidymodels.org/start/tuning/) and test drive on [RStudio Cloud](https://rstudio.cloud/project/1479888) --- ### Resample + Tune We're going to use grid-search for our tuning process, and we also need to specify which hyperparameters of our random forest we want to tune. > Note: A hyperparameter is a parameter whose value is used to control the learning process - [Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) .pull-left[ ```r tune_penguin_rf <- rand_forest( mtry = tune(), # add placeholder for tune trees = 100, min_n = tune() # add placeholder for tune ) %>% set_mode("classification") %>% set_engine("ranger") tune_rf_wf <- workflow() %>% add_recipe(penguin_rec) %>% add_model(tune_penguin_rf) ``` ] .small[ .pull-right[ ```r tune_rf_wf ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: rand_forest() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 0 Recipe Steps ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = tune() ## trees = 100 ## min_n = tune() ## ## Computational engine: ranger ``` ] ] --- ### Grid Search We'll create a grid of possible hyperparameters and then estimate how well they fit with our resamples. Note that this took about 20 min to run! I'm doing 15x models by 5x folds, where we train a model and predict outcomes each time! The beauty here is that you could run this as a background job. ```r set.seed(20210430) penguin_folds <- vfold_cv(penguin_train, v = 5) tune_res <- tune_grid( tune_rf_wf, resamples = penguin_folds, grid = 15, # 15 combos of model parameters control = control_grid(verbose = TRUE) ) ``` --- ### Grid Search Here are the results!
```r tune_res ``` ``` ## # Tuning results ## # 5-fold cross-validation ## # A tibble: 5 × 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [199/50]> Fold1 <tibble [30 × 6]> <tibble [0 × 1]> ## 2 <split [199/50]> Fold2 <tibble [30 × 6]> <tibble [0 × 1]> ## 3 <split [199/50]> Fold3 <tibble [30 × 6]> <tibble [0 × 1]> ## 4 <split [199/50]> Fold4 <tibble [30 × 6]> <tibble [0 × 1]> ## 5 <split [200/49]> Fold5 <tibble [30 × 6]> <tibble [0 × 1]> ``` --- ### Check it out It's nested tibbles for the split data, the fold id, metrics, and any notes. ```r # Essentially the same as tune_res[[".metrics"]][[1]] tune_res %>% pluck(".metrics", 3) ``` ``` ## # A tibble: 30 × 6 ## mtry min_n .metric .estimator .estimate .config ## <int> <int> <chr> <chr> <dbl> <chr> ## 1 2 7 accuracy binary 0.9 Preprocessor1_Model01 ## 2 2 7 roc_auc binary 0.976 Preprocessor1_Model01 ## 3 3 16 accuracy binary 0.88 Preprocessor1_Model02 ## 4 3 16 roc_auc binary 0.975 Preprocessor1_Model02 ## 5 4 21 accuracy binary 0.9 Preprocessor1_Model03 ## 6 4 21 roc_auc binary 0.968 Preprocessor1_Model03 ## 7 2 38 accuracy binary 0.86 Preprocessor1_Model04 ## 8 2 38 roc_auc binary 0.971 Preprocessor1_Model04 ## 9 1 8 accuracy binary 0.9 Preprocessor1_Model05 ## 10 1 8 roc_auc binary 0.985 Preprocessor1_Model05 ## # … with 20 more rows ``` --- ### Check it out .small[ .pull-left[ ```r plot_tuned <- tune_res %>% collect_metrics() %>% filter(.metric == "roc_auc") %>% dplyr::select(mean, mtry:min_n) %>% pivot_longer(mtry:min_n, values_to = "value", names_to = "parameter" ) %>% ggplot(aes(value, mean, color = parameter)) + geom_point(alpha = 0.8, show.legend = FALSE) + facet_wrap(~parameter, scales = "free_x", ncol = 1) + labs(x = NULL, y = "AUC") ``` ] ] .tiny[ .pull-right[ <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-61-1.png" width="288" /> ] ] --- ### Check it out (scaling matters!) .small[ .pull-left[ ```r plot_tuned <- tune_res %>% collect_metrics() %>% filter(.metric == "roc_auc") %>% dplyr::select(mean, mtry:min_n) %>% pivot_longer(mtry:min_n, values_to = "value", names_to = "parameter" ) %>% ggplot(aes(value, mean, color = parameter)) + geom_point(alpha = 0.8, show.legend = FALSE) + facet_wrap(~parameter, scales = "free_x", ncol = 1) + labs(x = NULL, y = "AUC") ``` ] ] .small[ .pull-right[ ```r plot_tuned + scale_y_continuous(limits = c(0.75, 0.98)) ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-63-1.png" width="288" /> ] ] --- ### Finalize Here we are investigating which hyperparameters maximized ROC Area Under the Curve. ```r # Which 5x were best? show_best(tune_res, "roc_auc", n = 5) ``` ``` ## # A tibble: 5 × 8 ## mtry min_n .metric .estimator mean n std_err .config ## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 3 roc_auc binary 0.982 5 0.00437 Preprocessor1_Model12 ## 2 1 8 roc_auc binary 0.981 5 0.00540 Preprocessor1_Model05 ## 3 2 7 roc_auc binary 0.977 5 0.00609 Preprocessor1_Model01 ## 4 2 12 roc_auc binary 0.975 5 0.00701 Preprocessor1_Model07 ## 5 3 11 roc_auc binary 0.974 5 0.00601 Preprocessor1_Model14 ``` ```r # Select the best best_fit_auc <- select_best(tune_res, "roc_auc") # Select wflow for the model with best hyperparams rf_tuned <- finalize_workflow( rf_wflow, parameters = best_fit_auc ) ``` --- ### Finalize Show the outcomes! 
```r set.seed(20201024) rf_tuned_fit <- last_fit(rf_tuned, penguin_split) rf_tuned_fit %>% # tuned model metrics collect_metrics() ``` ``` ## # A tibble: 2 × 4 ## .metric .estimator .estimate .config ## <chr> <chr> <dbl> <chr> ## 1 accuracy binary 0.857 Preprocessor1_Model1 ## 2 roc_auc binary 0.928 Preprocessor1_Model1 ``` ```r rf_compare_df # naive model metrics ``` ``` ## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.881 ## 2 roc_auc binary 0.927 ``` --- ### Addendums - Model training/fitting (or simulation) is likely to be the most time-intensive computation you do - as such, it's a good idea to run them as [**background jobs** in RStudio](https://blog.rstudio.com/2019/03/14/rstudio-1-2-jobs/) - Also can turn on verbose reporting so you know where you're at in the Cross-validation or tuning steps - `control_grid(verbose = TRUE)` --- ### Thank you * All y'all for listening in 🤠 ### Learn more * [`tidymodels`.org](https://www.tidymodels.org/learn/) has step by step guides of various complexities * Julia Silge's (a `tidymodels` maintainer) [blog](https://juliasilge.com/), [video series](https://www.youtube.com/channel/UCTTBgWyJl2HrrhQOOc710kA), or [free interactive course](https://supervised-ml-course.netlify.app/) * [Tidy Modeling with R](https://www.tmwr.org/) - get started quickly with `tidymodels` - [Introduction to Statistical Learning](https://www.statlearning.com/) - understand the math (new edition just came out!) - [Hands on Machine Learning with R](https://bradleyboehmke.github.io/HOML/) - get started quickly with modeling in R (mix of base R, `caret`, and `tidymodels`)