layout: true <div class="my-footer"><a href='https://jthomasmock.github.io/model-penguins/'>https://jthomasmock.github.io/model-penguins/</a></div> --- class: title-slide, left, top # <span style='color:white;'>Intro to `tidymodels`</span> ## <span style='color:white;'>Tom Mock: [`@thomas_mock`](https://twitter.com/thomas_mock)</span> ### 2021-12-16 <span style='color:white; font-size:110%;'>Slides: [jthomasmock.github.io/model-penguins/](https://jthomasmock.github.io/model-penguins/#1)</span> <span style='color:white; font-size:110%;'>Content: [github.com/jthomasmock/model-penguins](https://github.com/jthomasmock/model-penguins)</span> <span style='color:white;'>Slides released under</span> [CC-BY 2.0](https://creativecommons.org/licenses/by/2.0/)
] <div style = "position: absolute;top: 100px;right: 50px;"><img src="https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tidymodels.svg" alt="The tidymodels hex logo" height="450"></img></div> --- ### Focus for Today 60 Minutes Binary classification: - Logistic Regression - Random Forest --- ### Level-Setting As much as I'd love to learn and teach *all* of Machine Learning/Statistics in 60 min... -- It's just not possible! -- ### Goals for today * Make you comfortable with the **syntax** and **packages** via `tidymodels` unified interface * So when you're learning or modeling on your own, you get to focus on the **stats** rather than re-learning different APIs over and over... -- Along the way, we'll cover minimal examples and then some more quick best practices where `tidymodels` makes it easier to do more things! --- # `tidymodels` `tidymodels` is a collection of packages for modeling and machine learning using `tidyverse` principles. ## Packages * `rsample`: efficient data splitting and resampling * `parsnip`: tidy, unified interface to models * `recipes`: tidy interface to data pre-processing tools for feature engineering * `workflows`: bundles your pre-processing, modeling, and post-processing * `tune`: helps optimize the hyperparameters and pre-processing steps * `yardstick`: measures the performance metrics * `dials`: creates and manages tuning parameters/grids * `broom`: converts common R statistical objects into predictable formats * [`broom` available methods](https://broom.tidymodels.org/articles/available-methods.html) --- class:inverse, center, middle # Tidy Machine Learning w/ `tidymodels` <img src="data:image/png;base64,#https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tidymodels.svg" width="25%" /> --- ### Core ideas for Today A workflow for `tidy` machine learning * Split the data * Pre-Process and Choose a Model * Combine into a Workflow * Generate Predictions and Assess Model Metrics --- ### Classification Showing two examples today, comparing their outcomes, and then giving you the chance to explore on your own! 
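Not shown on the slides but assumed throughout: a short setup chunk along these lines attaches the packages used in the examples (a sketch; `library(tidymodels)` attaches the component packages listed above).

```r
# Minimal setup sketch (an assumption, not part of the original deck)
library(tidymodels)      # attaches rsample, parsnip, recipes, workflows, tune, yardstick, dials, broom, ...
library(palmerpenguins)  # the penguins data
library(vip)             # variable importance plots used later
# the ranger package also needs to be installed for the random forest engine
```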
--- ### The Dataset Palmer penguins dataset - a modern replacement for R's iris dataset -- #### The goal: Binary classification of sex for the various penguin species -- ```r glimpse(palmerpenguins::penguins) ``` ``` ## Rows: 344 ## Columns: 8 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ island <fct> Torgersen, Torgersen, Torgersen, Torgersen, Torgerse… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, NA, 36.7, 39.3, 38.9, 39.2, 34.1, … ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, NA, 19.3, 20.6, 17.8, 19.6, 18.1, … ## $ flipper_length_mm <int> 181, 186, 195, NA, 193, 190, 181, 195, 193, 190, 186… ## $ body_mass_g <int> 3750, 3800, 3250, NA, 3450, 3650, 3625, 4675, 3475, … ## $ sex <fct> male, female, female, NA, female, male, female, male… ## $ year <int> 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007, 2007… ``` --- ### Penguins ```r penguins %>% filter(!is.na(sex)) %>% ggplot(aes(flipper_length_mm, bill_length_mm, color = sex, size = body_mass_g)) + geom_point(alpha = 0.5) + facet_wrap(~species) + scale_color_manual(values = c("purple","cyan4")) + theme_minimal() ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-4-1.png" width="576" /> --- ### Core ideas for Today A workflow for `tidy` machine learning * Split the data * Pre-Process and Choose a Model * Combine into a Workflow * Generate Predictions and Assess Model Metrics --- ### Split ```r split_data <- initial_split(data, 0.75) train_data <- training(split_data) test_data <- testing(split_data) ``` --- ### Pre-Process & choose a model ```r model_recipe <- recipe(pred ~ predictors, data = train_data) ``` ```r # Choose a model and an engine lr_mod <- logistic_reg(mode = "classification") %>% set_engine("glm") ``` --- ### Combine into a workflow ```r # Combine the model and recipe to the workflow lr_wflow <- workflow() %>% add_recipe(model_recipe) %>% add_model(lr_mod) # Fit/train the model model_fit <- lr_wflow %>% fit(data = train_data) ``` --- ### Predict and get metrics ```r # Get predictions, then add the true outcome and the class probabilities pred_lr <- predict(model_fit, test_data) %>% bind_cols(select(test_data, pred)) %>% bind_cols(predict(model_fit, test_data, type = "prob")) # Check metrics pred_lr %>% metrics(truth = pred, .pred_class) ``` --- .small[ #### **Split** ```r # Split penguin_split <- initial_split(penguins_df, strata = sex) # Split into test/train penguin_train <- training(penguin_split) penguin_test <- testing(penguin_split) ``` #### **Pre-Process & Choose a model** ```r penguin_rec <- recipe(sex ~ ., data = penguins_df) # Choose a model and an engine glm_spec <- logistic_reg(mode = "classification") %>% set_engine("glm") ``` #### **Combine into a workflow** ```r # Combine the model and recipe to the workflow penguin_wf <- workflow() %>% add_recipe(penguin_rec) # Fit/train the model penguin_fit_lr <- penguin_wf %>% add_model(glm_spec) %>% fit(data = penguin_train) ``` #### **Predict and get metrics** ```r # Get predictions penguin_pred_lr <- predict(penguin_fit_lr, penguin_test) %>% bind_cols(penguin_test %>% select(sex)) %>% # Add back a "truth" column for what the actual sex was bind_cols(predict(penguin_fit_lr, penguin_test, type = "prob")) # Get probabilities for the class for each observation # Check metrics penguin_pred_lr %>% # get Area under Curve roc_auc(truth = sex, .pred_female) ``` ] --- class: inverse, center, middle # `rsample` ![](data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/rsample/master/man/figures/logo.png) --- ### `rsample` Now that we've shown the overview, we'll start
with `tidymodels` proper. `rsample` at a minimum does your train/test split, but also takes care of things like bootstrapping, stratification, v-fold cross validation, validation splits, rolling origin, etc. --- ## Data cleaning ```r penguins_df <- penguins %>% filter(!is.na(sex)) %>% select(-year, -island) glimpse(penguins_df) ``` ``` ## Rows: 333 ## Columns: 6 ## $ species <fct> Adelie, Adelie, Adelie, Adelie, Adelie, Adelie, Adel… ## $ bill_length_mm <dbl> 39.1, 39.5, 40.3, 36.7, 39.3, 38.9, 39.2, 41.1, 38.6… ## $ bill_depth_mm <dbl> 18.7, 17.4, 18.0, 19.3, 20.6, 17.8, 19.6, 17.6, 21.2… ## $ flipper_length_mm <int> 181, 186, 195, 193, 190, 181, 195, 182, 191, 198, 18… ## $ body_mass_g <int> 3750, 3800, 3250, 3450, 3650, 3625, 4675, 3200, 3800… ## $ sex <fct> male, female, female, female, male, female, male, fe… ``` --- ## Data Splitting w/ `rsample` Do the initial split and stratify by sex to make sure there are equal ratios of male vs female in `test` and `train`. ```r set.seed(123) penguin_split <- initial_split(penguins_df, strata = sex) penguin_split ``` ``` ## <Analysis/Assess/Total> ## <249/84/333> ``` ```r # separate the training data penguin_train <- training(penguin_split) # separate the testing data penguin_test <- testing(penguin_split) ``` --- ### Test vs Train Split into `penguin_train` and `penguin_test` and then confirm the ratios. ```r penguin_train %>% count(sex) %>% mutate(ratio = n/sum(n)) ``` ``` ## # A tibble: 2 × 3 ## sex n ratio ## <fct> <int> <dbl> ## 1 female 123 0.494 ## 2 male 126 0.506 ``` ```r penguin_test %>% count(sex) %>% mutate(ratio = n/sum(n)) ``` ``` ## # A tibble: 2 × 3 ## sex n ratio ## <fct> <int> <dbl> ## 1 female 42 0.5 ## 2 male 42 0.5 ``` --- ## Bootstraps We can also do our bootstraps very quickly! > A bootstrap sample of the training set is a sample that is the same size as the training set but is drawn with replacement ```r set.seed(123) penguin_boot <- bootstraps(penguin_train) penguin_boot ``` ``` ## # Bootstrap sampling ## # A tibble: 25 × 2 ## splits id ## <list> <chr> ## 1 <split [249/93]> Bootstrap01 ## 2 <split [249/91]> Bootstrap02 ## 3 <split [249/90]> Bootstrap03 ## 4 <split [249/91]> Bootstrap04 ## 5 <split [249/85]> Bootstrap05 ## 6 <split [249/87]> Bootstrap06 ## 7 <split [249/94]> Bootstrap07 ## 8 <split [249/88]> Bootstrap08 ## 9 <split [249/95]> Bootstrap09 ## 10 <split [249/89]> Bootstrap10 ## # … with 15 more rows ``` --- class: inverse, center, middle # Model `recipes` ![](data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/recipes/master/man/figures/logo.png) --- ## Add recipe steps with `recipes` `recipe` steps are changes we make to the dataset, including things like centering, dummy encoding, updating columns to ID-only roles, or even custom feature engineering. --- ### In `recipes` vs `dplyr`/`tidyr` * In `tidyverse`, you can do reshaping or basic cleaning * In `recipes` it's best to do statistical transformations or other steps intended for modeling -- Recommended preprocessing per model being fit in ["Tidy Modeling with R" appendix](https://www.tmwr.org/pre-proc-table.html) - **dummy**: Do qualitative predictors require a numeric encoding (e.g. via dummy variables or other methods)? - **zv**: Should columns with a single unique value be removed? - **impute**: If some predictors are missing, should they be estimated via imputation? - **decorrelate**: If there are correlated predictors, should this correlation be mitigated? This might mean filtering out predictors, using principal component analysis, or a model-based technique. - **normalize**: Should predictors be centered and scaled? - **transform**: Is it helpful to transform predictors to be more symmetric?
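As a sketch (not part of the original deck), a recipe touching several of these recommendations could look like the following; which steps you actually need depends on the model you plan to fit, and the object name `penguin_rec_extra` is just for illustration.

```r
# Example recipe steps matching the recommendations above (illustrative only;
# none of these are required for the glm/ranger models used in this deck)
penguin_rec_extra <- recipe(sex ~ ., data = penguin_train) %>%
  step_impute_median(all_numeric_predictors()) %>% # impute: fill in missing numeric values
  step_dummy(all_nominal_predictors()) %>%         # dummy: numeric encoding of qualitative predictors
  step_zv(all_predictors()) %>%                    # zv: drop zero-variance columns
  step_corr(all_numeric_predictors()) %>%          # decorrelate: filter highly correlated predictors
  step_normalize(all_numeric_predictors())         # normalize: center and scale
```

Step order matters (e.g. impute and encode before you normalize); see the `recipes` documentation for the recommended ordering.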
--- ### `usemodels` Relatively early in its package life cycle, but helps with boilerplate ```r # can use your real data to create the repeated portions usemodels::use_ranger(sex ~ ., penguin_train) ``` ``` ## ranger_recipe <- ## recipe(formula = sex ~ ., data = penguin_train) ## ## ranger_spec <- ## rand_forest(mtry = tune(), min_n = tune(), trees = 1000) %>% ## set_mode("classification") %>% ## set_engine("ranger") ## ## ranger_workflow <- ## workflow() %>% ## add_recipe(ranger_recipe) %>% ## add_model(ranger_spec) ## ## set.seed(12151) ## ranger_tune <- ## tune_grid(ranger_workflow, resamples = stop("add your rsample object"), grid = stop("add number of candidate points")) ``` --- class:inverse, center, middle # `parsnip` <img src="data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/parsnip/master/man/figures/logo.png" width="25%" /> --- ## Choose a model and start your engines! `parsnip` supplies a general modeling interface to the wide world of R models! ```r # Note that mode = "classification" is the default here anyway! glm_spec <- logistic_reg() %>% set_engine("glm") # different implementations of same "type" of algorithm ``` --- ## Recipes Again, `recipes` are used for pre-processing and feature engineering, but at a minimum they also define the model formula of outcome and predictors (i.e. `sex ~ species + bill_length_mm`) ```r penguin_rec <- recipe(sex ~ ., data = penguins_df) ``` --- class:inverse, center, middle # `workflows` <img src="data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/workflows/master/man/figures/logo.png" width="25%" /> --- ### Combine into a `workflow` We can now combine the model and recipe into a `workflow` - this allows us to define exactly what model and data are going into our `fit`/train call. ```r penguin_wf <- workflow() %>% add_recipe(penguin_rec) ``` ### What is a `workflow`? A workflow is an object that can bundle together your pre-processing, modeling, and post-processing requests. If you have a `recipe` + `parsnip` model, these can be combined into a workflow. The advantages are: * You don’t have to keep track of separate objects in your workspace. * The recipe prepping and model fitting can be executed using a single call to `fit()`. * If you have custom tuning parameter settings, these can be defined using a simpler interface when combined with `tune`. --- ### Steps so far - Build a recipe for any pre-processing - Choose and build a model - Combine them into a `workflow` --- ## Fit/train the model with `parsnip` Now we can move forward with fitting/training the model - this is really a one-liner. ```r penguin_fit_lr <- penguin_wf %>% add_model(glm_spec) %>% fit(data = penguin_train) # fit the model against the training data ```
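An optional check that isn't in the original slides: since `broom` is part of `tidymodels`, you can pull the underlying `parsnip` fit out of the trained workflow and tidy it to inspect the glm coefficients (a sketch).

```r
# Optional peek at the fitted logistic regression (illustrative, not on the original slides)
penguin_fit_lr %>%
  pull_workflow_fit() %>% # extract the parsnip model fit from the workflow
  tidy()                  # coefficient estimates, standard errors, p-values
```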
--- ## Predict outcomes with `parsnip` After the model has been trained, we can generate predictions and compare them against the held-out testing data. ```r penguin_pred_lr <- predict(penguin_fit_lr, penguin_test) %>% # Add back a "truth" column for what the actual sex was bind_cols(penguin_test %>% select(sex)) %>% # Get probabilities for the class for each observation bind_cols(predict(penguin_fit_lr, penguin_test, type = "prob")) ``` ``` ## # A tibble: 84 × 4 ## .pred_class sex .pred_female .pred_male ## <fct> <fct> <dbl> <dbl> ## 1 female female 0.597 0.403 ## 2 female female 0.928 0.0724 ## 3 female female 0.647 0.353 ## 4 male female 0.219 0.781 ## 5 male male 0.0132 0.987 ## 6 female female 0.970 0.0298 ## 7 male male 0.0000232 1.00 ## 8 female female 0.872 0.128 ## 9 female female 0.998 0.00250 ## 10 male male 0.00000253 1.00 ## # … with 74 more rows ``` --- ## Predict outcomes with `parsnip` The previous `predict() %>% bind_cols() %>% bind_cols()` code is equivalent to the below: ```r penguin_pred_lr_last <- last_fit( glm_spec, recipe(sex ~ ., data = penguins_df), split = penguin_split) penguin_pred_lr_last %>% pluck(".predictions", 1) ``` ``` ## # A tibble: 84 × 6 ## .pred_female .pred_male .row .pred_class sex .config ## <dbl> <dbl> <int> <fct> <fct> <chr> ## 1 0.597 0.403 2 female female Preprocessor1_Model1 ## 2 0.928 0.0724 3 female female Preprocessor1_Model1 ## 3 0.647 0.353 4 female female Preprocessor1_Model1 ## 4 0.219 0.781 18 male female Preprocessor1_Model1 ## 5 0.0132 0.987 25 male male Preprocessor1_Model1 ## 6 0.970 0.0298 28 female female Preprocessor1_Model1 ## 7 0.0000232 1.00 31 male male Preprocessor1_Model1 ## 8 0.872 0.128 34 female female Preprocessor1_Model1 ## 9 0.998 0.00250 38 female female Preprocessor1_Model1 ## 10 0.00000253 1.00 39 male male Preprocessor1_Model1 ## # … with 74 more rows ``` --- class: inverse, middle, center # Assessing Accuracy with `yardstick` ![](data:image/png;base64,#https://raw.githubusercontent.com/tidymodels/yardstick/master/man/figures/logo.png) --- ### Check outcomes with `yardstick` For confirming how well the model predicts, we can use `yardstick` to plot ROC curves, get AUC and collect general metrics. .small[ .pull-left[ ```r penguin_pred_lr %>% # get Area under Curve roc_auc(truth = sex, .pred_female) ``` ``` ## # A tibble: 1 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 roc_auc binary 0.938 ``` ```r penguin_pred_lr %>% # collect and report metrics metrics(truth = sex, .pred_class) ``` ``` ## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.857 ## 2 kap binary 0.714 ``` ] ] -- .pull-right[ .small[ ```r penguin_pred_lr %>% # calculate ROC curve roc_curve(truth = sex, .pred_female) %>% autoplot() ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-28-1.png" width="360" /> ] ] --- ### Note on Checking Outcomes You *could* use `last_fit()`: > This function is intended to be used after fitting a variety of models and the final tuning parameters (if any) have been finalized. The next step would be to fit using the entire training set and verify performance using the test data.
```r lr_last_fit <- last_fit(glm_spec, recipe(sex ~ ., data = penguins_df), split = penguin_split) collect_metrics(lr_last_fit) ``` ``` ## # A tibble: 2 × 4 ## .metric .estimator .estimate .config ## <chr> <chr> <dbl> <chr> ## 1 accuracy binary 0.857 Preprocessor1_Model1 ## 2 roc_auc binary 0.938 Preprocessor1_Model1 ``` --- .small[ #### **Split** ```r # Split penguin_split <- initial_split(penguins_df, strata = sex) # Split into test/train penguin_train <- training(penguin_split) penguin_test <- testing(penguin_split) ``` #### **Pre-Process & Choose a model** ```r penguin_rec <- recipe(sex ~ ., data = penguins_df) # Choose a model and an engine glm_spec <- logistic_reg(mode = "classification") %>% set_engine("glm") ``` #### **Combine into a workflow** ```r # Combine the model and recipe to the workflow penguin_wf <- workflow() %>% add_recipe(penguin_rec) # Fit/train the model penguin_fit_lr <- penguin_wf %>% add_model(glm_spec) %>% fit(data = penguin_train) ``` #### **Predict and get metrics** ```r # Get predictions penguin_pred_lr <- predict(penguin_fit_lr, penguin_test) %>% bind_cols(penguin_test %>% select(sex)) %>% # Add back a "truth" column for what the actual sex was bind_cols(predict(penguin_fit_lr, penguin_test, type = "prob")) # Get probabilities for the class for each observation # Check metrics penguin_pred_lr %>% # get Area under Curve roc_auc( truth = sex, .pred_female ) ``` ] --- ## Change the model How about a Random Forest model? Just change the model and re-run! ```r rf_mod <- rand_forest(trees = 100) %>% set_engine("ranger", importance = "impurity", # variable importance num.threads = 4) %>% # Parallelize set_mode("classification") rf_wflow <- workflow() %>% add_recipe(penguin_rec) %>% # Same recipe * add_model(rf_mod) # New model *penguin_fit_rf <- rf_wflow %>% # New workflow fit(data = penguin_train) # Fit the Random Forest # Get predictions and check metrics penguin_pred_rf <- predict(penguin_fit_rf, penguin_test) %>% bind_cols(penguin_test %>% select(sex)) %>% bind_cols(predict(penguin_fit_rf, penguin_test, type = "prob")) ``` --- ### Feature Importance .small[ ```r penguin_fit_rf %>% pull_workflow_fit() %>% vip() ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-35-1.png" width="720" /> ] --- ### Quick Model Comparison The random forest model is basically comparable to a logistic regression for this toy example. ```r penguin_pred_lr %>% # Logistic Regression predictions metrics(truth = sex, .pred_class) ``` ``` ## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.857 ## 2 kap binary 0.714 ``` ```r penguin_pred_rf %>% # Random Forest predictions metrics(truth = sex, .pred_class) ``` ``` ## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.881 ## 2 kap binary 0.762 ``` --- ### Quick Model Comparison ```r penguin_pred_lr %>% # Logistic Regression predictions conf_mat(truth = sex, .pred_class) ``` ``` ## Truth ## Prediction female male ## female 37 7 ## male 5 35 ``` ```r penguin_pred_rf %>% # Random Forest predictions conf_mat(truth = sex, .pred_class) ``` ``` ## Truth ## Prediction female male ## female 37 5 ## male 5 37 ``` --- ### Comparing Models Together .pull-left[ ```r roc_rf <- penguin_pred_rf %>% roc_curve(truth = sex, .pred_female) %>% mutate(model = "Random Forest") roc_lr <- penguin_pred_lr %>% roc_curve(truth = sex, .pred_female) %>% mutate(model = "Logistic Regression") full_plot <- bind_rows(roc_rf, roc_lr) %>% # Note that autoplot() works here!
ggplot(aes(x = 1 - specificity, y = sensitivity, color = model)) + geom_path(lwd = 1, alpha = 0.5) + geom_abline(lty = 3) + scale_color_manual( values = c("#374785", "#E98074") ) + theme_minimal() + theme(legend.position = "top", legend.title = element_blank()) ``` ] .pull-right[ ```r full_plot ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-38-1.png" width="432" /> ] --- ### Quick Re-Cap A workflow for `tidy` modeling * Split the data * Pre-Process and Choose a Model * Combine into a Workflow * Generate Predictions and Assess Model Metrics So the unified interface hopefully makes the idea of learning and applying many algorithms easier. -- `tidymodels` *really* shines when you start to go further or apply best practices like: * Resampling, Cross Validation, Bootstrapping * Model Tuning and Model Optimization * Grid Search, Iterative Search --- class: inverse, middle, center ## A Deeper Dive on Best Practices --- ### Comparing Models Previously we've just compared two models by seeing how accurate they were on our `testing` data, but.... > The test set as the data that *should* be used to conduct a proper evaluation of model performance on the **final model(s)**. This begs the question of, “How can we tell what is best if we don’t measure performance until the test set?” -- > However, we usually need to understand the effectiveness of the model *before using the test set*. - [*Tidy Modeling with R*](https://www.tmwr.org/resampling.html#resampling) --- ### Bootstrap Resampling > Resampling methods are empirical simulation systems that emulate the process of using some data for modeling and different data for evaluation. Most resampling methods are iterative, meaning that this process is repeated multiple times. > A bootstrap sample of the training set is a sample that is the same size as the training set but is drawn with replacement -- [Get Started w/ Resampling](https://www.tidymodels.org/start/resampling/) and test drive on [RStudio Cloud](https://rstudio.cloud/project/1479888). [*Tidy Modeling with R*](https://www.tmwr.org/resampling.html#resampling-methods) --- ### Bootstrap resampling ```r set.seed(123) penguin_boot <- bootstraps(penguin_train) penguin_boot ``` ``` ## # Bootstrap sampling ## # A tibble: 25 × 2 ## splits id ## <list> <chr> ## 1 <split [249/93]> Bootstrap01 ## 2 <split [249/91]> Bootstrap02 ## 3 <split [249/90]> Bootstrap03 ## 4 <split [249/91]> Bootstrap04 ## 5 <split [249/85]> Bootstrap05 ## 6 <split [249/87]> Bootstrap06 ## 7 <split [249/94]> Bootstrap07 ## 8 <split [249/88]> Bootstrap08 ## 9 <split [249/95]> Bootstrap09 ## 10 <split [249/89]> Bootstrap10 ## # … with 15 more rows ``` --- ### Recipes ```r glm_spec <- logistic_reg() %>% set_engine("glm") ``` ```r rf_spec <- rand_forest() %>% set_mode("classification") %>% set_engine("ranger") ``` --- ### Estimate Performance w/ Resampling NOTE: Fitting the model multiple times can take a while with larger models or more folds/repeats! I recommend running this as a background job in RStudio, so you don't lock up your session for the duration. 
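One way to do that (a sketch that assumes the `rstudioapi` package and an interactive RStudio session; not part of the original deck) is to put the `fit_resamples()` code in its own script and launch it as a job:

```r
# "fit-resamples.R" is a hypothetical script containing the resampling code below
rstudioapi::jobRunScript(
  path       = "fit-resamples.R",
  workingDir = ".",
  importEnv  = TRUE,         # give the job access to penguin_wf, penguin_boot, etc.
  exportEnv  = "R_GlobalEnv" # copy the job's results back into your session
)
```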
```r glm_rs <- penguin_wf %>% add_model(glm_spec) %>% fit_resamples(resamples = penguin_boot, control = control_resamples(save_pred = TRUE)) glm_rs ``` ``` ## # Resampling results ## # Bootstrap sampling ## # A tibble: 25 × 5 ## splits id .metrics .notes .predictions ## <list> <chr> <list> <list> <list> ## 1 <split [249/93]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [93 ×… ## 2 <split [249/91]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [91 ×… ## 3 <split [249/90]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [90 ×… ## 4 <split [249/91]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [91 ×… ## 5 <split [249/85]> Bootstrap05 <tibble [2 × 4]> <tibble [1 × 1]> <tibble [85 ×… ## 6 <split [249/87]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [87 ×… ## 7 <split [249/94]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [94 ×… ## 8 <split [249/88]> Bootstrap08 <tibble [2 × 4]> <tibble [1 × 1]> <tibble [88 ×… ## 9 <split [249/95]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [95 ×… ## 10 <split [249/89]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [89 ×… ## # … with 15 more rows ``` --- ### Estimate Performance Random Forest as well! ```r rf_rs <- penguin_wf %>% add_model(rf_spec) %>% fit_resamples(resamples = penguin_boot, control = control_resamples(save_pred = TRUE)) rf_rs ``` ``` ## # Resampling results ## # Bootstrap sampling ## # A tibble: 25 × 5 ## splits id .metrics .notes .predictions ## <list> <chr> <list> <list> <list> ## 1 <split [249/93]> Bootstrap01 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [93 ×… ## 2 <split [249/91]> Bootstrap02 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [91 ×… ## 3 <split [249/90]> Bootstrap03 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [90 ×… ## 4 <split [249/91]> Bootstrap04 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [91 ×… ## 5 <split [249/85]> Bootstrap05 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [85 ×… ## 6 <split [249/87]> Bootstrap06 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [87 ×… ## 7 <split [249/94]> Bootstrap07 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [94 ×… ## 8 <split [249/88]> Bootstrap08 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [88 ×… ## 9 <split [249/95]> Bootstrap09 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [95 ×… ## 10 <split [249/89]> Bootstrap10 <tibble [2 × 4]> <tibble [0 × 1]> <tibble [89 ×… ## # … with 15 more rows ``` --- ### What just happened??? We just fit a model for each resample, evaluated it against a within resample assessment set, and stored it into a single `tibble`! 
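As a rough mental model (a sketch of the idea, not `tune`'s actual internals; the function name is illustrative), each row of that tibble corresponds to something like:

```r
# What happens for ONE resample, conceptually
fit_one_resample <- function(split, wf) {
  fitted <- fit(wf, data = analysis(split))        # train on the analysis set
  preds <- predict(fitted, assessment(split), type = "prob") %>%
    bind_cols(predict(fitted, assessment(split))) %>%
    bind_cols(assessment(split) %>% select(sex))   # add the truth column
  roc_auc(preds, truth = sex, .pred_female)        # score on the assessment set
}

# e.g. fit_one_resample(penguin_boot$splits[[1]], penguin_wf %>% add_model(rf_spec))
```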
.small[ .pull-left[ ```r rf_rs %>% # grab specific columns and resamples pluck(".metrics", 1) ``` ``` ## # A tibble: 2 × 4 ## .metric .estimator .estimate .config ## <chr> <chr> <dbl> <chr> ## 1 accuracy binary 0.925 Preprocessor1_Model1 ## 2 roc_auc binary 0.974 Preprocessor1_Model1 ``` ] ] .small[ .pull-right[ ```r rf_rs %>% # grab specific columns and resamples pluck(".predictions", 10) ``` ``` ## # A tibble: 89 × 6 ## .pred_female .pred_male .row .pred_class sex .config ## <dbl> <dbl> <int> <fct> <fct> <chr> ## 1 0.929 0.0708 1 female female Preprocessor1_Model1 ## 2 0.918 0.0822 2 female female Preprocessor1_Model1 ## 3 0.953 0.0472 3 female female Preprocessor1_Model1 ## 4 0.707 0.293 9 female female Preprocessor1_Model1 ## 5 0.997 0.00262 14 female female Preprocessor1_Model1 ## 6 0.941 0.0591 18 female female Preprocessor1_Model1 ## 7 0.959 0.0410 19 female female Preprocessor1_Model1 ## 8 0.987 0.0127 20 female female Preprocessor1_Model1 ## 9 0.999 0.000622 21 female female Preprocessor1_Model1 ## 10 0.946 0.0537 24 female female Preprocessor1_Model1 ## # … with 79 more rows ``` ] ] --- ### What else can you do? .small[ .pull-left[ ```r # Summarize all metrics rf_rs %>% collect_metrics(summarize = TRUE) ``` ``` ## # A tibble: 2 × 6 ## .metric .estimator mean n std_err .config ## <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 accuracy binary 0.913 25 0.00589 Preprocessor1_Model1 ## 2 roc_auc binary 0.976 25 0.00208 Preprocessor1_Model1 ``` ] ] .pull-right[ ```r rf_rs %>% # combine ALL predictions collect_predictions() ``` .small[ ``` ## # A tibble: 2,270 × 7 ## id .pred_female .pred_male .row .pred_class sex .config ## <chr> <dbl> <dbl> <int> <fct> <fct> <chr> ## 1 Bootstrap01 0.765 0.235 2 female female Preprocessor1_M… ## 2 Bootstrap01 0.966 0.0343 3 female female Preprocessor1_M… ## 3 Bootstrap01 0.684 0.316 8 female female Preprocessor1_M… ## 4 Bootstrap01 0.712 0.288 12 female female Preprocessor1_M… ## 5 Bootstrap01 0.985 0.0150 15 female female Preprocessor1_M… ## 6 Bootstrap01 0.877 0.123 18 female female Preprocessor1_M… ## 7 Bootstrap01 0.987 0.0128 19 female female Preprocessor1_M… ## 8 Bootstrap01 0.926 0.0740 27 female female Preprocessor1_M… ## 9 Bootstrap01 0.957 0.0430 28 female female Preprocessor1_M… ## 10 Bootstrap01 0.561 0.439 29 female female Preprocessor1_M… ## # … with 2,260 more rows ``` ] ] --- ### Collect metrics First, let's show our naive model's metrics, computed against our test data. -- .small[ .pull-left[ ```r set.seed(20201024) ``` ```r # Naive Model on Testing Data rf_compare_df <- bind_rows( accuracy( penguin_pred_rf, truth = sex, .pred_class ), roc_auc( penguin_pred_rf, truth = sex, .pred_female ) ) ``` And then what our resampled metrics look like, which still leaves our test data unseen.
```r combo_plot <- rf_rs %>% collect_metrics(summarize = FALSE) %>% ggplot(aes(x = .metric, y = .estimate)) + geom_jitter(width = 0.2) + geom_boxplot(width = 0.3, alpha = 0.5) + geom_point( data = rf_compare_df, color = "red", size = 5) ``` ] ] -- .tiny[ .pull-right[ <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-52-1.png" width="432" /> ] ] --- ### Estimate Performance w/ Bootstrap resamples ```r assess_res <- collect_predictions(rf_rs) assess_res ``` ``` ## # A tibble: 2,270 × 7 ## id .pred_female .pred_male .row .pred_class sex .config ## <chr> <dbl> <dbl> <int> <fct> <fct> <chr> ## 1 Bootstrap01 0.765 0.235 2 female female Preprocessor1_M… ## 2 Bootstrap01 0.966 0.0343 3 female female Preprocessor1_M… ## 3 Bootstrap01 0.684 0.316 8 female female Preprocessor1_M… ## 4 Bootstrap01 0.712 0.288 12 female female Preprocessor1_M… ## 5 Bootstrap01 0.985 0.0150 15 female female Preprocessor1_M… ## 6 Bootstrap01 0.877 0.123 18 female female Preprocessor1_M… ## 7 Bootstrap01 0.987 0.0128 19 female female Preprocessor1_M… ## 8 Bootstrap01 0.926 0.0740 27 female female Preprocessor1_M… ## 9 Bootstrap01 0.957 0.0430 28 female female Preprocessor1_M… ## 10 Bootstrap01 0.561 0.439 29 female female Preprocessor1_M… ## # … with 2,260 more rows ``` --- class: inverse, middle, center ## Model Tuning with `tune`<br><img src='https://raw.githubusercontent.com/rstudio/hex-stickers/master/SVG/tune.svg' height="150"></img> --- ### `tune` We never adjusted our model! We just used naive models and evaluated their performance. Now, their performance was pretty decent (~90% accuracy), but could we get better? -- [Get Started with Tuning](https://www.tidymodels.org/start/tuning/) and test drive on [RStudio Cloud](https://rstudio.cloud/project/1479888) --- ### Resample + Tune We're going to use grid-search for our tuning process, and we also need to specify which hyperparameters of our random forest we want to tune. > Note: A hyperparameter is a parameter whose value is used to control the learning process - [Wikipedia](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)) .pull-left[ ```r tune_penguin_rf <- rand_forest( mtry = tune(), # add placeholder for tune trees = 100, min_n = tune() # add placeholder for tune ) %>% set_mode("classification") %>% set_engine("ranger") tune_rf_wf <- workflow() %>% add_recipe(penguin_rec) %>% add_model(tune_penguin_rf) ``` ] .small[ .pull-right[ ```r tune_rf_wf ``` ``` ## ══ Workflow ════════════════════════════════════════════════════════════════════ ## Preprocessor: Recipe ## Model: rand_forest() ## ## ── Preprocessor ──────────────────────────────────────────────────────────────── ## 0 Recipe Steps ## ## ── Model ─────────────────────────────────────────────────────────────────────── ## Random Forest Model Specification (classification) ## ## Main Arguments: ## mtry = tune() ## trees = 100 ## min_n = tune() ## ## Computational engine: ranger ``` ] ] --- ### Grid Search We'll create a grid of possible hyperparameters and then estimate how well they fit with our resamples. Note that this took about 20 min to run! I'm doing 15x models by 5x folds, where we train a model and predict outcomes each time! The beauty here is that you could run this as a background job. ```r set.seed(20210430) penguin_folds <- vfold_cv(penguin_train, v = 5) tune_res <- tune_grid( tune_rf_wf, resamples = penguin_folds, grid = 15, # 15 combos of model parameters control = control_grid(verbose = TRUE) ) ``` --- ### Grid Search Here are the results!
```r tune_res ``` ``` ## # Tuning results ## # 5-fold cross-validation ## # A tibble: 5 × 4 ## splits id .metrics .notes ## <list> <chr> <list> <list> ## 1 <split [199/50]> Fold1 <tibble [30 × 6]> <tibble [0 × 1]> ## 2 <split [199/50]> Fold2 <tibble [30 × 6]> <tibble [0 × 1]> ## 3 <split [199/50]> Fold3 <tibble [30 × 6]> <tibble [0 × 1]> ## 4 <split [199/50]> Fold4 <tibble [30 × 6]> <tibble [0 × 1]> ## 5 <split [200/49]> Fold5 <tibble [30 × 6]> <tibble [0 × 1]> ``` --- ### Check it out It's nested tibbles for the split data, the fold id, metrics, and any notes. ```r # Essentially the same as tune_res[[".metrics"]][[1]] tune_res %>% pluck(".metrics", 3) ``` ``` ## # A tibble: 30 × 6 ## mtry min_n .metric .estimator .estimate .config ## <int> <int> <chr> <chr> <dbl> <chr> ## 1 2 7 accuracy binary 0.9 Preprocessor1_Model01 ## 2 2 7 roc_auc binary 0.976 Preprocessor1_Model01 ## 3 3 16 accuracy binary 0.88 Preprocessor1_Model02 ## 4 3 16 roc_auc binary 0.975 Preprocessor1_Model02 ## 5 4 21 accuracy binary 0.9 Preprocessor1_Model03 ## 6 4 21 roc_auc binary 0.968 Preprocessor1_Model03 ## 7 2 38 accuracy binary 0.86 Preprocessor1_Model04 ## 8 2 38 roc_auc binary 0.971 Preprocessor1_Model04 ## 9 1 8 accuracy binary 0.9 Preprocessor1_Model05 ## 10 1 8 roc_auc binary 0.985 Preprocessor1_Model05 ## # … with 20 more rows ``` --- ### Check it out .small[ .pull-left[ ```r plot_tuned <- tune_res %>% collect_metrics() %>% filter(.metric == "roc_auc") %>% dplyr::select(mean, mtry:min_n) %>% pivot_longer(mtry:min_n, values_to = "value", names_to = "parameter" ) %>% ggplot(aes(value, mean, color = parameter)) + geom_point(alpha = 0.8, show.legend = FALSE) + facet_wrap(~parameter, scales = "free_x", ncol = 1) + labs(x = NULL, y = "AUC") ``` ] ] .tiny[ .pull-right[ <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-61-1.png" width="288" /> ] ] --- ### Check it out (scaling matters!) .small[ .pull-left[ ```r plot_tuned <- tune_res %>% collect_metrics() %>% filter(.metric == "roc_auc") %>% dplyr::select(mean, mtry:min_n) %>% pivot_longer(mtry:min_n, values_to = "value", names_to = "parameter" ) %>% ggplot(aes(value, mean, color = parameter)) + geom_point(alpha = 0.8, show.legend = FALSE) + facet_wrap(~parameter, scales = "free_x", ncol = 1) + labs(x = NULL, y = "AUC") ``` ] ] .small[ .pull-right[ ```r plot_tuned + scale_y_continuous(limits = c(0.75, 0.98)) ``` <img src="data:image/png;base64,#index_files/figure-html/unnamed-chunk-63-1.png" width="288" /> ] ] --- ### Finalize Here we are investigating which hyperparameters maximized ROC Area Under the Curve. ```r # Which 5x were best? show_best(tune_res, "roc_auc", n = 5) ``` ``` ## # A tibble: 5 × 8 ## mtry min_n .metric .estimator mean n std_err .config ## <int> <int> <chr> <chr> <dbl> <int> <dbl> <chr> ## 1 1 3 roc_auc binary 0.982 5 0.00437 Preprocessor1_Model12 ## 2 1 8 roc_auc binary 0.981 5 0.00540 Preprocessor1_Model05 ## 3 2 7 roc_auc binary 0.977 5 0.00609 Preprocessor1_Model01 ## 4 2 12 roc_auc binary 0.975 5 0.00701 Preprocessor1_Model07 ## 5 3 11 roc_auc binary 0.974 5 0.00601 Preprocessor1_Model14 ``` ```r # Select the best best_fit_auc <- select_best(tune_res, "roc_auc") # Select wflow for the model with best hyperparams rf_tuned <- finalize_workflow( rf_wflow, parameters = best_fit_auc ) ``` --- ### Finalize Show the outcomes! 
```r set.seed(20201024) rf_tuned_fit <- last_fit(rf_tuned, penguin_split) rf_tuned_fit %>% # tuned model metrics collect_metrics() ``` ``` ## # A tibble: 2 × 4 ## .metric .estimator .estimate .config ## <chr> <chr> <dbl> <chr> ## 1 accuracy binary 0.857 Preprocessor1_Model1 ## 2 roc_auc binary 0.928 Preprocessor1_Model1 ``` ```r rf_compare_df # naive model metrics ``` ``` ## # A tibble: 2 × 3 ## .metric .estimator .estimate ## <chr> <chr> <dbl> ## 1 accuracy binary 0.881 ## 2 roc_auc binary 0.927 ``` --- ### Addendums - Model training/fitting (or simulation) is likely to be the most time-intensive computation you do - as such, it's a good idea to run them as [**background jobs** in RStudio](https://blog.rstudio.com/2019/03/14/rstudio-1-2-jobs/) - Also can turn on verbose reporting so you know where you're at in the Cross-validation or tuning steps - `control_grid(verbose = TRUE)` --- ### Thank you * All y'all for listening in 🤠 ### Learn more * [`tidymodels`.org](https://www.tidymodels.org/learn/) has step by step guides of various complexities * Julia Silge's (a `tidymodels` maintainer) [blog](https://juliasilge.com/), [video series](https://www.youtube.com/channel/UCTTBgWyJl2HrrhQOOc710kA), or [free interactive course](https://supervised-ml-course.netlify.app/) * [Tidy Modeling with R](https://www.tmwr.org/) - get started quickly with `tidymodels` - [Introduction to Statistical Learning](https://www.statlearning.com/) - understand the math (new edition just came out!) - [Hands on Machine Learning with R](https://bradleyboehmke.github.io/HOML/) - get started quickly with modeling in R (mix of base R, `caret`, and `tidymodels`)