How to improve your nflfastR graphics

This resource is modeled after the fantastic BBC Graphics Cookbook, which is also worth checking out. The nflscrapR team (Maksim Horowitz, Ron Yurko, and Sam Ventura) made play-by-play stats easy to access, opening a deeper world of NFL analytics for reporters, bloggers and enthusiasts (and probably some NFL teams). This work has been extended in nflfastR by Sebastian Carl and Ben Baldwin. Ben Baldwin has compiled a quickstart guide to using this data. As such, this resource is not aimed at reproducing that tutorial, but at giving you some quick guides for improving the graphics you create via ggplot2. It’s easy to get started exploring the data with ggplot2, and hopefully this helps with your “publication” quality plots.

I am providing a lot of my own opinion on certain dataviz choices - everyone is allowed to make their own decisions with regard to colors, ink use, chart type - but I do hope that this resource opens your eyes to some of the art of dataviz now that you have made progress with the science.

The source code for this webpage is on GitHub if you want to take a look.

Additional Resources

If you’d rather go deeper into a textbook and ignore specific applications related to nflscrapR, check out these amazing free online resources (some available in print as well):

Title/Link | Author | Description
R for Data Science | Hadley Wickham, Garrett Grolemund | A great overview of the tidyverse; covers everything from reading data in to data manipulation/summarization, data viz, and general programming in R
SocViz | Kieran Healy | Covers exactly HOW to create a lot of different plot types in R/ggplot2
Fundamentals of Data Viz | Claus Wilke | Covers the WHY of data viz; all examples were made in R, and although the book itself contains no code, the code is available on his GitHub
BBPlot Cookbook | BBC Data Team | Intro primer to news-style graphics in ggplot2
ggplot2 cookbook | Winston Chang | Quick cookbook of ggplot2 plots
R Graph Gallery | Yan Holtz | Cookbook examples of a majority of plot types
ggplot2 Book | Hadley Wickham, Danielle Navarro | This 3rd edition of the ggplot2 book is currently under development, but also available freely online for the first time! A more technical book that should align well with either SocViz or Fundamentals of Data Viz

Useful code chunks

There are a couple features that we will use throughout these examples:

dplyr::if_else()

This allows you to turn a logical test into one of two values (a binary conversion).

The general form is if_else(condition, true, false). For example:

  • mutate(success = if_else(epa > 0, 1, 0))
  • mutate(color = if_else(posteam == "PIT", "yellow", "grey"))
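
As a minimal sketch of the two calls above, here they are applied to a tiny made-up data frame. The `plays` object and its values are invented purely for illustration and do not come from the real play-by-play data.

```r
library(dplyr)

# Toy stand-in for a few play-by-play rows (values are made up)
plays <- tibble(
  posteam = c("PIT", "BAL", "PIT", "CLE"),
  epa     = c(0.8, -0.3, -1.2, 0.1)
)

plays_flagged <- plays %>%
  mutate(
    # 1 when the play gained expected points, 0 otherwise
    success = if_else(epa > 0, 1, 0),
    # highlight one team, grey out everyone else
    color   = if_else(posteam == "PIT", "yellow", "grey")
  )

plays_flagged
```

Note that if_else() is strict about types: both the true and false values need to be of the same kind (both numbers, or both text).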

dplyr::case_when()

This allows you to essentially chain many if_else() statements in a single call.

  • The ~ separates a condition from its result: if the left side statement evaluates as TRUE, then the value on the right side is assigned.
    • The right side can be a number, text, etc
    • The left side can be a simple or complex statement, but must evaluate as TRUE/FALSE (logical)
  • The final TRUE ~ NA_character_ is basically a “catch” - if none of the other cases are met, then it will default to NA
    • In this case we use NA_character_ (the character-typed NA that ships with base R), but the catch could also simply say “nope” or revert back to some other column
    • If you want to have the right side (assignment) be a number, you’ll need NA_real_ for doubles or NA_integer_ for integers, so that all right sides have compatible types
  • Lastly, a longer case_when() is presented shortly below
pbp %>%
  mutate(
    stick_throw = case_when(
      air_yards < ydstogo ~ "Short of Sticks",
      air_yards == ydstogo ~ "At Stick",
      air_yards > ydstogo ~ "Past Stick",
      TRUE ~ NA_character_
    )
  ) %>%
  select(air_yards, ydstogo, stick_throw) %>%
  filter(!is.na(air_yards))
## # A tibble: 18,555 x 3
##    air_yards ydstogo stick_throw    
##        <dbl>   <dbl> <chr>          
##  1         1      20 Short of Sticks
##  2        11      12 Short of Sticks
##  3        13      10 Past Stick     
##  4         1      10 Short of Sticks
##  5        -9      10 Short of Sticks
##  6         0       8 Short of Sticks
##  7        15      10 Past Stick     
##  8         5      10 Short of Sticks
##  9         5       5 At Stick       
## 10         7      11 Short of Sticks
## # … with 18,545 more rows
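
For the numeric case mentioned in the notes above, a small sketch might look like the following. The `throws` data frame is made up for illustration, and the catch-all uses NA_real_ because the other right-hand sides are doubles.

```r
library(dplyr)

# Toy data: one row has a missing air_yards value
throws <- tibble(
  air_yards = c(1, 11, 13, NA, 5),
  ydstogo   = c(20, 12, 10, 10, 5)
)

throws_flagged <- throws %>%
  mutate(
    past_sticks = case_when(
      air_yards >= ydstogo ~ 1,  # at or past the sticks
      air_yards < ydstogo ~ 0,   # short of the sticks
      TRUE ~ NA_real_            # catch-all, e.g. missing air_yards
    )
  )

throws_flagged
```

The row with a missing air_yards matches neither comparison (a comparison with NA is NA, not TRUE), so it falls through to the catch-all.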

scale_color_identity()

This is useful in combination with the above example of assigning color in a plot: it tells ggplot2 to use the values in the mapped column (“yellow” or “grey”) directly as the colors, rather than treating them as categories that get default colors.
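
A rough sketch of how this fits together with the if_else() color column from earlier. The `team_epa` values are invented for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up team-level summary with a color column built via if_else()
team_epa <- tibble(
  posteam = c("PIT", "BAL", "CLE", "CIN"),
  epa     = c(0.05, 0.21, -0.02, -0.11)
) %>%
  mutate(color = if_else(posteam == "PIT", "yellow", "grey"))

p_identity <- ggplot(team_epa, aes(x = posteam, y = epa, color = color)) +
  geom_point(size = 4) +
  # use the strings in `color` as the actual colors (no legend by default)
  scale_color_identity()

p_identity
```

Without scale_color_identity(), ggplot2 would treat "yellow" and "grey" as two category labels and assign its default palette to them.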

scale_color_manual()

This allows you to specify colors of interest like scale_color_manual(values = c("red", "black"))
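
A short sketch of scale_color_manual() in use, assuming a made-up data frame with a two-level play_type column. Naming the values (rather than relying on their order) is one way to make sure each category gets the color you intend.

```r
library(ggplot2)

# Toy data; values are invented for illustration
toy_plays <- data.frame(
  play_type = c("pass", "run", "pass", "run"),
  air_yards = c(12, 0, 5, 0),
  epa       = c(0.9, -0.2, 0.3, 0.1)
)

p_manual <- ggplot(toy_plays, aes(x = air_yards, y = epa, color = play_type)) +
  geom_point(size = 3) +
  # map each category to a specific color by name
  scale_color_manual(values = c("pass" = "red", "run" = "black"))

p_manual
```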

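stats::reorder()

This allows you to reorder the levels of a variable in a ggplot by another variable. Note that reorder() ships with base R (in the stats package); the forcats equivalent is fct_reorder().

e.g. reorder(posteam, epa)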
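
To see what reorder() is doing outside of a plot, here is a minimal sketch on made-up values. It uses base R's reorder() (from the stats package), which sorts the levels by the mean of the second argument within each level, in ascending order.

```r
# Toy data; values are invented for illustration
team_epa <- data.frame(
  posteam = c("PIT", "BAL", "CLE"),
  epa     = c(0.05, 0.21, -0.02)
)

# Levels sorted from lowest to highest epa
ordered_teams <- reorder(team_epa$posteam, team_epa$epa)
levels(ordered_teams)
# "CLE" "PIT" "BAL"

# Negate the ordering variable to sort from highest to lowest
levels(reorder(team_epa$posteam, -team_epa$epa))
# "BAL" "PIT" "CLE"
```

Inside aes(), the same call (for example x = reorder(posteam, epa)) controls the order in which teams appear along the axis.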

Helpers

There are a few helpers used frequently throughout.

  • ! indicates not or negation, so x != 5 means x not equal to 5.
    • !is.na(x) indicates x is NOT NA
  • %in% means in - so x %in% c(2, 3, 4) means x matches 2, 3 OR 4
  • dplyr::between(x, left, right) - shortcut for x >= left & x <= right
  • hjust/vjust - this is typically assigned 0 through 1, and adjusts either the horizontal or vertical alignment
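
The logical helpers above can be tried out on a small made-up vector. Pay attention to how each one treats the missing value.

```r
x <- c(1, 2, NA, 5, 8)

not_five   <- x != 5                    # TRUE TRUE NA FALSE TRUE
not_na     <- !is.na(x)                 # TRUE TRUE FALSE TRUE TRUE
in_set     <- x %in% c(2, 3, 4)         # FALSE TRUE FALSE FALSE FALSE
in_between <- dplyr::between(x, 2, 5)   # FALSE TRUE NA TRUE FALSE
```

Note that %in% returns FALSE for the NA, while != and between() return NA. That difference matters inside filter(), which drops rows where the condition is NA.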

ggplot2 specs

The documentation for ggplot2 covers in great detail MANY options for minor but important customizations. I’m not reproducing it here, just pointing to it as a resource. It is definitely worth reading through; some examples of what it covers are below:

  • lines (size, color, type, join, end)
  • points (size, color, fill, stroke)
  • text (size, face)
  • justification (hjust, vjust, nudge_x, nudge_y)
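
As a rough illustration of a few of these specs on made-up data: the sketch below assumes ggplot2 3.4.0 or later, where line thickness is set with linewidth (earlier versions used size for lines as well).

```r
library(ggplot2)

# Toy weekly values, invented for illustration
weekly <- data.frame(week = 1:5, epa = c(0.10, -0.05, 0.22, 0.15, 0.30))

p_specs <- ggplot(weekly, aes(x = week, y = epa)) +
  # lines: color, type, and thickness
  geom_line(color = "grey40", linetype = "dashed", linewidth = 1) +
  # points: shape 21 has both an outline (color, stroke) and a fill
  geom_point(shape = 21, size = 4, color = "black", fill = "yellow", stroke = 1.5) +
  # text: left-aligned (hjust = 0) and nudged slightly right of each point
  geom_text(aes(label = week), hjust = 0, vjust = 0.5, nudge_x = 0.1,
            size = 4, fontface = "bold")

p_specs
```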

teamcolors package

Gives you ALL the colors for NFL teams, although nflfastR also provides colors and logos via nflfastR::teams_colors_logos.

Using teamcolors

filter(teamcolors, league == "nfl")
## # A tibble: 32 x 11
##    name  league primary secondary tertiary quaternary division location mascot
##    <chr> <chr>  <chr>   <chr>     <chr>    <chr>      <chr>    <chr>    <chr> 
##  1 Ariz… nfl    #97233f #000000   #ffb612  #a5acaf    NFC West Arizona  Cardi…
##  2 Atla… nfl    #a71930 #000000   #a5acaf  #a30d2d    NFC Sou… Atlanta  Falco…
##  3 Balt… nfl    #241773 #000000   #9e7c0c  #c60c30    AFC Nor… Baltimo… Ravens
##  4 Buff… nfl    #00338d #c60c30   #0c2e82  #d50a0a    AFC East Buffalo  Bills 
##  5 Caro… nfl    #0085ca #000000   #bfc0bf  <NA>       NFC Sou… Carolina Panth…
##  6 Chic… nfl    #0b162a #c83803   <NA>     <NA>       NFC Nor… Chicago  Bears 
##  7 Cinc… nfl    #000000 #fb4f14   #d32f1e  <NA>       AFC Nor… Cincinn… Benga…
##  8 Clev… nfl    #fb4f14 #22150c   #a5acaf  #d32f1e    AFC Nor… Clevela… Browns
##  9 Dall… nfl    #002244 #b0b7bc   #acc0c6  #a5acaf    NFC East Dallas   Cowbo…
## 10 Denv… nfl    #002244 #fb4f14   #00234c  #ff5200    AFC West Denver   Bronc…
## # … with 22 more rows, and 2 more variables: sportslogos_name <chr>, logo <chr>
# or
nflfastR::teams_colors_logos
## # A tibble: 36 x 10
##    team_abbr team_name team_id team_nick team_color team_color2 team_color3
##    <chr>     <chr>     <chr>   <chr>     <chr>      <chr>       <chr>      
##  1 ARI       Arizona … 3800    Cardinals #97233f    #000000     #ffb612    
##  2 ATL       Atlanta … 0200    Falcons   #a71930    #000000     #a5acaf    
##  3 BAL       Baltimor… 0325    Ravens    #241773    #000000     #9e7c0c    
##  4 BUF       Buffalo … 0610    Bills     #00338d    #c60c30     #0c2e82    
##  5 CAR       Carolina… 0750    Panthers  #0085ca    #000000     #bfc0bf    
##  6 CHI       Chicago … 0810    Bears     #0b162a    #c83803     #0b162a    
##  7 CIN       Cincinna… 0920    Bengals   #000000    #fb4f14     #000000    
##  8 CLE       Clevelan… 1050    Browns    #fb4f14    #22150c     #a5acaf    
##  9 DAL       Dallas C… 1200    Cowboys   #002244    #b0b7bc     #acc0c6    
## 10 DEN       Denver B… 1400    Broncos   #002244    #fb4f14     #00234c    
## # … with 26 more rows, and 3 more variables: team_color4 <chr>,
## #   team_logo_wikipedia <chr>, team_logo_espn <chr>

Please note that teamcolors lists teams by full name, so to use it with the play-by-play data you will need to “join” the teamcolors and play-by-play datasets together.

With nflfastR::teams_colors_logos, which includes the short team names (abbreviations), the join could be accomplished like so:

left_join(pbp, nflfastR::teams_colors_logos, by = c("posteam" = "team_abbr"))
## # A tibble: 48,034 x 349
##    play_id game_id old_game_id home_team away_team season_type  week posteam
##      <dbl> <chr>         <dbl> <chr>     <chr>     <chr>       <dbl> <chr>  
##  1       1 2019_0…  2019090804 MIN       ATL       REG             1 <NA>   
##  2      36 2019_0…  2019090804 MIN       ATL       REG             1 ATL    
##  3      51 2019_0…  2019090804 MIN       ATL       REG             1 ATL    
##  4      79 2019_0…  2019090804 MIN       ATL       REG             1 ATL    
##  5     100 2019_0…  2019090804 MIN       ATL       REG             1 ATL    
##  6     121 2019_0…  2019090804 MIN       ATL       REG             1 ATL    
##  7     148 2019_0…  2019090804 MIN       ATL       REG             1 MIN    
##  8     185 2019_0…  2019090804 MIN       ATL       REG             1 MIN    
##  9     214 2019_0…  2019090804 MIN       ATL       REG             1 MIN    
## 10     239 2019_0…  2019090804 MIN       ATL       REG             1 MIN    
## # … with 48,024 more rows, and 341 more variables: posteam_type <chr>,
## #   defteam <chr>, side_of_field <chr>, yardline_100 <dbl>, game_date <date>,
## #   quarter_seconds_remaining <dbl>, half_seconds_remaining <dbl>,
## #   game_seconds_remaining <dbl>, game_half <chr>, quarter_end <dbl>,
## #   drive <dbl>, sp <dbl>, qtr <dbl>, down <dbl>, goal_to_go <dbl>,
## #   time <time>, yrdln <chr>, ydstogo <dbl>, ydsnet <dbl>, desc <chr>,
## #   play_type <chr>, yards_gained <dbl>, shotgun <dbl>, no_huddle <dbl>,
## #   qb_dropback <dbl>, qb_kneel <dbl>, qb_spike <dbl>, qb_scramble <dbl>,
## #   pass_length <chr>, pass_location <chr>, air_yards <dbl>,
## #   yards_after_catch <dbl>, run_location <chr>, run_gap <chr>,
## #   field_goal_result <chr>, kick_distance <dbl>, extra_point_result <chr>,
## #   two_point_conv_result <chr>, home_timeouts_remaining <dbl>,
## #   away_timeouts_remaining <dbl>, timeout <dbl>, timeout_team <chr>,
## #   td_team <chr>, posteam_timeouts_remaining <dbl>,
## #   defteam_timeouts_remaining <dbl>, total_home_score <dbl>,
## #   total_away_score <dbl>, posteam_score <dbl>, defteam_score <dbl>,
## #   score_differential <dbl>, posteam_score_post <dbl>,
## #   defteam_score_post <dbl>, score_differential_post <dbl>,
## #   no_score_prob <dbl>, opp_fg_prob <dbl>, opp_safety_prob <dbl>,
## #   opp_td_prob <dbl>, fg_prob <dbl>, safety_prob <dbl>, td_prob <dbl>,
## #   extra_point_prob <dbl>, two_point_conversion_prob <dbl>, ep <dbl>,
## #   epa <dbl>, total_home_epa <dbl>, total_away_epa <dbl>,
## #   total_home_rush_epa <dbl>, total_away_rush_epa <dbl>,
## #   total_home_pass_epa <dbl>, total_away_pass_epa <dbl>, air_epa <dbl>,
## #   yac_epa <dbl>, comp_air_epa <dbl>, comp_yac_epa <dbl>,
## #   total_home_comp_air_epa <dbl>, total_away_comp_air_epa <dbl>,
## #   total_home_comp_yac_epa <dbl>, total_away_comp_yac_epa <dbl>,
## #   total_home_raw_air_epa <dbl>, total_away_raw_air_epa <dbl>,
## #   total_home_raw_yac_epa <dbl>, total_away_raw_yac_epa <dbl>, wp <dbl>,
## #   def_wp <dbl>, home_wp <dbl>, away_wp <dbl>, wpa <dbl>, home_wp_post <dbl>,
## #   away_wp_post <dbl>, vegas_wp <dbl>, vegas_home_wp <dbl>,
## #   total_home_rush_wpa <dbl>, total_away_rush_wpa <dbl>,
## #   total_home_pass_wpa <dbl>, total_away_pass_wpa <dbl>, air_wpa <dbl>,
## #   yac_wpa <dbl>, comp_air_wpa <dbl>, comp_yac_wpa <dbl>,
## #   total_home_comp_air_wpa <dbl>, …

You can use dplyr::left_join() to join the full team names, colors, and team logos to the play-by-play data. Without getting into the weeds TOO much, a left_join keeps every row of the play-by-play data, looks up the matching row in teams_colors_logos via the common column (posteam in one, team_abbr in the other), and then adds the additional columns from teams_colors_logos to the play-by-play data. Joins are a very important concept when trying to combine multiple datasets, and if you want to read more about the various types and their use cases check out the dplyr joins docs.
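
To make the join behavior concrete, here is a tiny sketch with made-up tables. `plays` and `team_info` are hypothetical stand-ins for the play-by-play data and teams_colors_logos; the hex colors are the ones shown in the printed output above.

```r
library(dplyr)

# Hypothetical miniature play-by-play table (one row has no posteam)
plays <- tibble(
  play_id = 1:4,
  posteam = c("ATL", "MIN", NA, "ATL")
)

# Hypothetical miniature lookup table of team colors
team_info <- tibble(
  team_abbr  = c("ATL", "MIN", "BAL"),
  team_color = c("#a71930", "#4f2683", "#241773")
)

# Keep every row of `plays`; add team_color wherever posteam matches team_abbr
joined <- left_join(plays, team_info, by = c("posteam" = "team_abbr"))

joined
```

All four plays are kept: the row with no posteam gets NA for team_color, and BAL (which has no plays in this toy table) adds no rows.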

Quick example below:

# read in data
pbp <- read_csv("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.csv.gz")
# left_join the data together
pbp_colors <- left_join(pbp, nflfastR::teams_colors_logos, by = c("posteam" = "team_abbr"))

pbp_colors %>%
  # Excludes non-plays, eg end of quarter
  filter(!is.na(posteam)) %>%
  select(posteam, team_name, team_color, team_color2, team_logo_wikipedia) %>%
  # Distinct grabs only the distinct/unique cases of column
  distinct(posteam, .keep_all = TRUE)
## # A tibble: 32 x 5
##    posteam team_name    team_color team_color2 team_logo_wikipedia              
##    <chr>   <chr>        <chr>      <chr>       <chr>                            
##  1 ATL     Atlanta Fal… #a71930    #000000     https://upload.wikimedia.org/wik…
##  2 MIN     Minnesota V… #4f2683    #ffc62f     https://upload.wikimedia.org/wik…
##  3 BAL     Baltimore R… #241773    #000000     https://upload.wikimedia.org/wik…
##  4 MIA     Miami Dolph… #008e97    #f58220     https://upload.wikimedia.org/wik…
##  5 BUF     Buffalo Bil… #00338d    #c60c30     https://upload.wikimedia.org/wik…
##  6 NYJ     New York Je… #203731    #1c2d25     https://upload.wikimedia.org/wik…
##  7 CIN     Cincinnati … #000000    #fb4f14     https://upload.wikimedia.org/wik…
##  8 SEA     Seattle Sea… #002244    #69be28     https://upload.wikimedia.org/wik…
##  9 LV      Las Vegas R… #a5acaf    #000000     https://upload.wikimedia.org/wik…
## 10 DEN     Denver Bron… #002244    #fb4f14     https://upload.wikimedia.org/wik…
## # … with 22 more rows

So we can see that posteam and team_abbr are equivalent, and that the full team name, colors, and logo are also added. I dropped the other 250+ columns for printing here, but they would be in the complete dataframe.

ggsave()

If you are going to export your graphics, it’s worth it to go through ggsave() rather than the RStudio export button.

The full docs have lots of great info but I’ll summarize it here. The basic arguments in pseudocode are below.

# dpi takes a number (or "retina", "print", "screen"), not a quoted number
ggsave("plot_name.png", plot_object,
       height = x, width = y, units = "in", dpi = 300)

A typical call of ggsave would look like the below.

ggsave("wr_epa.png", wr_epa_plot,
       height = 6, width = 8, units = "in", dpi = 350)

Arguably, the most important part is the DPI call - if you save through the export button you will typically have a low DPI (72) that has jagged edges on lines (known as aliasing), as opposed to exporting with a higher DPI which will give a higher quality appearance.

You will likely spend some time perfecting the print size of your plots, but if you use your own theme with text sized appropriately you can typically set a specific DPI and work from there.
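
If you want to try ggsave() without touching your working directory, a small sketch like this writes to a temporary folder. It uses the built-in mtcars data as a placeholder plot, and `out_file` is just an illustrative path.

```r
library(ggplot2)

# Placeholder plot built from a dataset that ships with R
demo_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point()

# Write to a temporary location; swap in your own path for real use
out_file <- file.path(tempdir(), "demo_plot.png")

ggsave(out_file, demo_plot,
       height = 6, width = 8, units = "in", dpi = 300)

file.exists(out_file)
```

At 8 x 6 inches and 300 DPI, the resulting image should be 2400 x 1800 pixels, which is a handy way to reason about the size of the exported file.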

Changing fonts

Changing fonts for graphics in R can be easy if you use a package like extrafont or showtext. You can then change font family in your theme calls or as part of your personal theme.

extrafont has an example walking through its use.

showtext has an example walking through its use.

Prep

Load all the libraries you need

There are a few packages I will use in this guide, most of them related to data viz.

library(tidyverse) # Data Cleaning, manipulation, summarization, plotting
library(gt) # beautiful tables
library(DT) # beautiful interactive tables
library(ggthemes) # custom pre-built themes
library(bbplot) # more themes
library(ggtext) # custom text color
library(teamcolors) # NFL team colors and logos
library(ggforce) # better annotations
library(ggridges) # many distributions at once
library(ggrepel) # better labels
library(ggbeeswarm) # beeswarm plots
library(extrafont) # for extra fonts

Read in the pbp data

This is taken almost verbatim from Ben’s Tutorial, but the idea is that you are adjusting the dataset to be ready for analysis. If you are interested in plays beyond pass/rush then you should probably NOT do these steps.

pbp <- read_csv("https://raw.githubusercontent.com/guga31bb/nflfastR-data/master/data/play_by_play_2019.csv.gz")

Our first data summary

This is also credited to Ben:

“Let’s look at which teams were the most pass-heavy in the first half on early downs with win probability between 20 and 80, excluding the final 2 minutes of the half when everyone is pass-happy:”

kc <- pbp %>%
  filter(wp > .20 & wp < .80 & down <= 2 & qtr <= 2 & half_seconds_remaining > 120) %>%
  group_by(posteam) %>%
  summarize(mean_pass = mean(pass), 
            plays = n()) %>%
  arrange(mean_pass)
## `summarise()` ungrouping output (override with `.groups` argument)
kc
## # A tibble: 32 x 3
##    posteam mean_pass plays
##    <chr>       <dbl> <int>
##  1 BAL         0.427   314
##  2 IND         0.432   301
##  3 MIN         0.450   302
##  4 WAS         0.454   273
##  5 TEN         0.466   343
##  6 LV          0.472   316
##  7 PIT         0.477   277
##  8 SEA         0.481   339
##  9 SF          0.482   330
## 10 CIN         0.483   261
## # … with 22 more rows

“Kansas City led the league in passing rate 2019. Fun! Let’s see what that looks like:”

ggplot(kc, aes(x = reorder(posteam, -mean_pass), y = mean_pass)) +
  geom_text(aes(label = posteam))