How to improve your nflscrapR graphics

This resource is modeled after the fantastic BBC Graphics Cookbook, which is also worth checking out. The nflscrapR team (Maksim Horowitz, Ron Yurko, and Sam Ventura) have compiled easy-to-access play-by-play stats, opening a deeper world of NFL analytics for reporters, bloggers and enthusiasts (and probably some NFL teams). Ben Baldwin has compiled a quickstart guide to using this data. As such, this resource is not aimed at reproducing that tutorial, but at giving you some quick guides for improving the graphics you create via ggplot2. It’s easy to get started exploring the data with ggplot2, and hopefully this helps with your “publication” quality plots.

I am providing a lot of my own opinion on certain dataviz choices - everyone is allowed to make their own decisions with regards to colors, ink use, chart type - but I do hope that this resource opens your eyes to some of the art of dataviz now that you have made progress with the science.

The source code for this webpage is on GitHub if you want to take a look.

Additional Resources

If you’d rather go deeper into a textbook and ignore specific applications related to nflscrapR, check out these amazing free online resources (some available in print as well):

Title/Link | Author | Description
R for Data Science | Hadley Wickham, Garrett Grolemund | A great overview of the tidyverse; covers everything from reading data in, data manipulation/summarization, data viz, and general programming in R
SocViz | Kieran Healy | Covers exactly HOW to create a lot of different plot types in R/ggplot2
Fundamentals of Data Viz | Claus Wilke | Covers the WHY of data viz; all examples are made in R, and although the book itself contains no code, the code is available on his GitHub
BBPlot Cookbook | BBC Data Team | Intro primer to news-style graphics in ggplot2
ggplot2 cookbook | Winston Chang | Quick cookbook of ggplot2 plots
R Graph Gallery | Yan Holtz | Cookbook examples of a majority of plot types
ggplot2 Book | Hadley Wickham, Danielle Navarro | This 3rd edition of the ggplot2 book is currently under development, but also available freely online for the first time! A more technical book that should align well with either SocViz or Fundamentals of Data Viz

Useful code chunks

There are a couple features that we will use throughout these examples:

dplyr::if_else()

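This converts a logical condition into one of two values: the second argument where the condition is TRUE, the third where it is FALSE.

The general form is if_else(condition, true, false). For example:

  • mutate(success = if_else(epa > 0, 1, 0))
  • mutate(color = if_else(posteam == "PIT", "yellow", "grey"))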
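As a minimal sketch of the two calls above, the snippet below runs them on a tiny hand-made tibble. The toy_plays data and its values are invented for illustration and are not real play-by-play rows.

```r
library(dplyr)

# Hypothetical toy plays (made-up values, not real play-by-play data)
toy_plays <- tibble(
  posteam = c("PIT", "SEA", "PIT", "NE"),
  epa     = c(0.8, -0.3, -1.2, 0.1)
)

toy_flags <- toy_plays %>%
  mutate(
    # 1 when the play gained expected points, 0 otherwise
    success = if_else(epa > 0, 1, 0),
    # highlight one team and mute everyone else
    color = if_else(posteam == "PIT", "yellow", "grey")
  )

toy_flags
```

Note that if_else() is stricter than base ifelse(): the true and false values must be of compatible types, which catches mistakes such as mixing a number with a string.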

dplyr::case_when()

This lets you chain many if_else() statements into a single call.

  • The ~ separates a condition from its result: if the left-side statement evaluates to TRUE, the value on the right side is assigned.
    • The right side can be a number, text, etc.
    • The left side can be a simple or complex statement, but must evaluate to TRUE/FALSE (logical)
  • The final TRUE ~ NA_character_ is basically a “catch”: if none of the other cases are met, the result defaults to NA
    • In this case we use NA_character_ (a base R constant), but the fallback could just as well be a string such as “nope” or the value of some other column
    • If you want the right side (assignment) to be a number, use NA_real_ (for doubles) or NA_integer_ (for integers) so the fallback has the same type as the other outcomes
  • Lastly, a longer case_when() is presented shortly below

pbp %>%
  mutate(
    stick_throw = case_when(
      air_yards < ydstogo ~ "Short of Sticks",
      air_yards == ydstogo ~ "At Stick",
      air_yards > ydstogo ~ "Past Stick",
      TRUE ~ NA_character_
    )
  ) %>%
  select(air_yards, ydstogo, stick_throw) %>%
  filter(!is.na(air_yards))
## # A tibble: 17,669 x 3
##    air_yards ydstogo stick_throw    
##        <dbl>   <dbl> <chr>          
##  1         8      15 Short of Sticks
##  2         4      10 Short of Sticks
##  3        -3      10 Short of Sticks
##  4        24      10 Past Stick     
##  5         1       1 At Stick       
##  6         4       8 Short of Sticks
##  7         6       4 Past Stick     
##  8        16      10 Past Stick     
##  9        -9      13 Short of Sticks
## 10         2      10 Short of Sticks
## # … with 17,659 more rows
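
For the numeric case, here is a small sketch with a numeric right-hand side and a numeric NA as the fallback. The toy_throws tibble and the throw_depth column are hypothetical names made up for this illustration.

```r
library(dplyr)

# Hypothetical toy data: the last row has a missing air_yards value
toy_throws <- tibble(
  air_yards = c(8, 1, 24, NA),
  ydstogo   = c(15, 1, 10, 10)
)

toy_depth <- toy_throws %>%
  mutate(
    # numeric outcomes, so the fallback is a numeric NA (NA_real_)
    throw_depth = case_when(
      air_yards < ydstogo  ~ -1,
      air_yards == ydstogo ~ 0,
      air_yards > ydstogo  ~ 1,
      TRUE ~ NA_real_
    )
  )

toy_depth
```

The row with a missing air_yards matches none of the three comparisons, so it falls through to the final TRUE case and receives NA.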

scale_color_identity()

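This is useful in combination with the above example of assigning a color column. It tells ggplot2 to use the values stored in that column (“yellow” or “grey”) directly as the colors, rather than treating them as categories to be mapped to a default palette.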
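
A minimal sketch of that combination, using an invented toy_teams tibble (the team values are made up, not computed from real data):

```r
library(dplyr)
library(ggplot2)

# Hypothetical team-level summary (made-up numbers)
toy_teams <- tibble(
  posteam      = c("PIT", "SEA", "NE", "KC"),
  success_rate = c(0.46, 0.44, 0.49, 0.53),
  epa_per_play = c(0.05, 0.02, 0.08, 0.20)
) %>%
  mutate(color = if_else(posteam == "PIT", "yellow", "grey"))

identity_plot <- ggplot(
  toy_teams,
  aes(x = success_rate, y = epa_per_play, color = color)
) +
  geom_point(size = 4) +
  # use the strings in the color column as the actual colors
  scale_color_identity()

identity_plot
```

Without scale_color_identity(), ggplot2 would treat “yellow” and “grey” as two category labels and assign its default palette to them.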

scale_color_manual()

This allows you to specify colors of interest like scale_color_manual(values = c("red", "black"))
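
As a short sketch, the example below maps a categorical column to two hand-picked colors. The toy_plays data is invented, and naming the values vector (pass = , run = ) is one option that ties each color to a specific category regardless of order.

```r
library(ggplot2)

# Hypothetical toy plays (made-up values)
toy_plays <- data.frame(
  play_type    = c("pass", "run", "pass", "run"),
  yards_gained = c(12, 3, 7, 5),
  epa          = c(0.9, -0.2, 0.4, 0.1)
)

manual_plot <- ggplot(
  toy_plays,
  aes(x = yards_gained, y = epa, color = play_type)
) +
  geom_point(size = 3) +
  # a named vector pins each color to a category
  scale_color_manual(values = c(pass = "red", run = "black"))

manual_plot
```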

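stats::reorder()

This allows you to reorder the levels of a categorical variable (for example, the order of teams along an axis) by another variable. reorder() ships with base R in the stats package; forcats::fct_reorder() is the tidyverse equivalent.

e.g. reorder(posteam, epa)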
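
A quick sketch of what reorder() does to the level order, using an invented three-team data frame:

```r
library(ggplot2)

# Hypothetical team-level EPA (made-up values)
toy_epa <- data.frame(
  posteam = c("SEA", "KC", "NE"),
  epa     = c(0.02, 0.20, 0.08)
)

# Levels sorted by ascending epa
asc_levels <- levels(reorder(toy_epa$posteam, toy_epa$epa))

# Negating the sorting variable flips the order to descending
desc_levels <- levels(reorder(toy_epa$posteam, -toy_epa$epa))

# Inside a plot, the reordering happens directly in aes()
reorder_plot <- ggplot(toy_epa, aes(x = reorder(posteam, -epa), y = epa)) +
  geom_col()
```

Here asc_levels runs from the lowest to the highest EPA team, and desc_levels is the reverse, which is usually what you want for a ranked bar chart.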

Helpers

There are a few helpers used frequently throughout.

  • ! indicates not or negation, so x != 5 means x not equal to 5.
    • !is.na(x) indicates x is NOT NA
  • %in% means in - so x %in% c(2, 3, 4) means x matches 2, 3 OR 4
  • dplyr::between(x, left, right) - shortcut for x >= left & x <= right
  • hjust/vjust - this is typically assigned 0 through 1, and adjusts either the horizontal or vertical alignment
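
The filtering helpers above can be combined in a single filter() call. This is a sketch on an invented toy_situations tibble, not real play-by-play data:

```r
library(dplyr)

# Hypothetical toy rows (made-up values); the third row mimics a non-play
toy_situations <- tibble(
  posteam = c("PIT", "SEA", NA, "NE", "KC"),
  down    = c(1, 3, NA, 2, 4),
  wp      = c(0.50, 0.90, 0.40, 0.25, 0.70)
)

toy_kept <- toy_situations %>%
  filter(
    !is.na(posteam),                     # drop rows with no possession team
    posteam %in% c("PIT", "NE", "KC"),   # keep only these teams
    between(wp, 0.2, 0.8),               # same as wp >= 0.2 & wp <= 0.8
    down != 4                            # everything except fourth down
  )

toy_kept
```

Conditions separated by commas inside filter() are combined with AND, so a row must pass all four checks to be kept.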

ggplot2 specs

The documentation for ggplot2 covers, in great detail, MANY options for minor but important customizations. I’m not reproducing it here, but linking it as a resource. It is definitely worth reading through; some of the aesthetics it covers:

  • lines (size, which is named linewidth in ggplot2 3.4.0 and later; color, type, join, end)
  • points (size, color, fill, stroke)
  • text (size, face)
  • justification (hjust, vjust, nudge_x, nudge_y)
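
To make a few of these concrete, here is a sketch that sets several of them as fixed values on an invented weekly data frame. The specific values are arbitrary choices for illustration.

```r
library(ggplot2)

# Hypothetical weekly EPA values (made-up numbers)
toy_weeks <- data.frame(
  week = 1:5,
  epa  = c(0.05, 0.12, -0.03, 0.20, 0.10)
)

spec_plot <- ggplot(toy_weeks, aes(x = week, y = epa)) +
  # line specs: color, type, and end style
  geom_line(color = "grey40", linetype = "dashed", lineend = "round") +
  # point specs: shape 21 has both an outline (color, stroke) and a fill
  geom_point(shape = 21, size = 4, fill = "white", color = "black", stroke = 1.5) +
  # text specs: size and face, left-aligned and nudged right of each point
  geom_text(aes(label = epa), size = 3, fontface = "bold", hjust = 0, nudge_x = 0.1)

spec_plot
```

Because these are set outside aes(), they apply to every line, point, or label in the layer rather than varying with the data.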

teamcolors package

Gives you ALL the colors for NFL teams

Using teamcolors

filter(teamcolors, league == "nfl")
## # A tibble: 32 x 8
##    name    league primary secondary tertiary quaternary division logo      
##    <chr>   <chr>  <chr>   <chr>     <chr>    <chr>      <chr>    <chr>     
##  1 Arizon… nfl    #97233f #000000   #ffb612  #a5acaf    NFC West http://co…
##  2 Atlant… nfl    #a71930 #000000   #a5acaf  #a30d2d    NFC Sou… http://co…
##  3 Baltim… nfl    #241773 #000000   #9e7c0c  #c60c30    AFC Nor… http://co…
##  4 Buffal… nfl    #00338d #c60c30   #0c2e82  #d50a0a    AFC East http://co…
##  5 Caroli… nfl    #0085ca #000000   #bfc0bf  #0085ca    NFC Sou… http://co…
##  6 Chicag… nfl    #0b162a #c83803   #0b162a  #c83803    NFC Nor… http://co…
##  7 Cincin… nfl    #000000 #fb4f14   #000000  #d32f1e    AFC Nor… http://co…
##  8 Clevel… nfl    #fb4f14 #22150c   #a5acaf  #d32f1e    AFC Nor… http://co…
##  9 Dallas… nfl    #002244 #b0b7bc   #acc0c6  #a5acaf    NFC East http://co…
## 10 Denver… nfl    #002244 #fb4f14   #00234c  #ff5200    AFC West http://co…
## # … with 22 more rows

Please note that teams are listed by full name, while the play-by-play data uses team abbreviations, so to use the two together you will need to “join” the teamcolors and play-by-play datasets.

The abbreviations can be added as a new column like so:

nfl_colors <- teamcolors %>%
  filter(league == "nfl") %>%
  mutate(
    team_abb = case_when(
      name == "Arizona Cardinals" ~ "ARI",
      name == "Atlanta Falcons" ~ "ATL",
      name == "Baltimore Ravens" ~ "BAL",
      name == "Buffalo Bills" ~ "BUF",
      name == "Carolina Panthers" ~ "CAR",
      name == "Chicago Bears" ~ "CHI",
      name == "Cincinnati Bengals" ~ "CIN",
      name == "Cleveland Browns" ~ "CLE",
      name == "Dallas Cowboys" ~ "DAL",
      name == "Denver Broncos" ~ "DEN",
      name == "Detroit Lions" ~ "DET",
      name == "Green Bay Packers" ~ "GB",
      name == "Houston Texans" ~ "HOU",
      name == "Indianapolis Colts" ~ "IND",
      name == "Jacksonville Jaguars" ~ "JAX",
      name == "Kansas City Chiefs" ~ "KC",
      name == "Los Angeles Rams" ~ "LA",
      name == "Los Angeles Chargers" ~ "LAC",
      name == "Miami Dolphins" ~ "MIA",
      name == "Minnesota Vikings" ~ "MIN",
      name == "New England Patriots" ~ "NE",
      name == "New Orleans Saints" ~ "NO",
      name == "New York Giants" ~ "NYG",
      name == "New York Jets" ~ "NYJ",
      name == "Oakland Raiders" ~ "OAK",
      name == "Philadelphia Eagles" ~ "PHI",
      name == "Pittsburgh Steelers" ~ "PIT",
      name == "Seattle Seahawks" ~ "SEA",
      name == "San Francisco 49ers" ~ "SF",
      name == "Tampa Bay Buccaneers" ~ "TB",
      name == "Tennessee Titans" ~ "TEN",
      name == "Washington Redskins" ~ "WAS",
      TRUE ~ NA_character_
    ),
    posteam = team_abb
  )

You could then use dplyr::left_join() to join the full names, colors, and team logos to the play-by-play data. Without getting into the weeds TOO much, a left_join keeps every row of the play-by-play data and, wherever the common column (posteam) has a matching value in nfl_colors, adds the additional columns from nfl_colors to that row; rows without a match get NA in the new columns. Joins are a very important concept when trying to combine multiple datasets, and if you want to read more about the various types and their use cases check out the dplyr joins docs.

Quick example below:

# read in data
pbp <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2018.csv")
# left_join the data together
pbp_colors <- left_join(pbp, nfl_colors, by = c("posteam"))

pbp_colors %>%
  # Excludes non-plays, eg end of quarter
  filter(!is.na(posteam)) %>%
  select(posteam, team_abb, name, primary, secondary, logo) %>%
  # Distinct grabs only the distinct/unique cases of column
  distinct(posteam, .keep_all = TRUE)
## # A tibble: 32 x 6
##    posteam team_abb name       primary secondary logo                      
##    <chr>   <chr>    <chr>      <chr>   <chr>     <chr>                     
##  1 ATL     ATL      Atlanta F… #a71930 #000000   http://content.sportslogo…
##  2 PHI     PHI      Philadelp… #004953 #a5acaf   http://content.sportslogo…
##  3 BAL     BAL      Baltimore… #241773 #000000   http://content.sportslogo…
##  4 BUF     BUF      Buffalo B… #00338d #c60c30   http://content.sportslogo…
##  5 JAX     JAX      Jacksonvi… #000000 #006778   http://content.sportslogo…
##  6 NYG     NYG      New York … #0b2265 #a71930   http://content.sportslogo…
##  7 NO      NO       New Orlea… #9f8958 #000000   http://content.sportslogo…
##  8 TB      TB       Tampa Bay… #d50a0a #34302b   http://content.sportslogo…
##  9 NE      NE       New Engla… #002244 #c60c30   http://content.sportslogo…
## 10 HOU     HOU      Houston T… #03202f #a71930   http://content.sportslogo…
## # … with 22 more rows

So we can see that the posteam and team_abb are equivalent, where the full team name, colors, and logo are also added. I dropped the other 250+ columns for printing here, but they would be in the complete dataframe.
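
The same join behavior can be seen on a much smaller scale. The sketch below uses two invented tibbles, and the hex codes are placeholders rather than real team colors.

```r
library(dplyr)

# Hypothetical play-level data; the last row mimics a non-play with no posteam
toy_pbp <- tibble(
  posteam = c("PIT", "SEA", "PIT", NA),
  epa     = c(0.8, -0.3, -1.2, NA)
)

# Hypothetical lookup table (placeholder hex codes, not real team colors)
toy_colors <- tibble(
  posteam = c("PIT", "SEA"),
  primary = c("#111111", "#222222")
)

toy_joined <- left_join(toy_pbp, toy_colors, by = "posteam")

toy_joined
```

Every row of toy_pbp survives the join, the lookup value is repeated for each matching play, and the row with no match gets NA in the primary column.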

ggsave()

If you are going to export your graphics, it’s worth it to go through ggsave() rather than the RStudio export button.

The full docs have lots of great info but I’ll summarize it here. The basic arguments in pseudocode are below.

ggsave("plot_name.png", plot_object,
       height = x, width = y, units = "in", dpi = 300)

A typical call of ggsave would look like the below.

ggsave("wr_epa.png", wr_epa_plot,
       height = 6, width = 8, units = "in", dpi = 350)

Arguably, the most important part is the DPI call - if you save through the export button you will typically have a low DPI (72) that has jagged edges on lines, as opposed to exporting with a higher DPI.

You will likely spend some time perfecting the print size of your plots, but if you use your own theme with text sized appropriately you can typically set a specific DPI and work from there.
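
As a self-contained sketch, the snippet below saves a throwaway plot to a temporary file. The toy_plot object and the file name are made up; in practice you would pass your own plot and a path in your project.

```r
library(ggplot2)

# Hypothetical throwaway plot
toy_plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x = x, y = y)) +
  geom_line()

# Write to a temporary folder so the example leaves no files behind
out_file <- file.path(tempdir(), "toy_plot.png")

# dpi is passed as a number: 8 x 6 inches at 300 DPI is 2400 x 1800 pixels
ggsave(out_file, toy_plot, height = 6, width = 8, units = "in", dpi = 300)

file.exists(out_file)
```

Fixing the width, height, and DPI in code means the exported image has the same pixel dimensions every time, which the export button does not guarantee.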

Changing fonts

Changing fonts for graphics in R can be easy if you use a package like extrafont or showtext. You can then change font family in your theme calls or as part of your personal theme.

extrafont has an example walking through its use.

showtext has an example walking through its use.
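
Once a font is registered, switching to it is only a matter of naming the family in the theme. The sketch below uses the built-in "serif" and "mono" families as stand-ins so that it runs without any font setup; with extrafont or showtext you would substitute the family name you registered.

```r
library(ggplot2)

font_plot <- ggplot(data.frame(x = 1:3, y = c(2, 4, 3)), aes(x = x, y = y)) +
  geom_point() +
  labs(title = "A title in a different font") +
  # base_family sets the font for all text in the theme
  theme_minimal(base_family = "serif") +
  # individual elements can still be overridden
  theme(plot.title = element_text(family = "mono", face = "bold"))

font_plot
```

Setting base_family once in a personal theme keeps fonts consistent across every plot that uses it.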

Prep

Load all the libraries you need

There are a few packages I will use in this guide, most of them related to data viz.

library(tidyverse) # Data Cleaning, manipulation, summarization, plotting
library(gt) # beautiful tables
library(DT) # beautiful interactive tables
library(ggthemes) # custom pre-built themes
library(bbplot) # more themes
library(ggtext) # custom text color
library(teamcolors) # NFL team colors and logos
library(ggforce) # better annotations
library(ggridges) # many distributions at once
library(ggrepel) # better labels
library(ggbeeswarm) # beeswarm plots
library(extrafont) # for extra fonts

Read in the pbp data

This is taken almost verbatim from Ben’s Tutorial, but the idea is that you are adjusting the dataset to be ready for analysis. If you are interested in plays beyond pass/rush then you should probably NOT do these steps.

pbp <- read_csv("https://raw.githubusercontent.com/ryurko/nflscrapR-data/master/play_by_play_data/regular_season/reg_pbp_2018.csv")
# clean up the data for further analysis
pbp_rp <- pbp %>%
  # grab only penalties, pass, and run plays
  filter(!is.na(epa), play_type == "no_play" | play_type == "pass" | play_type == "run") %>%
  # create pass, rush and success columns
  mutate(
    pass = if_else(str_detect(desc, "(pass)|(sacked)|(scramble)"), 1, 0),
    rush = if_else(str_detect(desc, "(left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)") & pass == 0, 1, 0),
    success = ifelse(epa > 0, 1, 0)
  ) %>%
  # filter to only pass or rush plays
  filter(pass == 1 | rush == 1) %>%
  mutate(
    passer_player_name = ifelse(play_type == "no_play" & pass == 1,
      str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"),
      passer_player_name
    ),
    receiver_player_name = ifelse(play_type == "no_play" & str_detect(desc, "pass"),
      str_extract(
        desc,
        "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"
      ),
      receiver_player_name
    ),
    rusher_player_name = ifelse(play_type == "no_play" & rush == 1,
      str_extract(desc, "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((left end)|(left tackle)|(left guard)|(up the middle)|(right guard)|(right tackle)|(right end)))"),
      rusher_player_name
    )
  ) %>%
  mutate(
    name = if_else(!is.na(passer_player_name), passer_player_name, rusher_player_name),
    rusher = rusher_player_name,
    receiver = receiver_player_name,
    play = 1
  )
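
To see what the name-extraction patterns in this cleaning step do, here is a sketch that applies the passer and receiver patterns from the code above to a single description string. The toy_desc string is invented in the style of the desc column and is not a real row from the data.

```r
library(stringr)

# Hypothetical description, written in the style of the `desc` column
toy_desc <- "(14:20) M.Ryan pass short right to J.Jones to PHI 44 for 10 yards"

# Patterns copied from the cleaning code above
passer_pattern <- "(?<=\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?(?=\\s((pass)|(sack)|(scramble)))"
receiver_pattern <- "(?<=to\\s)[A-Z][a-z]*\\.\\s?[A-Z][A-z]+(\\s(I{2,3})|(IV))?"

# The same check used to build the pass column
is_pass <- str_detect(toy_desc, "(pass)|(sacked)|(scramble)")

# A name of the form Initial.Lastname that is followed by " pass"
toy_passer <- str_extract(toy_desc, passer_pattern)

# The first name of that form that comes right after "to "
toy_receiver <- str_extract(toy_desc, receiver_pattern)

c(passer = toy_passer, receiver = toy_receiver)
```

The lookbehind (?<=...) and lookahead (?=...) parts only check the surrounding text, so the extracted value is the player name alone.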

Our first data summary

This is also credited to Ben:

“Let’s look at which teams were the most pass-heavy in the first half on early downs with win probability between 20 and 80, excluding the final 2 minutes of the half when everyone is pass-happy:”

schotty <- pbp_rp %>%
  filter(wp > .20 & wp < .80 & down <= 2 & qtr <= 2 & half_seconds_remaining > 120) %>%
  group_by(posteam) %>%
  summarize(mean_pass = mean(pass), 
            plays = n()) %>%
  arrange(mean_pass)

schotty
## # A tibble: 32 x 3
##    posteam mean_pass plays
##    <chr>       <dbl> <int>
##  1 SEA         0.369   320
##  2 JAX         0.435   276
##  3 TEN         0.441   263
##  4 BUF         0.452   219
##  5 BAL         0.458   299
##  6 ARI         0.466   236
##  7 NYJ         0.473   256
##  8 DET         0.482   299
##  9 WAS         0.485   239
## 10 CAR         0.491   281
## # … with 22 more rows

“The Seahawks were playing a different sport in 2018. Fun! Let’s see what that looks like:”

ggplot(schotty, aes(x = reorder(posteam,-mean_pass), y = mean_pass)) +
        geom_text(aes(label = posteam))