Scraping Complex Tables from PDFs with PDF Tools

The goal of this is to provide a guide to extracting irregularly formatted tables from PDFs.

Load Libraries

We’ll use ROpenSci’s pdftools package along with several tidyverse packages: - stringr - text manipulation - dplyr - general data manipulation - tidyr - data cleaning - purrr - repeated application of a function

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──

## ✓ ggplot2 3.3.0           ✓ purrr   0.3.3      
## ✓ tibble  3.0.0.9000      ✓ dplyr   0.8.99.9002
## ✓ tidyr   1.0.2           ✓ stringr 1.4.0      
## ✓ readr   1.3.1           ✓ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(pdftools)

PDFs

The PDFs for this guide come from Alcohol and Tobacco Tax and Trade Bureau. We’ll use the 2011-2014 data for this example (84 total PDFs). For the purpose of today the files have already been downloaded, but I used the following script.

# General function for download
download_monthly_stats_pdf <- function(year){
  
  message(paste0("Downloading ", year))
  
  # The general format is yearmonth like 201101 for Jan 2011.
  month_in <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
  
  year_vec <- rep(year, 12)
  
  url_build <- function(year_vec, month_in){
      glue::glue("https://www.ttb.gov/images/pdfs/statistics/{year}/{year}{month_in}beer.pdf")
  }
  
  # output to the pdfs folder
  download_monthly_pdf <- function(year, month, url_in){
    download.file(
      url = url_in,
      destfile = glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")
      )
  }
  
  # build all the input urls and attach to an input dataframe
  full_df <- tibble(year = year_vec, month = month_in) %>% 
    mutate(url_in = pmap_chr(.l = list(year_vec, month_in), .f = url_build)) 
  
  # The pwalk here takes all 3 inputs and applies them to download_monthly_pdf function
  pwalk(full_df, .f = download_monthly_pdf)
  
}

We could apply that function to all the years of interest with another purrr::walk() call. This will run download_monthly_stats_pdf() for 2011, 2012, 2013, and 2014.

walk(c(2011:2014), download_monthly_stats_pdf)

List File Names

# list all the files we have downloaded so far
all_files <- list.files("pdfs")

length(all_files)

## [1] 48

We have 48 PDFs, as expected - 12 months x 4 years = 48!

Now let’s take a peek inside one of the PDFs.

Raw PDFs

When we run pdftools::pdf_text() we can see a decently formatted table. The main issue to consider is that there is a lot erroneous header descriptions, and there are unequal spacing between “columns” in the table. Importantly, each line of the PDF is separated by a newline \n. This is key to our strategy for pulling out individual lines.

pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf")

## [1] "                                                                                                                  Report Date:\n                                              DEPARTMENT OF THE TREASURY                                          30-MAR-2011\n                                    ALCOHOL AND TOBACCO TAX AND TRADE BUREAU\n                                                                                                                  Report Symbol:\n                                                    STATISTICAL REPORT - BEER                                     TTB S 5130-01-2011\n                                                       Reporting Period: January 2011                             Page: 1 of 1\n                                                                                                   Current Year         Prior Year\n                                                                                    Prior Year      Cumulative         Cumulative\nMANUFACTURE OF BEER                                      Current Month          Current Month      Year to Date        Year to Date\nProduction                                                    14,981,472               15,012,331     14,981,472           15,012,331\nRemovals\nTaxable ($7.00/$18.00 per barrel)\n   In bottles and cans                                        11,571,819               11,908,922     11,571,819           11,908,922\n   In barrels and kegs                                          1,245,125                1,245,143      1,245,125            1,245,143\n   Tax Determined, Premises Use                                     5,989                    5,267          5,989                5,267\n       Sub Total Taxable                                      12,822,933               13,159,332     12,822,933           13,159,332\nTax-free\n   For export                                                     264,669                  224,066        264,669              224,066\n   For vessels and aircraft                                             0                        0              0                    0\n   Consumed on brewery premises                                       886                      913            886                  913\n       Sub Total Tax-Free                                         265,555                  224,979        265,555              224,979\n   Total Removals                                             13,088,488               13,384,311     13,088,488           13,384,311\nStocks On Hand end-of-month:                                    9,896,961                9,993,268      9,896,961            9,993,268\nMATERIALS USED AT BREWERIES\n   Malt and malt products                                    322,480,722             330,304,432     322,480,722          330,304,432\n   Corn and corn products                                     58,632,672               56,705,162     58,632,672           56,705,162\n   Rice and rice products                                    108,112,318               59,701,345    108,112,318           59,701,345\n   Barley and barley products                                   4,705,175                3,668,374      4,705,175            3,668,374\n   Wheat and wheat products                                     1,210,137                1,409,685      1,210,137            1,409,685\n      Total Grain products                                   495,141,024             451,788,998     495,141,024          451,788,998\n   Sugar and syrups                                           73,793,509               47,308,358     73,793,509           47,308,358\n   Hops (dry)                                                   6,059,066                4,765,924      6,059,066            4,765,924\n   Hops (used as extracts)                                        296,605                  271,405        296,605              271,405\n   Other                                                        7,972,930              10,537,742       7,972,930          10,537,742\n       Total Non-Grain products                               88,122,110               62,883,429     88,122,110           62,883,429\nTotal Used                                                   583,263,134             514,672,427     583,263,134          514,672,427\n         296,605 Pounds of hops is equivalent to           212,541   pounds of extract JAN 2011\n         271,405 Pounds of hops is equivalent to           101,087   pounds of extract JAN 2010\nNOTE: Changes in figures from prior reports could be due to amended reports being filed.\n        This data is not final and may need to be amended.\nhttp://www.ttb.gov\n"

Split by row

We can use stringr::str_split() to separate the text at each of the \n newlines. This generates a list of character strings, we call unlist() to extract to a vector. We now have a nicely separated vector of character strings, where each row is a new string.

raw_text <- pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf") %>% 
  str_split("\n") %>% 
  unlist()

raw_text

##  [1] "                                                                                                                  Report Date:"        
##  [2] "                                              DEPARTMENT OF THE TREASURY                                          30-MAR-2011"         
##  [3] "                                    ALCOHOL AND TOBACCO TAX AND TRADE BUREAU"                                                          
##  [4] "                                                                                                                  Report Symbol:"      
##  [5] "                                                    STATISTICAL REPORT - BEER                                     TTB S 5130-01-2011"  
##  [6] "                                                       Reporting Period: January 2011                             Page: 1 of 1"        
##  [7] "                                                                                                   Current Year         Prior Year"    
##  [8] "                                                                                    Prior Year      Cumulative         Cumulative"     
##  [9] "MANUFACTURE OF BEER                                      Current Month          Current Month      Year to Date        Year to Date"   
## [10] "Production                                                    14,981,472               15,012,331     14,981,472           15,012,331" 
## [11] "Removals"                                                                                                                              
## [12] "Taxable ($7.00/$18.00 per barrel)"                                                                                                     
## [13] "   In bottles and cans                                        11,571,819               11,908,922     11,571,819           11,908,922" 
## [14] "   In barrels and kegs                                          1,245,125                1,245,143      1,245,125            1,245,143"
## [15] "   Tax Determined, Premises Use                                     5,989                    5,267          5,989                5,267"
## [16] "       Sub Total Taxable                                      12,822,933               13,159,332     12,822,933           13,159,332" 
## [17] "Tax-free"                                                                                                                              
## [18] "   For export                                                     264,669                  224,066        264,669              224,066"
## [19] "   For vessels and aircraft                                             0                        0              0                    0"
## [20] "   Consumed on brewery premises                                       886                      913            886                  913"
## [21] "       Sub Total Tax-Free                                         265,555                  224,979        265,555              224,979"
## [22] "   Total Removals                                             13,088,488               13,384,311     13,088,488           13,384,311" 
## [23] "Stocks On Hand end-of-month:                                    9,896,961                9,993,268      9,896,961            9,993,268"
## [24] "MATERIALS USED AT BREWERIES"                                                                                                           
## [25] "   Malt and malt products                                    322,480,722             330,304,432     322,480,722          330,304,432" 
## [26] "   Corn and corn products                                     58,632,672               56,705,162     58,632,672           56,705,162" 
## [27] "   Rice and rice products                                    108,112,318               59,701,345    108,112,318           59,701,345" 
## [28] "   Barley and barley products                                   4,705,175                3,668,374      4,705,175            3,668,374"
## [29] "   Wheat and wheat products                                     1,210,137                1,409,685      1,210,137            1,409,685"
## [30] "      Total Grain products                                   495,141,024             451,788,998     495,141,024          451,788,998" 
## [31] "   Sugar and syrups                                           73,793,509               47,308,358     73,793,509           47,308,358" 
## [32] "   Hops (dry)                                                   6,059,066                4,765,924      6,059,066            4,765,924"
## [33] "   Hops (used as extracts)                                        296,605                  271,405        296,605              271,405"
## [34] "   Other                                                        7,972,930              10,537,742       7,972,930          10,537,742" 
## [35] "       Total Non-Grain products                               88,122,110               62,883,429     88,122,110           62,883,429" 
## [36] "Total Used                                                   583,263,134             514,672,427     583,263,134          514,672,427" 
## [37] "         296,605 Pounds of hops is equivalent to           212,541   pounds of extract JAN 2011"                                       
## [38] "         271,405 Pounds of hops is equivalent to           101,087   pounds of extract JAN 2010"                                       
## [39] "NOTE: Changes in figures from prior reports could be due to amended reports being filed."                                              
## [40] "        This data is not final and may need to be amended."                                                                            
## [41] "http://www.ttb.gov"                                                                                                                    
## [42] ""

Build Table

Now that we have the data split into a vector we can start finding “rows” to drop. We can see that the 9th string is actually the column titles, and the table ends at the 36th string. However, this could change according to which PDF we are looking at, so rather than going by position we can use stringr::str_which() to match a logical with matched text.

# Start of table - column names
raw_text[9]

## [1] "MANUFACTURE OF BEER                                      Current Month          Current Month      Year to Date        Year to Date"

# End of table - last value
raw_text[36]

## [1] "Total Used                                                   583,263,134             514,672,427     583,263,134          514,672,427"

We get the same “rows” with our matching str_which().

# find start of table
stringr::str_which(raw_text, "MANUFACTURE OF BEER")

## [1] 9

# find end of table
stringr::str_which(raw_text, "Total Used")

## [1] 36

Let’s actually assign this now, rather than just printing. We can also remove leading/trailing whitespace with stringr::str_trim(). When we look at table_trimmed we can “see” a group of text strings that much closer resemble a table!

table_start <- stringr::str_which(raw_text, "MANUFACTURE OF BEER")
  
# End of table (drop all the asterisks and the other descriptors)
table_end <- stringr::str_which(raw_text, "Total Used")
  
# Trim the table to the start/end and drop whitespace at each line
table_trimmed <- raw_text[table_start:table_end] %>% 
  str_trim()

table_trimmed

##  [1] "MANUFACTURE OF BEER                                      Current Month          Current Month      Year to Date        Year to Date"   
##  [2] "Production                                                    14,981,472               15,012,331     14,981,472           15,012,331" 
##  [3] "Removals"                                                                                                                              
##  [4] "Taxable ($7.00/$18.00 per barrel)"                                                                                                     
##  [5] "In bottles and cans                                        11,571,819               11,908,922     11,571,819           11,908,922"    
##  [6] "In barrels and kegs                                          1,245,125                1,245,143      1,245,125            1,245,143"   
##  [7] "Tax Determined, Premises Use                                     5,989                    5,267          5,989                5,267"   
##  [8] "Sub Total Taxable                                      12,822,933               13,159,332     12,822,933           13,159,332"        
##  [9] "Tax-free"                                                                                                                              
## [10] "For export                                                     264,669                  224,066        264,669              224,066"   
## [11] "For vessels and aircraft                                             0                        0              0                    0"   
## [12] "Consumed on brewery premises                                       886                      913            886                  913"   
## [13] "Sub Total Tax-Free                                         265,555                  224,979        265,555              224,979"       
## [14] "Total Removals                                             13,088,488               13,384,311     13,088,488           13,384,311"    
## [15] "Stocks On Hand end-of-month:                                    9,896,961                9,993,268      9,896,961            9,993,268"
## [16] "MATERIALS USED AT BREWERIES"                                                                                                           
## [17] "Malt and malt products                                    322,480,722             330,304,432     322,480,722          330,304,432"    
## [18] "Corn and corn products                                     58,632,672               56,705,162     58,632,672           56,705,162"    
## [19] "Rice and rice products                                    108,112,318               59,701,345    108,112,318           59,701,345"    
## [20] "Barley and barley products                                   4,705,175                3,668,374      4,705,175            3,668,374"   
## [21] "Wheat and wheat products                                     1,210,137                1,409,685      1,210,137            1,409,685"   
## [22] "Total Grain products                                   495,141,024             451,788,998     495,141,024          451,788,998"       
## [23] "Sugar and syrups                                           73,793,509               47,308,358     73,793,509           47,308,358"    
## [24] "Hops (dry)                                                   6,059,066                4,765,924      6,059,066            4,765,924"   
## [25] "Hops (used as extracts)                                        296,605                  271,405        296,605              271,405"   
## [26] "Other                                                        7,972,930              10,537,742       7,972,930          10,537,742"    
## [27] "Total Non-Grain products                               88,122,110               62,883,429     88,122,110           62,883,429"        
## [28] "Total Used                                                   583,263,134             514,672,427     583,263,134          514,672,427"

Remove all the extra whitespace

Next we need to remove all the huge whitespaces from between columns. The regular expression (regex) of "\\s{2,}" matches whitespaces of 2 or more. If we use stringr::str_replace_all() to take all the whitespaces > 2 and replace with a new delimiter such as "|" we can move to our next step. While we’re at it, let’s remove all the commas so that we can go straight to doubles rather than characters for all the beer production variables.

# Replace long spaces with a col break symbol
squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>% 
  str_remove_all(",")
squished_table

##  [1] "MANUFACTURE OF BEER|Current Month|Current Month|Year to Date|Year to Date"
##  [2] "Production|14981472|15012331|14981472|15012331"                           
##  [3] "Removals"                                                                 
##  [4] "Taxable ($7.00/$18.00 per barrel)"                                        
##  [5] "In bottles and cans|11571819|11908922|11571819|11908922"                  
##  [6] "In barrels and kegs|1245125|1245143|1245125|1245143"                      
##  [7] "Tax Determined Premises Use|5989|5267|5989|5267"                          
##  [8] "Sub Total Taxable|12822933|13159332|12822933|13159332"                    
##  [9] "Tax-free"                                                                 
## [10] "For export|264669|224066|264669|224066"                                   
## [11] "For vessels and aircraft|0|0|0|0"                                         
## [12] "Consumed on brewery premises|886|913|886|913"                             
## [13] "Sub Total Tax-Free|265555|224979|265555|224979"                           
## [14] "Total Removals|13088488|13384311|13088488|13384311"                       
## [15] "Stocks On Hand end-of-month:|9896961|9993268|9896961|9993268"             
## [16] "MATERIALS USED AT BREWERIES"                                              
## [17] "Malt and malt products|322480722|330304432|322480722|330304432"           
## [18] "Corn and corn products|58632672|56705162|58632672|56705162"               
## [19] "Rice and rice products|108112318|59701345|108112318|59701345"             
## [20] "Barley and barley products|4705175|3668374|4705175|3668374"               
## [21] "Wheat and wheat products|1210137|1409685|1210137|1409685"                 
## [22] "Total Grain products|495141024|451788998|495141024|451788998"             
## [23] "Sugar and syrups|73793509|47308358|73793509|47308358"                     
## [24] "Hops (dry)|6059066|4765924|6059066|4765924"                               
## [25] "Hops (used as extracts)|296605|271405|296605|271405"                      
## [26] "Other|7972930|10537742|7972930|10537742"                                  
## [27] "Total Non-Grain products|88122110|62883429|88122110|62883429"             
## [28] "Total Used|583263134|514672427|583263134|514672427"

Convert to tibble

Now we have a nicely formatted vector of strings! We can use tibble::enframe() to create a dataframe/tibble out of the vector.

# Convert to tibble
raw_df <- enframe(squished_table)

raw_df

## # A tibble: 28 x 2
##     name value                                                                  
##    <int> <chr>                                                                  
##  1     1 MANUFACTURE OF BEER|Current Month|Current Month|Year to Date|Year to D…
##  2     2 Production|14981472|15012331|14981472|15012331                         
##  3     3 Removals                                                               
##  4     4 Taxable ($7.00/$18.00 per barrel)                                      
##  5     5 In bottles and cans|11571819|11908922|11571819|11908922                
##  6     6 In barrels and kegs|1245125|1245143|1245125|1245143                    
##  7     7 Tax Determined Premises Use|5989|5267|5989|5267                        
##  8     8 Sub Total Taxable|12822933|13159332|12822933|13159332                  
##  9     9 Tax-free                                                               
## 10    10 For export|264669|224066|264669|224066                                 
## # … with 18 more rows

Next we can separate value into the 5 columns. Notice that there are a few “rows” where the data is NA as there were rows that acted only as indicators of the type of beer production. We’ll use them later.

year <- 2011
month <- "02"

# Convert to tibble
beer_df <- raw_df %>% 
    separate(value, 
             into = c("type", "month_current", "month_prior_year", "ytd_current", "ytd_prior_year"), 
             sep = "\\|") %>% 
  slice(-1) %>% 
  mutate_at(vars(month_current:ytd_prior_year), as.double) %>% 
  mutate(year = as.integer(year), month = as.integer(month)) %>% 
  select(year, month, type, everything())

## Warning: Expected 5 pieces. Missing pieces filled with `NA` in 4 rows [3, 4, 9,
## 16].

beer_df

## # A tibble: 27 x 8
##     year month type   name month_current month_prior_year ytd_current
##    <int> <int> <chr> <int>         <dbl>            <dbl>       <dbl>
##  1  2011     2 Prod…     2      14981472         15012331    14981472
##  2  2011     2 Remo…     3            NA               NA          NA
##  3  2011     2 Taxa…     4            NA               NA          NA
##  4  2011     2 In b…     5      11571819         11908922    11571819
##  5  2011     2 In b…     6       1245125          1245143     1245125
##  6  2011     2 Tax …     7          5989             5267        5989
##  7  2011     2 Sub …     8      12822933         13159332    12822933
##  8  2011     2 Tax-…     9            NA               NA          NA
##  9  2011     2 For …    10        264669           224066      264669
## 10  2011     2 For …    11             0                0           0
## # … with 17 more rows, and 1 more variable: ytd_prior_year <dbl>

Technically at this point, we have successfully converted from raw text to a dataframe/table/tibble! HOWEVER, for many many examples in the wild you will need to do additional data cleaning, data manipulation, factor assignment, etc. As such, I’ll continue working on this to get to a final output. I’ll also work on repeating this many times as opposed to one time.

Alternative method via `readr`

Thanks to Grant McDermott for bringing up a good point here - I based this method off of tables where the white-space between columns is varying. If the white space is fixed between columns you could skip some steps as seen in the below example using readr::read_fwf(), courtesy of Grant. I believe for most cases either using readr::read_table() or readr::read_fwf() would be simple, but will keep the additional workflow steps in case they help someone down the road!

Overall, using readr to natively parse the table-format could save the workflow step of trimming, coercing to a tibble, and then separating, and just requires you to indicate the spacing of empty cells either manually with fwf_widths() or guessing/parsing of columns via fwf_empty().

table_start_fwf <- stringr::str_which(raw_text, "Production") ## Changed since we're dropping the first row anyway
table_end_fwf <- stringr::str_which(raw_text, "Total Used")

## Trim the table to the start/end (NB: Don't drop whitespace this time!)
table_trimmed_fwf <- raw_text[table_start_fwf:table_end_fwf]

beer_df_fwf <- read_fwf(table_trimmed_fwf, 
                        fwf_empty(table_trimmed_fwf, 
                                  col_names = c("type", 
                                                "month_current", 
                                                "month_prior_year", 
                                                "ytd_current", 
                                                "ytd_prior_year")
                                  )
                        )

## Warning: 8 parsing failures.
## row           col                expected                    actual         file
##   2 type          33 chars                8                         literal data
##   2 NA            5 columns               1 columns                 literal data
##   3 month_current 28 chars between fields 0 chars until end of line literal data
##   3 NA            5 columns               1 columns                 literal data
##   8 type          33 chars                8                         literal data
## ... ............. ....................... ......................... ............
## See problems(...) for more details.

beer_df_fwf

## # A tibble: 27 x 5
##    type                month_current month_prior_year ytd_current ytd_prior_year
##    <chr>                       <dbl>            <dbl>       <dbl>          <dbl>
##  1 Production               14981472         15012331    14981472       15012331
##  2 Removals                       NA               NA          NA             NA
##  3 Taxable ($7.00/$18…            NA               NA          NA             NA
##  4 In bottles and cans      11571819         11908922    11571819       11908922
##  5 In barrels and kegs       1245125          1245143     1245125        1245143
##  6 Tax Determined, Pr…          5989             5267        5989           5267
##  7 Sub Total Taxable        12822933         13159332    12822933       13159332
##  8 Tax-free                       NA               NA          NA             NA
##  9 For export                 264669           224066      264669         224066
## 10 For vessels and ai…             0                0           0              0
## # … with 17 more rows

One more alternative would be to just use readr::read_table() or readr::read_table2(). Now in practice this should be fairly robust, and works just fine for the examples here, but for messier tables it may fail which leads to the more complex and longer workflow shown below. Specifically, from the readr::read_table() docs:

read_table() and read_table2() are designed to read the type of textual data where each column is separated by one (or more) > columns of space.

read_table2() is like read.table(), it allows any number of whitespace characters between columns, and the lines can be of different lengths.

read_table() is more strict, each line must be the same length, and each field is in the same position in every line. It first finds empty columns and then parses like a fixed width file.

read_table(raw_text[table_start:table_end], skip =1,
           col_names = c("type", "month_current", "month_prior_year", 
                         "ytd_current", "ytd_prior_year"))

## # A tibble: 27 x 5
##    type                month_current month_prior_year ytd_current ytd_prior_year
##    <chr>                       <dbl>            <dbl>       <dbl>          <dbl>
##  1 Production               14981472         15012331    14981472       15012331
##  2 Removals                       NA               NA          NA             NA
##  3 Taxable ($7.00/$18…            NA               NA          NA             NA
##  4 In bottles and cans      11571819         11908922    11571819       11908922
##  5 In barrels and kegs       1245125          1245143     1245125        1245143
##  6 Tax Determined, Pr…          5989             5267        5989           5267
##  7 Sub Total Taxable        12822933         13159332    12822933       13159332
##  8 Tax-free                       NA               NA          NA             NA
##  9 For export                 264669           224066      264669         224066
## 10 For vessels and ai…             0                0           0              0
## # … with 17 more rows

Proper Cleaning

This is actually two datasets that are combined into one large reporting table. As such we need to identify the specific row/point to split the dataset at. We can filter to just the row that matches either the string “MATERIALS USED” or “IN POUNDS”, as that indicates a label starting the 2nd dataset.

slice_num <- beer_df %>% 
  # find a string that has MATERIALS USED or IN POUNDS
  # | means OR
    filter(str_detect(type, "MATERIALS USED|IN POUNDS")) %>%
    pull(name)

slice_num

## [1] 16

Split dataframe

Next we will add a column based on logic for the slice_num, and assign a grouping variable for either Barrels Produced (dataset 1) or Pounds of Materials Used (dataset 2). We can then drop the unneeded rows with a filter(), group_by() the newly produced grouping variable, and use dplyr::group_split() to separate the combined dataset into a list of both datasets.

# split data into materials vs barrels produced
split_df <- beer_df %>% 
  mutate(data_type = ifelse(name >= slice_num, "Pounds of Materials Used", "Barrels Produced"),
         type = str_remove(type, ":")) %>% 
  select(data_type, everything(), -name) %>% 
  filter(!str_detect(type, "IN POUNDS|MATERIALS USED|MANUFACTURE OF BEER|BARRELS")) %>% 
  group_by(data_type) %>% 
  group_split()

glimpse(split_df)

## list<df[,8]> [1:2] 
## $ : tibble [14 × 8] (S3: tbl_df/tbl/data.frame)
## $ : tibble [12 × 8] (S3: tbl_df/tbl/data.frame)
## @ ptype: tibble [0 × 8] (S3: tbl_df/tbl/data.frame)

Factor cleaning and final dataframes

We can see that the split_df object is a list of 2 tibbles/dataframes. We can now operate on the individual dataframes and finalize the factor cleaning and assignment to make the data a bit tidier and analysis ready.

manufacture_df <- split_df[[1]] %>% 
  mutate(
    tax_status = case_when(
      type %in% c("In bottles and cans", "In kegs", "In barrels and kegs",
                  "Tax Determined, Premises Use") ~ "Taxable",
      type == "Sub Total Taxable" ~ "Sub Total Taxable",
      type %in% c("For export", "For vessels and aircraft", 
                  "Consumed on brewery premises") ~ "Tax Free",
      type == "Sub Total Tax-Free" ~ "Sub Total Tax-Free",
      type %in% c("Production", "Total Removals", 
                  "Stocks On Hand end-of-month:") ~ "Totals"
      ),
    tax_rate = dplyr::if_else(year <= 2017, "$7/$18 per barrel", "$3.50/$16 per barrel")
    ) %>% 
  filter(!is.na(tax_status)) %>% 
  select(data_type, tax_status, everything())

# clean up the material dataset
material_df <- split_df[[2]] %>% 
  mutate(
    material_type = case_when(
      str_detect(type, "Malt|Corn|Rice|Barley|Wheat") ~ "Grain Products",
      str_detect(type, "Sugar|Hops|Other") ~ "Non-Grain Products",
      str_detect(type, "Total") ~ type
    )
  ) %>% 
  select(data_type, material_type, everything())

Print the dataframes

The manufacture dataframe now has the labels, factors, etc separated into nice columns, with the 4x columns for specific barrels produced.

manufacture_df

## # A tibble: 9 x 10
##   data_type tax_status  year month type  month_current month_prior_year
##   <chr>     <chr>      <int> <int> <chr>         <dbl>            <dbl>
## 1 Barrels … Totals      2011     2 Prod…      14981472         15012331
## 2 Barrels … Taxable     2011     2 In b…      11571819         11908922
## 3 Barrels … Taxable     2011     2 In b…       1245125          1245143
## 4 Barrels … Sub Total…  2011     2 Sub …      12822933         13159332
## 5 Barrels … Tax Free    2011     2 For …        264669           224066
## 6 Barrels … Tax Free    2011     2 For …             0                0
## 7 Barrels … Tax Free    2011     2 Cons…           886              913
## 8 Barrels … Sub Total…  2011     2 Sub …        265555           224979
## 9 Barrels … Totals      2011     2 Tota…      13088488         13384311
## # … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
## #   tax_rate <chr>

The material dataframe now has the labels, factors, etc separated into nice columns, with the 4x columns for specific pounds of product used.

material_df

## # A tibble: 12 x 9
##    data_type material_type  year month type  month_current month_prior_year
##    <chr>     <chr>         <int> <int> <chr>         <dbl>            <dbl>
##  1 Pounds o… Grain Produc…  2011     2 Malt…     322480722        330304432
##  2 Pounds o… Grain Produc…  2011     2 Corn…      58632672         56705162
##  3 Pounds o… Grain Produc…  2011     2 Rice…     108112318         59701345
##  4 Pounds o… Grain Produc…  2011     2 Barl…       4705175          3668374
##  5 Pounds o… Grain Produc…  2011     2 Whea…       1210137          1409685
##  6 Pounds o… Total Grain …  2011     2 Tota…     495141024        451788998
##  7 Pounds o… Non-Grain Pr…  2011     2 Suga…      73793509         47308358
##  8 Pounds o… Non-Grain Pr…  2011     2 Hops…       6059066          4765924
##  9 Pounds o… Non-Grain Pr…  2011     2 Hops…        296605           271405
## 10 Pounds o… Non-Grain Pr…  2011     2 Other       7972930         10537742
## 11 Pounds o… Total Non-Gr…  2011     2 Tota…      88122110         62883429
## 12 Pounds o… Total Used     2011     2 Tota…     583263134        514672427
## # … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>

Finished Cleaning

We have now finished cleaning the manufacting and material dataframes! However, we did this all line-by-line without functions and would need to repeat this for the other 47 PDFs! Let’s convert ALL that code into a function that outputs the final dataframes.

Use a function

# create a function that works for most years
get_beer_tables <- function(year, month) {
  
  # read in the raw PDF
  raw_text <- pdftools::pdf_text(glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")) %>%
    str_split("\n") %>%
    unlist()

  ## Build Table

  # find start of table
  table_start <- stringr::str_which(raw_text, "MANUFACTURE OF BEER")

  # End of table (drop all the asterisks and the other descriptors)
  table_end <- stringr::str_which(raw_text, "Total Used")

  # Trim the table to the start/end and drop whitespace at each line
  table_trimmed <- raw_text[table_start:table_end] %>%
    str_trim()

  table_trimmed

  ### Remove all the extra whitespace

  # Replace long spaces with a col break symbol
  squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>%
    str_remove_all(",")

  ### Convert to tibble

  # Convert to tibble
  raw_df <- enframe(squished_table)
  
  # split the rows into their columns
  beer_df <- suppressWarnings(raw_df %>%
    separate(value,
      into = c("type", "month_current", "month_prior_year", "ytd_current", "ytd_prior_year"),
      sep = "\\|"
    ) %>%
    slice(-1) %>%
    mutate_at(vars(month_current:ytd_prior_year), as.double) %>%
    mutate(year = as.integer(year), month = as.integer(month)) %>%
    select(year, month, type, everything()))

  ### Proper Cleaning

  # ID the specific row/point to split the dataset at.

  slice_num <- beer_df %>%
    # find a string that has MATERIALS USED or IN POUNDS
    # | means OR
    filter(str_detect(type, "MATERIALS USED|IN POUNDS")) %>%
    pull(name)

  #### Split dataframe

  # split data into materials vs barrels produced
  split_df <- suppressWarnings(beer_df %>%
    mutate(
      data_type = ifelse(name >= slice_num, "Pounds of Materials Used", "Barrels Produced"),
      type = str_remove(type, ":")
    ) %>%
    select(data_type, everything(), -name) %>%
    filter(!str_detect(type, "IN POUNDS|MATERIALS USED|MANUFACTURE OF BEER|BARRELS")) %>%
    group_by(data_type) %>%
    group_split())

  #### Factor cleaning and final dataframes

  # clean manufacture df
  manufacture_df <- split_df[[1]] %>%
    mutate(
      tax_status = case_when(
        type %in% c(
          "In bottles and cans", "In kegs", "In barrels and kegs",
          "Tax Determined, Premises Use"
        ) ~ "Taxable",
        type == "Sub Total Taxable" ~ "Sub Total Taxable",
        type %in% c(
          "For export", "For vessels and aircraft",
          "Consumed on brewery premises"
        ) ~ "Tax Free",
        type == "Sub Total Tax-Free" ~ "Sub Total Tax-Free",
        type %in% c(
          "Production", "Total Removals",
          "Stocks On Hand end-of-month:"
        ) ~ "Totals"
      ),
      tax_rate = dplyr::if_else(year <= 2017, "$7/$18 per barrel", "$3.50/$16 per barrel")
    ) %>%
    filter(!is.na(tax_status)) %>%
    select(data_type, tax_status, everything())

  # clean up the material dataset
  material_df <- split_df[[2]] %>%
    mutate(
      material_type = case_when(
        str_detect(type, "Malt|Corn|Rice|Barley|Wheat") ~ "Grain Products",
        str_detect(type, "Sugar|Hops|Other") ~ "Non-Grain Products",
        str_detect(type, "Total") ~ type
      )
    ) %>%
    select(data_type, material_type, everything())

  # output a list of both dfs
  list(manufacture_df, material_df)
}

Really the only code we have changed is we added a glue call to add the year, month to which PDF to read in, and we have the output as a list of both dataframes. Let’s test our function!

get_beer_tables(2011, "01")

## [[1]]
## # A tibble: 9 x 10
##   data_type tax_status  year month type  month_current month_prior_year
##   <chr>     <chr>      <int> <int> <chr>         <dbl>            <dbl>
## 1 Barrels … Totals      2011     1 Prod…      14981472         15012331
## 2 Barrels … Taxable     2011     1 In b…      11571819         11908922
## 3 Barrels … Taxable     2011     1 In b…       1245125          1245143
## 4 Barrels … Sub Total…  2011     1 Sub …      12822933         13159332
## 5 Barrels … Tax Free    2011     1 For …        264669           224066
## 6 Barrels … Tax Free    2011     1 For …             0                0
## 7 Barrels … Tax Free    2011     1 Cons…           886              913
## 8 Barrels … Sub Total…  2011     1 Sub …        265555           224979
## 9 Barrels … Totals      2011     1 Tota…      13088488         13384311
## # … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
## #   tax_rate <chr>
## 
## [[2]]
## # A tibble: 12 x 9
##    data_type material_type  year month type  month_current month_prior_year
##    <chr>     <chr>         <int> <int> <chr>         <dbl>            <dbl>
##  1 Pounds o… Grain Produc…  2011     1 Malt…     322480722        330304432
##  2 Pounds o… Grain Produc…  2011     1 Corn…      58632672         56705162
##  3 Pounds o… Grain Produc…  2011     1 Rice…     108112318         59701345
##  4 Pounds o… Grain Produc…  2011     1 Barl…       4705175          3668374
##  5 Pounds o… Grain Produc…  2011     1 Whea…       1210137          1409685
##  6 Pounds o… Total Grain …  2011     1 Tota…     495141024        451788998
##  7 Pounds o… Non-Grain Pr…  2011     1 Suga…      73793509         47308358
##  8 Pounds o… Non-Grain Pr…  2011     1 Hops…       6059066          4765924
##  9 Pounds o… Non-Grain Pr…  2011     1 Hops…        296605           271405
## 10 Pounds o… Non-Grain Pr…  2011     1 Other       7972930         10537742
## 11 Pounds o… Total Non-Gr…  2011     1 Tota…      88122110         62883429
## 12 Pounds o… Total Used     2011     1 Tota…     583263134        514672427
## # … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>

Boom! Function is working for our example, let’s try it out with more than 1 input via purrr!

`purrr` - iteration without repetition

We’ll be using pmap() to apply our function multiple times, where pmap can take any number of inputs. For example if we call get_beer_tables() via pmap, we can get our tables for that 1 year/month combo!

# Quick test of purrr
pmap(list(2011, "02"), get_beer_tables)

## [[1]]
## [[1]][[1]]
## # A tibble: 9 x 10
##   data_type tax_status  year month type  month_current month_prior_year
##   <chr>     <chr>      <int> <int> <chr>         <dbl>            <dbl>
## 1 Barrels … Totals      2011     2 Prod…      14350832         14297845
## 2 Barrels … Taxable     2011     2 In b…      11509264         11847648
## 3 Barrels … Taxable     2011     2 In b…       1234147          1308677
## 4 Barrels … Sub Total…  2011     2 Sub …      12749447         13161061
## 5 Barrels … Tax Free    2011     2 For …        279150           263721
## 6 Barrels … Tax Free    2011     2 For …             0                0
## 7 Barrels … Tax Free    2011     2 Cons…           942              719
## 8 Barrels … Sub Total…  2011     2 Sub …        280092           264440
## 9 Barrels … Totals      2011     2 Tota…      13029539         13425501
## # … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
## #   tax_rate <chr>
## 
## [[1]][[2]]
## # A tibble: 12 x 9
##    data_type material_type  year month type  month_current month_prior_year
##    <chr>     <chr>         <int> <int> <chr>         <dbl>            <dbl>
##  1 Pounds o… Grain Produc…  2011     2 Malt…     307076591        305543380
##  2 Pounds o… Grain Produc…  2011     2 Corn…      53981943         54486996
##  3 Pounds o… Grain Produc…  2011     2 Rice…      54287863         55198940
##  4 Pounds o… Grain Produc…  2011     2 Barl…       4322047          3726630
##  5 Pounds o… Grain Produc…  2011     2 Whea…        955671          1683401
##  6 Pounds o… Total Grain …  2011     2 Tota…     420624115        420639347
##  7 Pounds o… Non-Grain Pr…  2011     2 Suga…      63374850         46718854
##  8 Pounds o… Non-Grain Pr…  2011     2 Hops…       7617974          4899463
##  9 Pounds o… Non-Grain Pr…  2011     2 Hops…        275963           268589
## 10 Pounds o… Non-Grain Pr…  2011     2 Other       7997916          9022834
## 11 Pounds o… Total Non-Gr…  2011     2 Tota…      79266703         60909740
## 12 Pounds o… Total Used     2011     2 Tota…     499890818        481549087
## # … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>

However our goal is all the inputs at once! We can create a vector of the month inputs as character strings, and then use tidyr::crossing() to output all the possible combinations of year + month as a dataframe. Notice two columns, year and month with a length of 48 - equal to all of our PDFs!

# add the month_num as vector
month_num <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")

# use crossing to generate all combos for the data 
# 2010 is missing, but as the data has prior year data we can theoretically
# add it back in after the fact

crossing(
  year = c(2011:2014), 
  month = month_num
  ) %>% glimpse()

## Rows: 48
## Columns: 2
## $ year  <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 201…
## $ month <chr> "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11…

All possible combos

We can use tidyr::crossing() again to generate the possible inputs and create the output dataframes as list column of two dataframes. Running this takes only about 2 seconds across the 48 PDFs! The output is not very exciting as the data is simply the year & month columns, plus a list-column called data. Let’s get the final outputs!

# add the month_num as vector
month_num <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")

# use crossing to generate all combos for the data 
df_2011_2014 <- crossing(year = c(2011:2014), 
                         month = month_num) %>% 
  mutate(data = pmap(., get_beer_tables))

df_2011_2014

## # A tibble: 48 x 3
##     year month data      
##    <int> <chr> <list>    
##  1  2011 01    <list [2]>
##  2  2011 02    <list [2]>
##  3  2011 03    <list [2]>
##  4  2011 04    <list [2]>
##  5  2011 05    <list [2]>
##  6  2011 06    <list [2]>
##  7  2011 07    <list [2]>
##  8  2011 08    <list [2]>
##  9  2011 09    <list [2]>
## 10  2011 10    <list [2]>
## # … with 38 more rows

Final output

We can now get just the output data, drop the other columns. We’re still working with list-columns, so let’s get to the manufacture_df and material_df.

final_output <- df_2011_2014 %>%
  # grab the data into respective columns
  mutate(manufacture_data = map(data, 1),
         material_data = map(data, 2)) %>% 
  select(manufacture_data, material_data)

final_output

## # A tibble: 48 x 2
##    manufacture_data  material_data    
##    <list>            <list>           
##  1 <tibble [9 × 10]> <tibble [12 × 9]>
##  2 <tibble [9 × 10]> <tibble [12 × 9]>
##  3 <tibble [9 × 10]> <tibble [12 × 9]>
##  4 <tibble [9 × 10]> <tibble [12 × 9]>
##  5 <tibble [9 × 10]> <tibble [12 × 9]>
##  6 <tibble [9 × 10]> <tibble [12 × 9]>
##  7 <tibble [9 × 10]> <tibble [12 × 9]>
##  8 <tibble [9 × 10]> <tibble [12 × 9]>
##  9 <tibble [9 × 10]> <tibble [12 × 9]>
## 10 <tibble [9 × 10]> <tibble [12 × 9]>
## # … with 38 more rows

The manufacture dataframe can be combined as below.

# Grab just the manufacture data
manufacture_df <- final_output %>% 
  select(manufacture_data) %>% 
  unnest(manufacture_data)

# Grab just the manufacture data
material_df <- final_output %>% 
  select(material_data) %>% 
  unnest(material_data)

And now we can look at the outputs!

Manufacture dataset

glimpse(manufacture_df)

## Rows: 432
## Columns: 10
## $ data_type        <chr> "Barrels Produced", "Barrels Produced", "Barrels Pro…
## $ tax_status       <chr> "Totals", "Taxable", "Taxable", "Sub Total Taxable",…
## $ year             <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ month            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ type             <chr> "Production", "In bottles and cans", "In barrels and…
## $ month_current    <dbl> 14981472, 11571819, 1245125, 12822933, 264669, 0, 88…
## $ month_prior_year <dbl> 15012331, 11908922, 1245143, 13159332, 224066, 0, 91…
## $ ytd_current      <dbl> 14981472, 11571819, 1245125, 12822933, 264669, 0, 88…
## $ ytd_prior_year   <dbl> 15012331, 11908922, 1245143, 13159332, 224066, 0, 91…
## $ tax_rate         <chr> "$7/$18 per barrel", "$7/$18 per barrel", "$7/$18 pe…

Material dataset

glimpse(material_df)

## Rows: 576
## Columns: 9
## $ data_type        <chr> "Pounds of Materials Used", "Pounds of Materials Use…
## $ material_type    <chr> "Grain Products", "Grain Products", "Grain Products"…
## $ year             <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ month            <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2…
## $ type             <chr> "Malt and malt products", "Corn and corn products", …
## $ month_current    <dbl> 322480722, 58632672, 108112318, 4705175, 1210137, 49…
## $ month_prior_year <dbl> 330304432, 56705162, 59701345, 3668374, 1409685, 451…
## $ ytd_current      <dbl> 322480722, 58632672, 108112318, 4705175, 1210137, 49…
## $ ytd_prior_year   <dbl> 330304432, 56705162, 59701345, 3668374, 1409685, 451…

Do it all in 6 Lines of Code!

Now all of that could have been done in about 6 lines of tidyverse code since we created a function.

# Use crossing to generate all 48 combos for the data 
# Use purrr to read in, clean, and output the 96 tables from the 48 PDFs
final_output <- crossing(year = c(2011:2014), month = month_num) %>% 
  mutate(data = pmap(., get_beer_tables)) %>% 
  mutate(manufacture_data = map(data, 1), material_data = map(data, 2)) %>% 
  select(manufacture_data, material_data)

# Grab just the manufacture data
manufacture_df <- final_output %>% select(manufacture_data) %>% unnest(manufacture_data)

# Grab just the manufacture data
material_df <- final_output %>% select(material_data) %>% unnest(material_data)

Beer and pdftools - a vignette

Thomas Mock

3/30/2020

Scraping Complex Tables from PDFs with PDF Tools

Load Libraries

PDFs

List File Names

Raw PDFs

Split by row

Build Table

Remove all the extra whitespace

Convert to tibble

Alternative method via `readr`

Proper Cleaning

Split dataframe

Factor cleaning and final dataframes

Print the dataframes

Finished Cleaning

Use a function

`purrr` - iteration without repetition

All possible combos

Final output

Manufacture dataset

Material dataset

Do it all in 6 Lines of Code!

Beer and pdftools - a vignette

Thomas Mock

3/30/2020

Scraping Complex Tables from PDFs with PDF Tools

Load Libraries

PDFs

List File Names

Raw PDFs

Split by row

Build Table

Remove all the extra whitespace

Convert to tibble

Alternative method via readr

Proper Cleaning

Split dataframe

Factor cleaning and final dataframes

Print the dataframes

Finished Cleaning

Use a function

purrr - iteration without repetition

All possible combos

Final output

Manufacture dataset

Material dataset

Do it all in 6 Lines of Code!

Alternative method via `readr`

`purrr` - iteration without repetition