The goal of this is to provide a guide to extracting irregularly formatted tables from PDFs.
We’ll use ROpenSci’s pdftools
package along with several tidyverse
packages: - stringr
- text manipulation - dplyr
- general data manipulation - tidyr
- data cleaning - purrr
- repeated application of a function
library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✓ ggplot2 3.3.0 ✓ purrr 0.3.3
## ✓ tibble 3.0.0.9000 ✓ dplyr 0.8.99.9002
## ✓ tidyr 1.0.2 ✓ stringr 1.4.0
## ✓ readr 1.3.1 ✓ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(pdftools)
The PDFs for this guide come from Alcohol and Tobacco Tax and Trade Bureau. We’ll use the 2011-2014 data for this example (84 total PDFs). For the purpose of today the files have already been downloaded, but I used the following script.
# General function for download
download_monthly_stats_pdf <- function(year){
message(paste0("Downloading ", year))
# The general format is yearmonth like 201101 for Jan 2011.
month_in <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
year_vec <- rep(year, 12)
url_build <- function(year_vec, month_in){
glue::glue("https://www.ttb.gov/images/pdfs/statistics/{year}/{year}{month_in}beer.pdf")
}
# output to the pdfs folder
download_monthly_pdf <- function(year, month, url_in){
download.file(
url = url_in,
destfile = glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")
)
}
# build all the input urls and attach to an input dataframe
full_df <- tibble(year = year_vec, month = month_in) %>%
mutate(url_in = pmap_chr(.l = list(year_vec, month_in), .f = url_build))
# The pwalk here takes all 3 inputs and applies them to download_monthly_pdf function
pwalk(full_df, .f = download_monthly_pdf)
}
We could apply that function to all the years of interest with another purrr::walk()
call. This will run download_monthly_stats_pdf()
for 2011, 2012, 2013, and 2014.
walk(c(2011:2014), download_monthly_stats_pdf)
# list all the files we have downloaded so far
all_files <- list.files("pdfs")
length(all_files)
## [1] 48
We have 48 PDFs, as expected - 12 months x 4 years = 48!
Now let’s take a peek inside one of the PDFs.
When we run pdftools::pdf_text()
we can see a decently formatted table. The main issue to consider is that there is a lot erroneous header descriptions, and there are unequal spacing between “columns” in the table. Importantly, each line of the PDF is separated by a newline \n
. This is key to our strategy for pulling out individual lines.
pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf")
## [1] " Report Date:\n DEPARTMENT OF THE TREASURY 30-MAR-2011\n ALCOHOL AND TOBACCO TAX AND TRADE BUREAU\n Report Symbol:\n STATISTICAL REPORT - BEER TTB S 5130-01-2011\n Reporting Period: January 2011 Page: 1 of 1\n Current Year Prior Year\n Prior Year Cumulative Cumulative\nMANUFACTURE OF BEER Current Month Current Month Year to Date Year to Date\nProduction 14,981,472 15,012,331 14,981,472 15,012,331\nRemovals\nTaxable ($7.00/$18.00 per barrel)\n In bottles and cans 11,571,819 11,908,922 11,571,819 11,908,922\n In barrels and kegs 1,245,125 1,245,143 1,245,125 1,245,143\n Tax Determined, Premises Use 5,989 5,267 5,989 5,267\n Sub Total Taxable 12,822,933 13,159,332 12,822,933 13,159,332\nTax-free\n For export 264,669 224,066 264,669 224,066\n For vessels and aircraft 0 0 0 0\n Consumed on brewery premises 886 913 886 913\n Sub Total Tax-Free 265,555 224,979 265,555 224,979\n Total Removals 13,088,488 13,384,311 13,088,488 13,384,311\nStocks On Hand end-of-month: 9,896,961 9,993,268 9,896,961 9,993,268\nMATERIALS USED AT BREWERIES\n Malt and malt products 322,480,722 330,304,432 322,480,722 330,304,432\n Corn and corn products 58,632,672 56,705,162 58,632,672 56,705,162\n Rice and rice products 108,112,318 59,701,345 108,112,318 59,701,345\n Barley and barley products 4,705,175 3,668,374 4,705,175 3,668,374\n Wheat and wheat products 1,210,137 1,409,685 1,210,137 1,409,685\n Total Grain products 495,141,024 451,788,998 495,141,024 451,788,998\n Sugar and syrups 73,793,509 47,308,358 73,793,509 47,308,358\n Hops (dry) 6,059,066 4,765,924 6,059,066 4,765,924\n Hops (used as extracts) 296,605 271,405 296,605 271,405\n Other 7,972,930 10,537,742 7,972,930 10,537,742\n Total Non-Grain products 88,122,110 62,883,429 88,122,110 62,883,429\nTotal Used 583,263,134 514,672,427 583,263,134 514,672,427\n 296,605 Pounds of hops is equivalent to 212,541 pounds of extract JAN 2011\n 271,405 Pounds of hops is equivalent to 101,087 pounds of extract JAN 2010\nNOTE: Changes in figures from prior reports could be due to amended reports being filed.\n This data is not final and may need to be amended.\nhttp://www.ttb.gov\n"
We can use stringr::str_split()
to separate the text at each of the \n
newlines. This generates a list of character strings, we call unlist()
to extract to a vector. We now have a nicely separated vector of character strings, where each row is a new string.
raw_text <- pdftools::pdf_text("pdfs/ttb_monthly_stats_2011-01.pdf") %>%
str_split("\n") %>%
unlist()
raw_text
## [1] " Report Date:"
## [2] " DEPARTMENT OF THE TREASURY 30-MAR-2011"
## [3] " ALCOHOL AND TOBACCO TAX AND TRADE BUREAU"
## [4] " Report Symbol:"
## [5] " STATISTICAL REPORT - BEER TTB S 5130-01-2011"
## [6] " Reporting Period: January 2011 Page: 1 of 1"
## [7] " Current Year Prior Year"
## [8] " Prior Year Cumulative Cumulative"
## [9] "MANUFACTURE OF BEER Current Month Current Month Year to Date Year to Date"
## [10] "Production 14,981,472 15,012,331 14,981,472 15,012,331"
## [11] "Removals"
## [12] "Taxable ($7.00/$18.00 per barrel)"
## [13] " In bottles and cans 11,571,819 11,908,922 11,571,819 11,908,922"
## [14] " In barrels and kegs 1,245,125 1,245,143 1,245,125 1,245,143"
## [15] " Tax Determined, Premises Use 5,989 5,267 5,989 5,267"
## [16] " Sub Total Taxable 12,822,933 13,159,332 12,822,933 13,159,332"
## [17] "Tax-free"
## [18] " For export 264,669 224,066 264,669 224,066"
## [19] " For vessels and aircraft 0 0 0 0"
## [20] " Consumed on brewery premises 886 913 886 913"
## [21] " Sub Total Tax-Free 265,555 224,979 265,555 224,979"
## [22] " Total Removals 13,088,488 13,384,311 13,088,488 13,384,311"
## [23] "Stocks On Hand end-of-month: 9,896,961 9,993,268 9,896,961 9,993,268"
## [24] "MATERIALS USED AT BREWERIES"
## [25] " Malt and malt products 322,480,722 330,304,432 322,480,722 330,304,432"
## [26] " Corn and corn products 58,632,672 56,705,162 58,632,672 56,705,162"
## [27] " Rice and rice products 108,112,318 59,701,345 108,112,318 59,701,345"
## [28] " Barley and barley products 4,705,175 3,668,374 4,705,175 3,668,374"
## [29] " Wheat and wheat products 1,210,137 1,409,685 1,210,137 1,409,685"
## [30] " Total Grain products 495,141,024 451,788,998 495,141,024 451,788,998"
## [31] " Sugar and syrups 73,793,509 47,308,358 73,793,509 47,308,358"
## [32] " Hops (dry) 6,059,066 4,765,924 6,059,066 4,765,924"
## [33] " Hops (used as extracts) 296,605 271,405 296,605 271,405"
## [34] " Other 7,972,930 10,537,742 7,972,930 10,537,742"
## [35] " Total Non-Grain products 88,122,110 62,883,429 88,122,110 62,883,429"
## [36] "Total Used 583,263,134 514,672,427 583,263,134 514,672,427"
## [37] " 296,605 Pounds of hops is equivalent to 212,541 pounds of extract JAN 2011"
## [38] " 271,405 Pounds of hops is equivalent to 101,087 pounds of extract JAN 2010"
## [39] "NOTE: Changes in figures from prior reports could be due to amended reports being filed."
## [40] " This data is not final and may need to be amended."
## [41] "http://www.ttb.gov"
## [42] ""
Now that we have the data split into a vector we can start finding “rows” to drop. We can see that the 9th string is actually the column titles, and the table ends at the 36th string. However, this could change according to which PDF we are looking at, so rather than going by position we can use stringr::str_which()
to match a logical with matched text.
# Start of table - column names
raw_text[9]
## [1] "MANUFACTURE OF BEER Current Month Current Month Year to Date Year to Date"
# End of table - last value
raw_text[36]
## [1] "Total Used 583,263,134 514,672,427 583,263,134 514,672,427"
We get the same “rows” with our matching str_which()
.
# find start of table
stringr::str_which(raw_text, "MANUFACTURE OF BEER")
## [1] 9
# find end of table
stringr::str_which(raw_text, "Total Used")
## [1] 36
Let’s actually assign this now, rather than just printing. We can also remove leading/trailing whitespace with stringr::str_trim()
. When we look at table_trimmed we can “see” a group of text strings that much closer resemble a table!
table_start <- stringr::str_which(raw_text, "MANUFACTURE OF BEER")
# End of table (drop all the asterisks and the other descriptors)
table_end <- stringr::str_which(raw_text, "Total Used")
# Trim the table to the start/end and drop whitespace at each line
table_trimmed <- raw_text[table_start:table_end] %>%
str_trim()
table_trimmed
## [1] "MANUFACTURE OF BEER Current Month Current Month Year to Date Year to Date"
## [2] "Production 14,981,472 15,012,331 14,981,472 15,012,331"
## [3] "Removals"
## [4] "Taxable ($7.00/$18.00 per barrel)"
## [5] "In bottles and cans 11,571,819 11,908,922 11,571,819 11,908,922"
## [6] "In barrels and kegs 1,245,125 1,245,143 1,245,125 1,245,143"
## [7] "Tax Determined, Premises Use 5,989 5,267 5,989 5,267"
## [8] "Sub Total Taxable 12,822,933 13,159,332 12,822,933 13,159,332"
## [9] "Tax-free"
## [10] "For export 264,669 224,066 264,669 224,066"
## [11] "For vessels and aircraft 0 0 0 0"
## [12] "Consumed on brewery premises 886 913 886 913"
## [13] "Sub Total Tax-Free 265,555 224,979 265,555 224,979"
## [14] "Total Removals 13,088,488 13,384,311 13,088,488 13,384,311"
## [15] "Stocks On Hand end-of-month: 9,896,961 9,993,268 9,896,961 9,993,268"
## [16] "MATERIALS USED AT BREWERIES"
## [17] "Malt and malt products 322,480,722 330,304,432 322,480,722 330,304,432"
## [18] "Corn and corn products 58,632,672 56,705,162 58,632,672 56,705,162"
## [19] "Rice and rice products 108,112,318 59,701,345 108,112,318 59,701,345"
## [20] "Barley and barley products 4,705,175 3,668,374 4,705,175 3,668,374"
## [21] "Wheat and wheat products 1,210,137 1,409,685 1,210,137 1,409,685"
## [22] "Total Grain products 495,141,024 451,788,998 495,141,024 451,788,998"
## [23] "Sugar and syrups 73,793,509 47,308,358 73,793,509 47,308,358"
## [24] "Hops (dry) 6,059,066 4,765,924 6,059,066 4,765,924"
## [25] "Hops (used as extracts) 296,605 271,405 296,605 271,405"
## [26] "Other 7,972,930 10,537,742 7,972,930 10,537,742"
## [27] "Total Non-Grain products 88,122,110 62,883,429 88,122,110 62,883,429"
## [28] "Total Used 583,263,134 514,672,427 583,263,134 514,672,427"
Next we need to remove all the huge whitespaces from between columns. The regular expression (regex) of "\\s{2,}"
matches whitespaces of 2 or more. If we use stringr::str_replace_all()
to take all the whitespaces > 2 and replace with a new delimiter such as "|"
we can move to our next step. While we’re at it, let’s remove all the commas so that we can go straight to doubles rather than characters for all the beer production variables.
# Replace long spaces with a col break symbol
squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>%
str_remove_all(",")
squished_table
## [1] "MANUFACTURE OF BEER|Current Month|Current Month|Year to Date|Year to Date"
## [2] "Production|14981472|15012331|14981472|15012331"
## [3] "Removals"
## [4] "Taxable ($7.00/$18.00 per barrel)"
## [5] "In bottles and cans|11571819|11908922|11571819|11908922"
## [6] "In barrels and kegs|1245125|1245143|1245125|1245143"
## [7] "Tax Determined Premises Use|5989|5267|5989|5267"
## [8] "Sub Total Taxable|12822933|13159332|12822933|13159332"
## [9] "Tax-free"
## [10] "For export|264669|224066|264669|224066"
## [11] "For vessels and aircraft|0|0|0|0"
## [12] "Consumed on brewery premises|886|913|886|913"
## [13] "Sub Total Tax-Free|265555|224979|265555|224979"
## [14] "Total Removals|13088488|13384311|13088488|13384311"
## [15] "Stocks On Hand end-of-month:|9896961|9993268|9896961|9993268"
## [16] "MATERIALS USED AT BREWERIES"
## [17] "Malt and malt products|322480722|330304432|322480722|330304432"
## [18] "Corn and corn products|58632672|56705162|58632672|56705162"
## [19] "Rice and rice products|108112318|59701345|108112318|59701345"
## [20] "Barley and barley products|4705175|3668374|4705175|3668374"
## [21] "Wheat and wheat products|1210137|1409685|1210137|1409685"
## [22] "Total Grain products|495141024|451788998|495141024|451788998"
## [23] "Sugar and syrups|73793509|47308358|73793509|47308358"
## [24] "Hops (dry)|6059066|4765924|6059066|4765924"
## [25] "Hops (used as extracts)|296605|271405|296605|271405"
## [26] "Other|7972930|10537742|7972930|10537742"
## [27] "Total Non-Grain products|88122110|62883429|88122110|62883429"
## [28] "Total Used|583263134|514672427|583263134|514672427"
Now we have a nicely formatted vector of strings! We can use tibble::enframe()
to create a dataframe/tibble out of the vector.
# Convert to tibble
raw_df <- enframe(squished_table)
raw_df
## # A tibble: 28 x 2
## name value
## <int> <chr>
## 1 1 MANUFACTURE OF BEER|Current Month|Current Month|Year to Date|Year to D…
## 2 2 Production|14981472|15012331|14981472|15012331
## 3 3 Removals
## 4 4 Taxable ($7.00/$18.00 per barrel)
## 5 5 In bottles and cans|11571819|11908922|11571819|11908922
## 6 6 In barrels and kegs|1245125|1245143|1245125|1245143
## 7 7 Tax Determined Premises Use|5989|5267|5989|5267
## 8 8 Sub Total Taxable|12822933|13159332|12822933|13159332
## 9 9 Tax-free
## 10 10 For export|264669|224066|264669|224066
## # … with 18 more rows
Next we can separate value into the 5 columns. Notice that there are a few “rows” where the data is NA as there were rows that acted only as indicators of the type of beer production. We’ll use them later.
year <- 2011
month <- "02"
# Convert to tibble
beer_df <- raw_df %>%
separate(value,
into = c("type", "month_current", "month_prior_year", "ytd_current", "ytd_prior_year"),
sep = "\\|") %>%
slice(-1) %>%
mutate_at(vars(month_current:ytd_prior_year), as.double) %>%
mutate(year = as.integer(year), month = as.integer(month)) %>%
select(year, month, type, everything())
## Warning: Expected 5 pieces. Missing pieces filled with `NA` in 4 rows [3, 4, 9,
## 16].
beer_df
## # A tibble: 27 x 8
## year month type name month_current month_prior_year ytd_current
## <int> <int> <chr> <int> <dbl> <dbl> <dbl>
## 1 2011 2 Prod… 2 14981472 15012331 14981472
## 2 2011 2 Remo… 3 NA NA NA
## 3 2011 2 Taxa… 4 NA NA NA
## 4 2011 2 In b… 5 11571819 11908922 11571819
## 5 2011 2 In b… 6 1245125 1245143 1245125
## 6 2011 2 Tax … 7 5989 5267 5989
## 7 2011 2 Sub … 8 12822933 13159332 12822933
## 8 2011 2 Tax-… 9 NA NA NA
## 9 2011 2 For … 10 264669 224066 264669
## 10 2011 2 For … 11 0 0 0
## # … with 17 more rows, and 1 more variable: ytd_prior_year <dbl>
Technically at this point, we have successfully converted from raw text to a dataframe/table/tibble! HOWEVER, for many many examples in the wild you will need to do additional data cleaning, data manipulation, factor assignment, etc. As such, I’ll continue working on this to get to a final output. I’ll also work on repeating this many times as opposed to one time.
readr
Thanks to Grant McDermott for bringing up a good point here - I based this method off of tables where the white-space between columns is varying. If the white space is fixed between columns you could skip some steps as seen in the below example using readr::read_fwf()
, courtesy of Grant. I believe for most cases either using readr::read_table()
or readr::read_fwf()
would be simple, but will keep the additional workflow steps in case they help someone down the road!
Overall, using readr
to natively parse the table-format could save the workflow step of trimming, coercing to a tibble, and then separating, and just requires you to indicate the spacing of empty cells either manually with fwf_widths()
or guessing/parsing of columns via fwf_empty()
.
table_start_fwf <- stringr::str_which(raw_text, "Production") ## Changed since we're dropping the first row anyway
table_end_fwf <- stringr::str_which(raw_text, "Total Used")
## Trim the table to the start/end (NB: Don't drop whitespace this time!)
table_trimmed_fwf <- raw_text[table_start_fwf:table_end_fwf]
beer_df_fwf <- read_fwf(table_trimmed_fwf,
fwf_empty(table_trimmed_fwf,
col_names = c("type",
"month_current",
"month_prior_year",
"ytd_current",
"ytd_prior_year")
)
)
## Warning: 8 parsing failures.
## row col expected actual file
## 2 type 33 chars 8 literal data
## 2 NA 5 columns 1 columns literal data
## 3 month_current 28 chars between fields 0 chars until end of line literal data
## 3 NA 5 columns 1 columns literal data
## 8 type 33 chars 8 literal data
## ... ............. ....................... ......................... ............
## See problems(...) for more details.
beer_df_fwf
## # A tibble: 27 x 5
## type month_current month_prior_year ytd_current ytd_prior_year
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Production 14981472 15012331 14981472 15012331
## 2 Removals NA NA NA NA
## 3 Taxable ($7.00/$18… NA NA NA NA
## 4 In bottles and cans 11571819 11908922 11571819 11908922
## 5 In barrels and kegs 1245125 1245143 1245125 1245143
## 6 Tax Determined, Pr… 5989 5267 5989 5267
## 7 Sub Total Taxable 12822933 13159332 12822933 13159332
## 8 Tax-free NA NA NA NA
## 9 For export 264669 224066 264669 224066
## 10 For vessels and ai… 0 0 0 0
## # … with 17 more rows
One more alternative would be to just use readr::read_table()
or readr::read_table2()
. Now in practice this should be fairly robust, and works just fine for the examples here, but for messier tables it may fail which leads to the more complex and longer workflow shown below. Specifically, from the readr::read_table()
docs:
read_table()
andread_table2()
are designed to read the type of textual data where each column is separated by one (or more) > columns of space.
read_table2()
is likeread.table()
, it allows any number of whitespace characters between columns, and the lines can be of different lengths.
read_table()
is more strict, each line must be the same length, and each field is in the same position in every line. It first finds empty columns and then parses like a fixed width file.
read_table(raw_text[table_start:table_end], skip =1,
col_names = c("type", "month_current", "month_prior_year",
"ytd_current", "ytd_prior_year"))
## # A tibble: 27 x 5
## type month_current month_prior_year ytd_current ytd_prior_year
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Production 14981472 15012331 14981472 15012331
## 2 Removals NA NA NA NA
## 3 Taxable ($7.00/$18… NA NA NA NA
## 4 In bottles and cans 11571819 11908922 11571819 11908922
## 5 In barrels and kegs 1245125 1245143 1245125 1245143
## 6 Tax Determined, Pr… 5989 5267 5989 5267
## 7 Sub Total Taxable 12822933 13159332 12822933 13159332
## 8 Tax-free NA NA NA NA
## 9 For export 264669 224066 264669 224066
## 10 For vessels and ai… 0 0 0 0
## # … with 17 more rows
This is actually two datasets that are combined into one large reporting table. As such we need to identify the specific row/point to split the dataset at. We can filter to just the row that matches either the string “MATERIALS USED” or “IN POUNDS”, as that indicates a label starting the 2nd dataset.
slice_num <- beer_df %>%
# find a string that has MATERIALS USED or IN POUNDS
# | means OR
filter(str_detect(type, "MATERIALS USED|IN POUNDS")) %>%
pull(name)
slice_num
## [1] 16
Next we will add a column based on logic for the slice_num
, and assign a grouping variable for either Barrels Produced
(dataset 1) or Pounds of Materials Used
(dataset 2). We can then drop the unneeded rows with a filter()
, group_by(
) the newly produced grouping variable, and use dplyr::group_split()
to separate the combined dataset into a list of both datasets.
# split data into materials vs barrels produced
split_df <- beer_df %>%
mutate(data_type = ifelse(name >= slice_num, "Pounds of Materials Used", "Barrels Produced"),
type = str_remove(type, ":")) %>%
select(data_type, everything(), -name) %>%
filter(!str_detect(type, "IN POUNDS|MATERIALS USED|MANUFACTURE OF BEER|BARRELS")) %>%
group_by(data_type) %>%
group_split()
glimpse(split_df)
## list<df[,8]> [1:2]
## $ : tibble [14 × 8] (S3: tbl_df/tbl/data.frame)
## $ : tibble [12 × 8] (S3: tbl_df/tbl/data.frame)
## @ ptype: tibble [0 × 8] (S3: tbl_df/tbl/data.frame)
We can see that the split_df
object is a list of 2 tibbles/dataframes. We can now operate on the individual dataframes and finalize the factor cleaning and assignment to make the data a bit tidier and analysis ready.
manufacture_df <- split_df[[1]] %>%
mutate(
tax_status = case_when(
type %in% c("In bottles and cans", "In kegs", "In barrels and kegs",
"Tax Determined, Premises Use") ~ "Taxable",
type == "Sub Total Taxable" ~ "Sub Total Taxable",
type %in% c("For export", "For vessels and aircraft",
"Consumed on brewery premises") ~ "Tax Free",
type == "Sub Total Tax-Free" ~ "Sub Total Tax-Free",
type %in% c("Production", "Total Removals",
"Stocks On Hand end-of-month:") ~ "Totals"
),
tax_rate = dplyr::if_else(year <= 2017, "$7/$18 per barrel", "$3.50/$16 per barrel")
) %>%
filter(!is.na(tax_status)) %>%
select(data_type, tax_status, everything())
# clean up the material dataset
material_df <- split_df[[2]] %>%
mutate(
material_type = case_when(
str_detect(type, "Malt|Corn|Rice|Barley|Wheat") ~ "Grain Products",
str_detect(type, "Sugar|Hops|Other") ~ "Non-Grain Products",
str_detect(type, "Total") ~ type
)
) %>%
select(data_type, material_type, everything())
The manufacture dataframe now has the labels, factors, etc separated into nice columns, with the 4x columns for specific barrels produced.
manufacture_df
## # A tibble: 9 x 10
## data_type tax_status year month type month_current month_prior_year
## <chr> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 Barrels … Totals 2011 2 Prod… 14981472 15012331
## 2 Barrels … Taxable 2011 2 In b… 11571819 11908922
## 3 Barrels … Taxable 2011 2 In b… 1245125 1245143
## 4 Barrels … Sub Total… 2011 2 Sub … 12822933 13159332
## 5 Barrels … Tax Free 2011 2 For … 264669 224066
## 6 Barrels … Tax Free 2011 2 For … 0 0
## 7 Barrels … Tax Free 2011 2 Cons… 886 913
## 8 Barrels … Sub Total… 2011 2 Sub … 265555 224979
## 9 Barrels … Totals 2011 2 Tota… 13088488 13384311
## # … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
## # tax_rate <chr>
The material dataframe now has the labels, factors, etc separated into nice columns, with the 4x columns for specific pounds of product used.
material_df
## # A tibble: 12 x 9
## data_type material_type year month type month_current month_prior_year
## <chr> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 Pounds o… Grain Produc… 2011 2 Malt… 322480722 330304432
## 2 Pounds o… Grain Produc… 2011 2 Corn… 58632672 56705162
## 3 Pounds o… Grain Produc… 2011 2 Rice… 108112318 59701345
## 4 Pounds o… Grain Produc… 2011 2 Barl… 4705175 3668374
## 5 Pounds o… Grain Produc… 2011 2 Whea… 1210137 1409685
## 6 Pounds o… Total Grain … 2011 2 Tota… 495141024 451788998
## 7 Pounds o… Non-Grain Pr… 2011 2 Suga… 73793509 47308358
## 8 Pounds o… Non-Grain Pr… 2011 2 Hops… 6059066 4765924
## 9 Pounds o… Non-Grain Pr… 2011 2 Hops… 296605 271405
## 10 Pounds o… Non-Grain Pr… 2011 2 Other 7972930 10537742
## 11 Pounds o… Total Non-Gr… 2011 2 Tota… 88122110 62883429
## 12 Pounds o… Total Used 2011 2 Tota… 583263134 514672427
## # … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>
We have now finished cleaning the manufacting and material dataframes! However, we did this all line-by-line without functions and would need to repeat this for the other 47 PDFs! Let’s convert ALL that code into a function that outputs the final dataframes.
# create a function that works for most years
get_beer_tables <- function(year, month) {
# read in the raw PDF
raw_text <- pdftools::pdf_text(glue::glue("pdfs/ttb_monthly_stats_{year}-{month}.pdf")) %>%
str_split("\n") %>%
unlist()
## Build Table
# find start of table
table_start <- stringr::str_which(raw_text, "MANUFACTURE OF BEER")
# End of table (drop all the asterisks and the other descriptors)
table_end <- stringr::str_which(raw_text, "Total Used")
# Trim the table to the start/end and drop whitespace at each line
table_trimmed <- raw_text[table_start:table_end] %>%
str_trim()
table_trimmed
### Remove all the extra whitespace
# Replace long spaces with a col break symbol
squished_table <- str_replace_all(table_trimmed, "\\s{2,}", "|") %>%
str_remove_all(",")
### Convert to tibble
# Convert to tibble
raw_df <- enframe(squished_table)
# split the rows into their columns
beer_df <- suppressWarnings(raw_df %>%
separate(value,
into = c("type", "month_current", "month_prior_year", "ytd_current", "ytd_prior_year"),
sep = "\\|"
) %>%
slice(-1) %>%
mutate_at(vars(month_current:ytd_prior_year), as.double) %>%
mutate(year = as.integer(year), month = as.integer(month)) %>%
select(year, month, type, everything()))
### Proper Cleaning
# ID the specific row/point to split the dataset at.
slice_num <- beer_df %>%
# find a string that has MATERIALS USED or IN POUNDS
# | means OR
filter(str_detect(type, "MATERIALS USED|IN POUNDS")) %>%
pull(name)
#### Split dataframe
# split data into materials vs barrels produced
split_df <- suppressWarnings(beer_df %>%
mutate(
data_type = ifelse(name >= slice_num, "Pounds of Materials Used", "Barrels Produced"),
type = str_remove(type, ":")
) %>%
select(data_type, everything(), -name) %>%
filter(!str_detect(type, "IN POUNDS|MATERIALS USED|MANUFACTURE OF BEER|BARRELS")) %>%
group_by(data_type) %>%
group_split())
#### Factor cleaning and final dataframes
# clean manufacture df
manufacture_df <- split_df[[1]] %>%
mutate(
tax_status = case_when(
type %in% c(
"In bottles and cans", "In kegs", "In barrels and kegs",
"Tax Determined, Premises Use"
) ~ "Taxable",
type == "Sub Total Taxable" ~ "Sub Total Taxable",
type %in% c(
"For export", "For vessels and aircraft",
"Consumed on brewery premises"
) ~ "Tax Free",
type == "Sub Total Tax-Free" ~ "Sub Total Tax-Free",
type %in% c(
"Production", "Total Removals",
"Stocks On Hand end-of-month:"
) ~ "Totals"
),
tax_rate = dplyr::if_else(year <= 2017, "$7/$18 per barrel", "$3.50/$16 per barrel")
) %>%
filter(!is.na(tax_status)) %>%
select(data_type, tax_status, everything())
# clean up the material dataset
material_df <- split_df[[2]] %>%
mutate(
material_type = case_when(
str_detect(type, "Malt|Corn|Rice|Barley|Wheat") ~ "Grain Products",
str_detect(type, "Sugar|Hops|Other") ~ "Non-Grain Products",
str_detect(type, "Total") ~ type
)
) %>%
select(data_type, material_type, everything())
# output a list of both dfs
list(manufacture_df, material_df)
}
Really the only code we have changed is we added a glue
call to add the year, month to which PDF to read in, and we have the output as a list of both dataframes. Let’s test our function!
get_beer_tables(2011, "01")
## [[1]]
## # A tibble: 9 x 10
## data_type tax_status year month type month_current month_prior_year
## <chr> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 Barrels … Totals 2011 1 Prod… 14981472 15012331
## 2 Barrels … Taxable 2011 1 In b… 11571819 11908922
## 3 Barrels … Taxable 2011 1 In b… 1245125 1245143
## 4 Barrels … Sub Total… 2011 1 Sub … 12822933 13159332
## 5 Barrels … Tax Free 2011 1 For … 264669 224066
## 6 Barrels … Tax Free 2011 1 For … 0 0
## 7 Barrels … Tax Free 2011 1 Cons… 886 913
## 8 Barrels … Sub Total… 2011 1 Sub … 265555 224979
## 9 Barrels … Totals 2011 1 Tota… 13088488 13384311
## # … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
## # tax_rate <chr>
##
## [[2]]
## # A tibble: 12 x 9
## data_type material_type year month type month_current month_prior_year
## <chr> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 Pounds o… Grain Produc… 2011 1 Malt… 322480722 330304432
## 2 Pounds o… Grain Produc… 2011 1 Corn… 58632672 56705162
## 3 Pounds o… Grain Produc… 2011 1 Rice… 108112318 59701345
## 4 Pounds o… Grain Produc… 2011 1 Barl… 4705175 3668374
## 5 Pounds o… Grain Produc… 2011 1 Whea… 1210137 1409685
## 6 Pounds o… Total Grain … 2011 1 Tota… 495141024 451788998
## 7 Pounds o… Non-Grain Pr… 2011 1 Suga… 73793509 47308358
## 8 Pounds o… Non-Grain Pr… 2011 1 Hops… 6059066 4765924
## 9 Pounds o… Non-Grain Pr… 2011 1 Hops… 296605 271405
## 10 Pounds o… Non-Grain Pr… 2011 1 Other 7972930 10537742
## 11 Pounds o… Total Non-Gr… 2011 1 Tota… 88122110 62883429
## 12 Pounds o… Total Used 2011 1 Tota… 583263134 514672427
## # … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>
Boom! Function is working for our example, let’s try it out with more than 1 input via purrr
!
purrr
- iteration without repetitionWe’ll be using pmap()
to apply our function multiple times, where pmap
can take any number of inputs. For example if we call get_beer_tables()
via pmap
, we can get our tables for that 1 year/month combo!
# Quick test of purrr
pmap(list(2011, "02"), get_beer_tables)
## [[1]]
## [[1]][[1]]
## # A tibble: 9 x 10
## data_type tax_status year month type month_current month_prior_year
## <chr> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 Barrels … Totals 2011 2 Prod… 14350832 14297845
## 2 Barrels … Taxable 2011 2 In b… 11509264 11847648
## 3 Barrels … Taxable 2011 2 In b… 1234147 1308677
## 4 Barrels … Sub Total… 2011 2 Sub … 12749447 13161061
## 5 Barrels … Tax Free 2011 2 For … 279150 263721
## 6 Barrels … Tax Free 2011 2 For … 0 0
## 7 Barrels … Tax Free 2011 2 Cons… 942 719
## 8 Barrels … Sub Total… 2011 2 Sub … 280092 264440
## 9 Barrels … Totals 2011 2 Tota… 13029539 13425501
## # … with 3 more variables: ytd_current <dbl>, ytd_prior_year <dbl>,
## # tax_rate <chr>
##
## [[1]][[2]]
## # A tibble: 12 x 9
## data_type material_type year month type month_current month_prior_year
## <chr> <chr> <int> <int> <chr> <dbl> <dbl>
## 1 Pounds o… Grain Produc… 2011 2 Malt… 307076591 305543380
## 2 Pounds o… Grain Produc… 2011 2 Corn… 53981943 54486996
## 3 Pounds o… Grain Produc… 2011 2 Rice… 54287863 55198940
## 4 Pounds o… Grain Produc… 2011 2 Barl… 4322047 3726630
## 5 Pounds o… Grain Produc… 2011 2 Whea… 955671 1683401
## 6 Pounds o… Total Grain … 2011 2 Tota… 420624115 420639347
## 7 Pounds o… Non-Grain Pr… 2011 2 Suga… 63374850 46718854
## 8 Pounds o… Non-Grain Pr… 2011 2 Hops… 7617974 4899463
## 9 Pounds o… Non-Grain Pr… 2011 2 Hops… 275963 268589
## 10 Pounds o… Non-Grain Pr… 2011 2 Other 7997916 9022834
## 11 Pounds o… Total Non-Gr… 2011 2 Tota… 79266703 60909740
## 12 Pounds o… Total Used 2011 2 Tota… 499890818 481549087
## # … with 2 more variables: ytd_current <dbl>, ytd_prior_year <dbl>
However our goal is all the inputs at once! We can create a vector of the month inputs as character strings, and then use tidyr::crossing()
to output all the possible combinations of year + month as a dataframe. Notice two columns, year and month with a length of 48 - equal to all of our PDFs!
# add the month_num as vector
month_num <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
# use crossing to generate all combos for the data
# 2010 is missing, but as the data has prior year data we can theoretically
# add it back in after the fact
crossing(
year = c(2011:2014),
month = month_num
) %>% glimpse()
## Rows: 48
## Columns: 2
## $ year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 201…
## $ month <chr> "01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11…
We can use tidyr::crossing()
again to generate the possible inputs and create the output dataframes as list column of two dataframes. Running this takes only about 2 seconds across the 48 PDFs! The output is not very exciting as the data is simply the year & month columns, plus a list-column called data. Let’s get the final outputs!
# add the month_num as vector
month_num <- c("01", "02", "03", "04", "05", "06", "07", "08", "09", "10", "11", "12")
# use crossing to generate all combos for the data
df_2011_2014 <- crossing(year = c(2011:2014),
month = month_num) %>%
mutate(data = pmap(., get_beer_tables))
df_2011_2014
## # A tibble: 48 x 3
## year month data
## <int> <chr> <list>
## 1 2011 01 <list [2]>
## 2 2011 02 <list [2]>
## 3 2011 03 <list [2]>
## 4 2011 04 <list [2]>
## 5 2011 05 <list [2]>
## 6 2011 06 <list [2]>
## 7 2011 07 <list [2]>
## 8 2011 08 <list [2]>
## 9 2011 09 <list [2]>
## 10 2011 10 <list [2]>
## # … with 38 more rows
We can now get just the output data, drop the other columns. We’re still working with list-columns, so let’s get to the manufacture_df
and material_df
.
final_output <- df_2011_2014 %>%
# grab the data into respective columns
mutate(manufacture_data = map(data, 1),
material_data = map(data, 2)) %>%
select(manufacture_data, material_data)
final_output
## # A tibble: 48 x 2
## manufacture_data material_data
## <list> <list>
## 1 <tibble [9 × 10]> <tibble [12 × 9]>
## 2 <tibble [9 × 10]> <tibble [12 × 9]>
## 3 <tibble [9 × 10]> <tibble [12 × 9]>
## 4 <tibble [9 × 10]> <tibble [12 × 9]>
## 5 <tibble [9 × 10]> <tibble [12 × 9]>
## 6 <tibble [9 × 10]> <tibble [12 × 9]>
## 7 <tibble [9 × 10]> <tibble [12 × 9]>
## 8 <tibble [9 × 10]> <tibble [12 × 9]>
## 9 <tibble [9 × 10]> <tibble [12 × 9]>
## 10 <tibble [9 × 10]> <tibble [12 × 9]>
## # … with 38 more rows
The manufacture dataframe can be combined as below.
# Grab just the manufacture data
manufacture_df <- final_output %>%
select(manufacture_data) %>%
unnest(manufacture_data)
# Grab just the manufacture data
material_df <- final_output %>%
select(material_data) %>%
unnest(material_data)
And now we can look at the outputs!
glimpse(manufacture_df)
## Rows: 432
## Columns: 10
## $ data_type <chr> "Barrels Produced", "Barrels Produced", "Barrels Pro…
## $ tax_status <chr> "Totals", "Taxable", "Taxable", "Sub Total Taxable",…
## $ year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ type <chr> "Production", "In bottles and cans", "In barrels and…
## $ month_current <dbl> 14981472, 11571819, 1245125, 12822933, 264669, 0, 88…
## $ month_prior_year <dbl> 15012331, 11908922, 1245143, 13159332, 224066, 0, 91…
## $ ytd_current <dbl> 14981472, 11571819, 1245125, 12822933, 264669, 0, 88…
## $ ytd_prior_year <dbl> 15012331, 11908922, 1245143, 13159332, 224066, 0, 91…
## $ tax_rate <chr> "$7/$18 per barrel", "$7/$18 per barrel", "$7/$18 pe…
glimpse(material_df)
## Rows: 576
## Columns: 9
## $ data_type <chr> "Pounds of Materials Used", "Pounds of Materials Use…
## $ material_type <chr> "Grain Products", "Grain Products", "Grain Products"…
## $ year <int> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011…
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2…
## $ type <chr> "Malt and malt products", "Corn and corn products", …
## $ month_current <dbl> 322480722, 58632672, 108112318, 4705175, 1210137, 49…
## $ month_prior_year <dbl> 330304432, 56705162, 59701345, 3668374, 1409685, 451…
## $ ytd_current <dbl> 322480722, 58632672, 108112318, 4705175, 1210137, 49…
## $ ytd_prior_year <dbl> 330304432, 56705162, 59701345, 3668374, 1409685, 451…
Now all of that could have been done in about 6 lines of tidyverse
code since we created a function.
# Use crossing to generate all 48 combos for the data
# Use purrr to read in, clean, and output the 96 tables from the 48 PDFs
final_output <- crossing(year = c(2011:2014), month = month_num) %>%
mutate(data = pmap(., get_beer_tables)) %>%
mutate(manufacture_data = map(data, 1), material_data = map(data, 2)) %>%
select(manufacture_data, material_data)
# Grab just the manufacture data
manufacture_df <- final_output %>% select(manufacture_data) %>% unnest(manufacture_data)
# Grab just the manufacture data
material_df <- final_output %>% select(material_data) %>% unnest(material_data)