Pdftools

Introducing pdftools - A fast and portable PDF extractor

Scientific articles are typically locked away in PDF format, a format designed primarily for printing but not so great for searching or indexing. The new pdftools package allows for extracting text and metadata from PDF files in R. From the extracted plain text, one could find articles discussing a particular drug or species name without having to rely on publishers providing metadata or on pay-walled search engines.

The pdftools package slightly overlaps with the Rpoppler package by Kurt Hornik. The main motivation behind developing pdftools was that Rpoppler depends on glib, which does not work well on Mac and Windows. The pdftools package uses the poppler C++ interface together with Rcpp, which results in a lighter and more portable implementation.

Installing pdftools

install.packages("pdftools")

Getting started

The ?pdftools manual page shows a brief overview of the main utilities. The most important function is pdf_text, which returns a character vector of length equal to the number of pages in the PDF. Each string in the vector contains a plain-text version of the text on that page.

library(pdftools)
download.file("http://arxiv.org/pdf/1403.2805.pdf", "1403.2805.pdf", mode = "wb")
txt <- pdf_text("1403.2805.pdf")

# first page text
cat(txt[1])

# second page text
cat(txt[2])
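
Because pdf_text returns one string per page, finding the pages that mention a term comes down to ordinary string matching on a character vector. The sketch below is only an illustration: find_pages is a hypothetical helper (not part of pdftools), and the pages vector is a hand-written stand-in for the output of pdf_text so that the snippet runs without a PDF.

```r
# Hypothetical helper, not part of pdftools: return the page numbers
# whose text contains `term` (case-insensitive, literal match).
find_pages <- function(pages, term) {
  which(grepl(tolower(term), tolower(pages), fixed = TRUE))
}

# Stand-in for: txt <- pdf_text("1403.2805.pdf")
pages <- c("The jsonlite package maps JSON to R.",
           "Related work and background.",
           "Converting data frames to json arrays.")

hits <- find_pages(pages, "JSON")
print(hits)
```

With real data you would pass txt instead of pages; the same helper could be applied to each file in a folder of papers.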

In addition, the package has some utilities to extract other data from the PDF file. The pdf_toc function shows the table of contents, i.e. the section headers which pdf readers usually display in a menu on the left. It looks pretty in JSON:

# Table of contents
toc <- pdf_toc("1403.2805.pdf")

# Show as JSON
jsonlite::toJSON(toc, auto_unbox = TRUE, pretty = TRUE)
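
If you would rather have the section headers as a flat character vector than as JSON, a small recursive walk will do. The sketch below assumes the table of contents is a nested list in which each node has a title string and a children list; toc_titles is a hypothetical helper, and the toc object is a hand-built stand-in with that assumed shape rather than real pdf_toc output.

```r
# Hypothetical helper, not part of pdftools: walk a nested list whose
# nodes have a `title` string and a `children` list, and return the
# titles indented by depth.
toc_titles <- function(node, depth = 0) {
  own <- if (length(node$title) == 1 && nzchar(node$title)) {
    paste0(strrep("  ", depth), node$title)
  } else {
    character(0)
  }
  # Children of an untitled root stay at the same depth
  next_depth <- if (length(own)) depth + 1 else depth
  kids <- unlist(lapply(node$children, toc_titles, depth = next_depth))
  c(own, kids)
}

# Hand-built stand-in for: toc <- pdf_toc("1403.2805.pdf")
toc <- list(title = "", children = list(
  list(title = "Introduction", children = list(
    list(title = "Motivation", children = list())
  )),
  list(title = "Conclusion", children = list())
))

titles <- toc_titles(toc)
cat(titles, sep = "\n")
```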

Other functions provide information about fonts, attachments and metadata such as the author, creation date or tags.

# Author, version, etc
info <- pdf_info("1403.2805.pdf")

# Table with fonts
fonts <- pdf_fonts("1403.2805.pdf")

Bonus feature: rendering pdf

A bonus feature on most platforms is rendering of PDF files to bitmap arrays. The poppler library provides all functionality to implement a complete PDF reader, including graphical display of the content. In R we can use pdf_render_page to render a page of the PDF into a bitmap, which can be stored as e.g. png or jpeg.

# renders pdf to bitmap array
bitmap <- pdf_render_page("1403.2805.pdf", page = 1)

# save bitmap image
png::writePNG(bitmap, "page.png")
jpeg::writeJPEG(bitmap, "page.jpeg")
webp::write_webp(bitmap, "page.webp")

This feature now works on all platforms.

Limitations

Data scientists are often interested in data from tables. Unfortunately the PDF format is pretty dumb and has no notion of a table (unlike, for example, HTML). Tabular data in a PDF file is nothing more than strategically positioned lines and text, which makes it difficult to extract the raw data.

txt <- pdf_text("http://arxiv.org/pdf/1406.4806.pdf")

# some tables
cat(txt[18])
cat(txt[19])

Pdftools usually does a decent job of retaining the positioning of table elements when converting from PDF to text. However, the output still depends heavily on the formatting of the original table, which makes it very difficult to write a generic table extractor. With a little creativity you might be able to parse the table data from the text output of a given paper.
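
As one example of that kind of creativity, columns in the text output are often separated by runs of several spaces, so splitting each line on two or more spaces can recover the cells. The sketch below is an illustration under strong assumptions: parse_table_text is a hypothetical helper (not part of pdftools), the page string is made up rather than taken from a real paper, and every row is assumed to have the same number of cells.

```r
# Hypothetical helper, not part of pdftools: split one page of text
# into rows and columns, treating runs of 2+ spaces as separators.
parse_table_text <- function(page_text) {
  lines <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  lines <- trimws(lines)
  lines <- lines[nzchar(lines)]          # drop empty lines
  cells <- strsplit(lines, " {2,}")      # one character vector per row
  header <- cells[[1]]
  body <- do.call(rbind, cells[-1])      # assumes equal cell counts
  out <- as.data.frame(body, stringsAsFactors = FALSE)
  names(out) <- header
  out
}

# Made-up page text standing in for one element of pdf_text() output
page <- paste0("  Species        Count   Mean length\n",
               "  Salmo trutta   12      31.5\n",
               "  Esox lucius    4       58.2\n")

tab <- parse_table_text(page)
tab$Count <- as.integer(tab$Count)
print(tab)
```

Real tables tend to need extra handling for wrapped cells, empty cells and multi-line headers, so treat this as a starting point for one specific paper rather than a general extractor.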

PDF utilities
Extracting text, fonts, attachments and metadata from a pdf file.

Usage
    pdf_info(pdf, opw = "", upw = "")
    pdf_text(pdf, opw = "", upw = "")
    pdf_fonts(pdf, opw = "", upw = "")
    pdf_attachments(pdf, opw = "", upw = "")
    pdf_toc(pdf, opw = "", upw = "")

Arguments
    pdf                  file path or raw vector with pdf data
    opw                  string with owner password to open pdf
    upw                  string with user password to open pdf

Examples
    # Just a random pdf file
    pdf_file <- file.path(R.home("doc"), "NEWS.pdf")
    info <- pdf_info(pdf_file)
    text <- pdf_text(pdf_file)
    fonts <- pdf_fonts(pdf_file)
    files <- pdf_attachments(pdf_file)

pdf_render_page    Render / Convert PDF

Description
    High quality conversion of pdf page(s) to png, jpeg or tiff format, or render into a raw bitmap array
    for further processing in R. This functionality is only available if libpoppler was compiled with
    cairo support.

Usage
    pdf_render_page(pdf, page = 1, dpi = 72, numeric = FALSE, opw = "", upw = "")
    pdf_convert(pdf, format = "png", pages = NULL, filenames = NULL, dpi = 72, opw = "", upw = "", verbose = TRUE)
    poppler_config()

Arguments
    pdf                file path or raw vector with pdf data
    page               which page to render
    dpi                resolution (dots per inch) to render
    numeric            convert raw output to (0-1) real values
    opw                owner password
    upw                user password
    format             string with output format such as "png" or "jpeg". Must be equal to one of
                       poppler_config()$supported_image_formats.
    pages              vector with one-based page numbers to render. NULL means all pages.
    filenames          vector of equal length to pages with output filenames. May also be a format
                       string which is expanded using pages and format respectively.
    verbose            print some progress info to stdout
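
To make the filenames format string more concrete, here is a sketch of how such a string could expand. It assumes the expansion behaves like base R's sprintf, with the page number substituted first and the image format second; that reading follows from the argument description above but is not confirmed here, and the "page_%03d.%s" pattern is just an illustrative choice.

```r
# Assumption: the format string is expanded in the style of sprintf(),
# with the page number first and the image format second.
pages <- 1:3
format <- "png"
filenames <- sprintf("page_%03d.%s", pages, format)
print(filenames)
```

Zero-padding the page number (%03d) keeps the output files in page order when they are sorted alphabetically.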

Examples
    # Rendering should be supported on all platforms now
    if (poppler_config()$can_render) {
      # convert few pages to png
      file.copy(file.path(Sys.getenv("R_DOC_DIR"), "NEWS.pdf"), "news.pdf")
      pdf_convert("news.pdf", pages = 1:3)

      # render into raw bitmap
      bitmap <- pdf_render_page("news.pdf")

      # save to bitmap formats
      png::writePNG(bitmap, "page.png")
      jpeg::writeJPEG(bitmap, "page.jpeg")
      webp::write_webp(bitmap, "page.webp")

      # Higher quality
      bitmap <- pdf_render_page("news.pdf", page = 1, dpi = 300)
      png::writePNG(bitmap, "page.png")

      # slightly more efficient
      bitmap_raw <- pdf_render_page("news.pdf", numeric = FALSE)
      webp::write_webp(bitmap_raw, "page.webp")
    }
