Data without borders – Part 1

In this multi-part post, I’ll be explaining how I extracted data from PDF tables of border apprehension data, compiled by the U.S. Customs and Border Protection agency, and used Python and Jupyter notebook to analyze the data.

  1. Extracting tabular data from PDFs
  2. Data cleaning
  3. Analysis and linear correlation

Before I jump into the technical details of this project, let’s take a step back to the beginning of this project, and explain how I got involved with border security data.

In late 2018, I left my job as a Tech Lead for an ‘upskill’ sabbatical, in order to learn more about python and machine learning using jupyter notebook. At the same time, a family friend was trying to extract data from a dataset that he’d received through a U.S. FOIA (Freedom of Information Act) request, containing all illegal border apprehensions from 2009 to 2017. I had collaborated with Dr. James Phelps in the past on another project, but this would be my first time using python on a full-fledged research project.

As of today, Dr. Phelps is continuing his research into discovering country-specific indicators that will predict increased immigration volumes. Supporting that research is where I came in, starting with the seemingly boring task of converting PDF tables into importable tabular text data.

Example of PDF

Screenshot 2019-02-27 at 17.52.02

As you can see, the PDF clearly contains a table of data, with data evident in several columns, as well a few columns with redacted data. Subject name is presumably omitted for NPI reasons, and lat/lon is most likely missing for operational security reasons.

The first goal in this project was to produce clean text data, similar to this example:

APP_DATE  BORDER   SECTOR   CITIZENSHIP  BIRTH_YEAR AGE GENDER
10/1/2016 SBO      RGV      HONDURAS     1999    16 Male        
10/1/2016 SBO      RGV      GUATEMALA    1994    22 Female      
10/1/2016 SBO      RGV      GUATEMALA    1994    22 Male        
10/1/2016 SBO      RGV      GUATEMALA    1984    32 Male        
10/1/2016 SBO      RGV      GUATEMALA    1994    22 Male        
10/1/2016 SBO      RGV      GUATEMALA    1980    35 Male        
10/1/2016 SBO      RGV      EL SALVADOR  1980    35 Male        
10/1/2016 SBO      RGV      GUATEMALA    1972    44 Female      
10/1/2016 SBO      RGV      EL SALVADOR  1989    27 Male        
10/1/2016 SBO      RGV      COLOMBIA     1991    24 Male        
10/1/2016 SBO      RGV      MEXICO       1996    20 Male        
10/1/2016 SBO      RGV      MEXICO       1982    34 Male        
10/1/2016 SBO      RGV      COLOMBIA     1986    30 Male        
10/1/2016 SBO      RGV      MEXICO       1974    41 Male        
10/1/2016 SBO      RGV      EL SALVADOR  1997    19 Male        
10/1/2016 SBO      RGV      EL SALVADOR  1995    21 Female      
10/1/2016 SBO      RGV      HONDURAS     1984    32 Female

At first, I tried using python libraries to directly read table data from PDFs. Camelot seemed the most promising. Unfortunately, this didn’t seem to support the specific format of the FOIA PDFs. In the end, I found a working solution with Xpdf, an open-source toolkit.

By using the following command, semi-usable text data was successfully extracted:

xpdf-tools-win-4.00\bin64\pdftotext.exe -table "USBP Nationwide APPs FY09_REDACTED.pdf" fy09tables.txt

In the next part of this blog series, I’ll show you how I cleaned up the raw text data into a format that could be easily fed into the python Pandas library.