In this multi-part post, I’ll be explaining how I extracted data from PDF tables of border apprehension data, compiled by the U.S. Customs and Border Protection agency, and used Python and Jupyter notebook to analyze the data.
- Extracting tabular data from PDFs
- Data cleaning
- Analysis and linear correlation
Before I jump into the technical details of this project, let’s take a step back to the beginning of this project, and explain how I got involved with border security data.
In late 2018, I left my job as a Tech Lead for an ‘upskill’ sabbatical, in order to learn more about python and machine learning using jupyter notebook. At the same time, a family friend was trying to extract data from a dataset that he’d received through a U.S. FOIA (Freedom of Information Act) request, containing all illegal border apprehensions from 2009 to 2017. I had collaborated with Dr. James Phelps in the past on another project, but this would be my first time using python on a full-fledged research project.
As of today, Dr. Phelps is continuing his research into discovering country-specific indicators that will predict increased immigration volumes. Supporting that research is where I came in, starting with the seemingly boring task of converting PDF tables into importable tabular text data.
Example of PDF
As you can see, the PDF clearly contains a table of data, with data evident in several columns, as well a few columns with redacted data. Subject name is presumably omitted for NPI reasons, and lat/lon is most likely missing for operational security reasons.
The first goal in this project was to produce clean text data, similar to this example:
APP_DATE BORDER SECTOR CITIZENSHIP BIRTH_YEAR AGE GENDER 10/1/2016 SBO RGV HONDURAS 1999 16 Male 10/1/2016 SBO RGV GUATEMALA 1994 22 Female 10/1/2016 SBO RGV GUATEMALA 1994 22 Male 10/1/2016 SBO RGV GUATEMALA 1984 32 Male 10/1/2016 SBO RGV GUATEMALA 1994 22 Male 10/1/2016 SBO RGV GUATEMALA 1980 35 Male 10/1/2016 SBO RGV EL SALVADOR 1980 35 Male 10/1/2016 SBO RGV GUATEMALA 1972 44 Female 10/1/2016 SBO RGV EL SALVADOR 1989 27 Male 10/1/2016 SBO RGV COLOMBIA 1991 24 Male 10/1/2016 SBO RGV MEXICO 1996 20 Male 10/1/2016 SBO RGV MEXICO 1982 34 Male 10/1/2016 SBO RGV COLOMBIA 1986 30 Male 10/1/2016 SBO RGV MEXICO 1974 41 Male 10/1/2016 SBO RGV EL SALVADOR 1997 19 Male 10/1/2016 SBO RGV EL SALVADOR 1995 21 Female 10/1/2016 SBO RGV HONDURAS 1984 32 Female
At first, I tried using python libraries to directly read table data from PDFs. Camelot seemed the most promising. Unfortunately, this didn’t seem to support the specific format of the FOIA PDFs. In the end, I found a working solution with Xpdf, an open-source toolkit.
By using the following command, semi-usable text data was successfully extracted:
xpdf-tools-win-4.00\bin64\pdftotext.exe -table "USBP Nationwide APPs FY09_REDACTED.pdf" fy09tables.txt
In the next part of this blog series, I’ll show you how I cleaned up the raw text data into a format that could be easily fed into the python Pandas library.