Data without borders – Part 1

In this multi-part post, I’ll be explaining how I extracted data from PDF tables of border apprehension data, compiled by the U.S. Customs and Border Protection agency, and used Python and Jupyter notebook to analyze the data.

  1. Extracting tabular data from PDFs
  2. Data cleaning
  3. Analysis and linear correlation

Before I jump into the technical details, let’s take a step back to the beginning and look at how I got involved with border security data.

In late 2018, I left my job as a Tech Lead for an ‘upskill’ sabbatical to learn more about Python and machine learning using Jupyter Notebook. At the same time, a family friend, Dr. James Phelps, was trying to extract data from a dataset that he’d received through a U.S. FOIA (Freedom of Information Act) request, containing all illegal border apprehensions from 2009 to 2017. I had collaborated with Dr. Phelps in the past on another project, but this would be my first time using Python on a full-fledged research project.

As of today, Dr. Phelps is continuing his research into discovering country-specific indicators that will predict increased immigration volumes. Supporting that research is where I came in, starting with the seemingly boring task of converting PDF tables into importable tabular text data.

Example of PDF

[Screenshot: sample page from the FOIA PDF]

As you can see, the PDF clearly contains a table of data, with data evident in several columns, as well as a few columns with redacted data. Subject name is presumably omitted for NPI reasons, and lat/lon is most likely missing for operational security reasons.

The first goal in this project was to produce clean text data, similar to this example:

APP_DATE  BORDER   SECTOR   CITIZENSHIP  BIRTH_YEAR AGE GENDER
10/1/2016 SBO      RGV      HONDURAS     1999    16 Male        
10/1/2016 SBO      RGV      GUATEMALA    1994    22 Female      
10/1/2016 SBO      RGV      GUATEMALA    1994    22 Male        
10/1/2016 SBO      RGV      GUATEMALA    1984    32 Male        
10/1/2016 SBO      RGV      GUATEMALA    1994    22 Male        
10/1/2016 SBO      RGV      GUATEMALA    1980    35 Male        
10/1/2016 SBO      RGV      EL SALVADOR  1980    35 Male        
10/1/2016 SBO      RGV      GUATEMALA    1972    44 Female      
10/1/2016 SBO      RGV      EL SALVADOR  1989    27 Male        
10/1/2016 SBO      RGV      COLOMBIA     1991    24 Male        
10/1/2016 SBO      RGV      MEXICO       1996    20 Male        
10/1/2016 SBO      RGV      MEXICO       1982    34 Male        
10/1/2016 SBO      RGV      COLOMBIA     1986    30 Male        
10/1/2016 SBO      RGV      MEXICO       1974    41 Male        
10/1/2016 SBO      RGV      EL SALVADOR  1997    19 Male        
10/1/2016 SBO      RGV      EL SALVADOR  1995    21 Female      
10/1/2016 SBO      RGV      HONDURAS     1984    32 Female
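To make the target format concrete, here is a minimal sketch of how one of these rows could be split into named fields using only the Python standard library. The `Apprehension` type and the `parse_row` helper are hypothetical names invented for this illustration, and the sketch assumes that every field except CITIZENSHIP is a single word, which holds for the sample above but may not hold for the full dataset.

```python
from typing import NamedTuple


class Apprehension(NamedTuple):
    # Hypothetical record type mirroring the column headers above.
    app_date: str
    border: str
    sector: str
    citizenship: str
    birth_year: int
    age: int
    gender: str


def parse_row(line: str) -> Apprehension:
    """Split one whitespace-aligned row into named fields."""
    tokens = line.split()
    if len(tokens) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(tokens)}: {line!r}")
    # Assumption: the first three and last three tokens are single words.
    # Whatever is left in the middle is the citizenship, which may contain
    # spaces (for example "EL SALVADOR").
    app_date, border, sector = tokens[:3]
    birth_year, age, gender = tokens[-3:]
    citizenship = " ".join(tokens[3:-3])
    return Apprehension(app_date, border, sector, citizenship,
                        int(birth_year), int(age), gender)


sample = """\
APP_DATE  BORDER   SECTOR   CITIZENSHIP  BIRTH_YEAR AGE GENDER
10/1/2016 SBO      RGV      HONDURAS     1999    16 Male
10/1/2016 SBO      RGV      EL SALVADOR  1980    35 Male
"""

# Skip the header line and any blank lines, then parse the rest.
rows = [parse_row(line) for line in sample.splitlines()[1:] if line.strip()]
for row in rows:
    print(row.citizenship, row.age)
```

Splitting from both ends avoids relying on exact column positions, which can shift from page to page in extracted text.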

At first, I tried using Python libraries to read table data directly from the PDFs. Camelot seemed the most promising, but unfortunately it didn’t seem to support the specific format of the FOIA PDFs. In the end, I found a working solution with Xpdf, an open-source toolkit.

By using the following command, semi-usable text data was successfully extracted:

xpdf-tools-win-4.00\bin64\pdftotext.exe -table "USBP Nationwide APPs FY09_REDACTED.pdf" fy09tables.txt
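If you would rather drive the same command from Python (for example, from a notebook cell), a sketch along these lines should work. The helper names `build_pdftotext_command` and `convert` are made up for this example; the executable path, the `-table` flag, and the file names are the ones from the command above. This is one possible wrapper, not the exact script I used.

```python
import subprocess
from typing import List


def build_pdftotext_command(exe: str, pdf: str, out: str) -> List[str]:
    """Return the argument list for Xpdf's pdftotext in table mode."""
    # Passing a list (rather than one string) lets subprocess handle
    # the spaces in the PDF file name without manual quoting.
    return [exe, "-table", pdf, out]


def convert(exe: str, pdf: str, out: str) -> None:
    """Run pdftotext and raise CalledProcessError if it exits non-zero."""
    subprocess.run(build_pdftotext_command(exe, pdf, out), check=True)


cmd = build_pdftotext_command(
    r"xpdf-tools-win-4.00\bin64\pdftotext.exe",
    "USBP Nationwide APPs FY09_REDACTED.pdf",
    "fy09tables.txt",
)
print(cmd)

# To actually run the conversion (requires Xpdf to be installed locally):
# convert(*cmd[:1], *cmd[2:])
```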

In the next part of this blog series, I’ll show you how I cleaned up the raw text data into a format that could be easily fed into the Python pandas library.


The original Chromebook Pixel is still my favorite laptop

I’ve used many different laptops over the years, starting with the pioneering clamshell laptops made by Grid Systems, a monochrome 8086, up to my current ‘monster’ laptop, a weighty gaming laptop made by Gigabyte, with a 17″ screen, GeForce GTX 1060 video, and more memory and IO than you can shake a stick at.

What’s your favorite laptop? Mine? The Chromebook Pixel, first released by Google in 2013. Offered as a free gift to attendees of Google’s I/O developer conference, this laptop introduced me to the Linux-based Chromebook format. With a fast SSD, high-DPI screen, and built-in 4G modem, this laptop has proven to be the most versatile device I’ve used up until this point.


But wait, you say, this is 2019. Surely there’s a better laptop out there? Hopefully there is. Sadly, the Pixel has reached its end of life and no longer receives regular system updates from Google.

Introducing Anzelmo Dot Net

Zibaldone: an Italian vernacular commonplace book.


Hi, I’m Tony Anzelmo. Welcome to my new blog, Anzelmo Dot Net – my thoughts on coding and data.

Why am I doing this? Long-time lurker, first-time blogger…

After years of procrastinating, I thought it would be good to start keeping a journal of some of the lessons I’ve learned throughout my career in IT and development. Plus, I’ve found that if I don’t write things down, I’ll usually forget them.

This is also my way of giving back. So much of what I’ve learned in development and IT has come through countless articles and Stack Overflow posts.

Who am I?

I’m an Italian-American currently living in London. I’m originally from Kentucky, though I’ve spent most of my adult life living in Colorado, which is where I met my wife of 8 years, Christy.

EDIT: Up until August 2018, I worked as a Technical Lead for a company providing SaaS ticketing services for arts and theater organizations.

Currently I’m wrapping up a Machine Learning sabbatical, focusing on using Python, Jupyter Notebook, and TensorFlow to build an Android-based eye gaze detection app.

In the office I wear most of the hats common to folks who work at small to medium-sized technology companies: architect, data geek, programmer, and machine whisperer.

In my spare time I dabble in Machine Learning and occasionally you’ll see me at hackathons having fun building mobile apps.

Previously I’ve worked at Spektrix, Microsoft, RateSetter, Sporting Solutions, MarkIt on Demand, and L3 Technologies.