Data.gov Roulette #1: Satisfaction Survey for the U.S. Citizenship and Immigration Services Customer E-Verify Program
For a while I’ve been meaning to automate some of the decision making behind looking at new data. I enjoy processing different data, making new graphs, and expanding my knowledge of the various python libraries I use, but it can be difficult to constantly have new and personally exciting ideas to visualize. Something like concern over whether there will be enough snow for outdoor activities in northern Wisconsin over Christmas translates into snow depth charts for similarly warm years from the last century.
The desire to get a handle on what policy was discussed in a Republican debate, without subjecting myself to the pain of 3 hours of posturing, leads to some simple text analysis (spoiler: they love war and hate immigrants/taxes).
General curiosity over energy deployment trends (and the desire to cram too much data onto one graph) leads to this mess.
But what about when I don’t have an idea but still want to do some data work? Initially my thought was to make giant lists of topics and modifiers to randomly generate a prompt that would look something like this:
Data Genre: Economic
Data Specific: Food Sales
Food Type: Peanuts
Chart Type: Exploded Pie
It’s probably immediately clear that I might not find sufficient data on peanut sales in Niger (and also, I wouldn’t make a pie graph). This was one of the key issues I had with this idea: many, possibly most, of the combinations I auto-generated would be difficult to find data for. Data collection would become the task, and that wasn’t my intention. There are ways to constrain that problem, such as less comprehensive lists, but I didn’t want to compromise the initial idea, and data collection would still remain a significant portion of the work... enter www.data.gov.
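For illustration, a minimal sketch of that kind of prompt generator could look like the following. The category lists and the `make_prompt` name are hypothetical placeholders invented for this example, not the lists actually considered for the project.

```python
import random

# Hypothetical categories and options; real lists would be much longer.
PROMPT_OPTIONS = {
    "Data Genre": ["Economic", "Environmental", "Demographic"],
    "Chart Type": ["Line", "Bar", "Scatter", "Exploded Pie"],
}


def make_prompt(options, rng=None):
    """Pick one value at random for each category in `options`."""
    rng = rng or random.Random()
    return {category: rng.choice(values) for category, values in options.items()}


if __name__ == "__main__":
    # Print the prompt in the same "Category: Value" layout used above.
    for category, value in make_prompt(PROMPT_OPTIONS).items():
        print(f"{category}: {value}")
```

Nothing in a generator like this knows whether data exists for the combination it produces, which is exactly the weakness described above.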
Data.gov is not something I normally use. It’s updated seemingly at random, it’s not always particularly clear what is and isn’t available, and I typically know where each agency keeps the data I want because I’ve worked with it before. Despite my griping, it is still quite a nice resource to have and I’m excited to see it continue to grow and evolve (unless the next administration kills it off). Once I remembered data.gov existed, I realized I could scrape links from the main listing page, click through at random (with some filtering and controls on which links to click), then check how many datasets were at the clicked-through link and pick one at random. There is definitely an API for data.gov, and I probably should have used it, but I have fun parsing webpages with selenium and python, and since this is a personal project I thought working on some more generalizable skills (rather than their specific API) would be more fruitful. The github repo for this script (and the entire project) is where I’ll be storing the results and making some notes if you want to see how the scraping aspect evolves over time.
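The "filter, then pick at random" step can be sketched independently of the scraping itself. This is only an illustration under stated assumptions: `pick_random_link` and `skip_substrings` are hypothetical names, the example.com URLs are placeholders, and the links are assumed to have already been collected from a page as plain strings. It is not the script in the repo.

```python
import random


def pick_random_link(links, skip_substrings=(), rng=None):
    """Return one link chosen uniformly at random from those passing the filter.

    links: iterable of URL strings already scraped from a listing page.
    skip_substrings: any link containing one of these is never chosen.
    """
    rng = rng or random.Random()
    candidates = [
        link for link in links
        if not any(fragment in link for fragment in skip_substrings)
    ]
    if not candidates:
        raise ValueError("no links left after filtering")
    return rng.choice(candidates)


# Placeholder URLs standing in for scraped links.
listing_links = [
    "https://example.com/topic-a",
    "https://example.com/topic-b",
    "https://example.com/login",
]
chosen = pick_random_link(listing_links, skip_substrings=("login",))
print(chosen)
```

The same function could be called a second time on the dataset links found at the chosen page, which gives the two-stage random draw described above.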
After getting everything together and doing a few test runs to suss out some errors, I found my first dataset, a satisfaction survey for the U.S. Citizenship and Immigration Services Customer E-Verify Program. Not particularly interesting or exciting, with the bonus that the data only exists in pdf form. There's not going to be much of a writeup this week, because I spent the bulk of my time writing this intro and getting the scraping script together, but this topic also isn’t the most interesting, so it’s not a huge issue.
E-Verify itself is a fairly interesting program (and relevant, given the current primary campaign rhetoric on the Republican side) in that its purpose is to verify that someone has the right to work in the United States (i.e., is not an illegal immigrant). First established in 1997, it was expanded in 2007, and now requires all federal contractors to participate. The program is free, available through the internet, and maintained by the federal government. Several states have also required participation (for either specific businesses or all employers). Currently, the program serves more than 400,000 businesses.
The pdf of data (also saved to the github repo just in case) has its main points about actual customer satisfaction covered in the opening paragraphs of E-Verify’s Wikipedia article, so I’m going to focus on the survey response rate in the methodological section instead. Specifically, the percentage of total responses that each state was responsible for.
First, I attempted copying and pasting the response table on page 11 a few different ways. None of them resulted in anything more than pasting the entire table into one cell of an Excel sheet. I ended up splitting it from one cell into a bunch of columns, then using OFFSET to write a little Excel formula to populate the columns I wanted. Both that .xlsx file and the .csv I ultimately used for graphing are in the repo. I use a csv instead of the Excel file because the csv is markedly faster when running any code. Here’s the result:
Ultimately I removed the states with 0% response in both years, and did some data reorganization halfway through when I was having trouble sorting and cleaning individual years of data. Combining the two sets and just naming the columns ‘stat_year’ allowed me to write much simpler code when creating the figure.
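As a rough sketch of that cleanup, here is one way to merge two years into a single set of rows and drop states that sit at 0% in both. It assumes that "stat_year" means column names of the form `<stat>_<year>` (for example a hypothetical `pct_2010`). The `combine_years` name and the sample values are placeholders, not the survey's real numbers or the code in the repo.

```python
def combine_years(by_year):
    """Merge {year: {state: pct}} into {state: {"pct_<year>": pct}}.

    States missing from a year are treated as 0 for that year.
    States that are 0 in every year are dropped.
    """
    states = sorted({state for data in by_year.values() for state in data})
    combined = {}
    for state in states:
        row = {f"pct_{year}": data.get(state, 0.0) for year, data in by_year.items()}
        if any(value != 0 for value in row.values()):
            combined[state] = row
    return combined


# Placeholder values for illustration only.
sample = {
    2010: {"State A": 12.0, "State B": 0.0, "State C": 3.5},
    2011: {"State A": 10.0, "State B": 0.0, "State C": 0.0},
}
result = combine_years(sample)
print(result)
```

With every year living in one row per state, sorting and plotting only need to loop over a single structure instead of juggling separate per-year tables.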
The bulk of the survey response comes from Arizona, California, Missouri, and Texas (33% in 2010, 29% in 2011). Of those four, Missouri is the clear outlier in that it has a smaller population and is not a border state. However, this makes more sense when referring to state-specific laws related to E-Verify, as Missouri has a high rate of participation due to a law verifying legal employment, which Wikipedia states as:
“The law prohibits businesses from knowingly employing, hiring, or continuing to employ an illegal immigrant to perform work within the state of Missouri. The E-Verify portion of the law does not apply to all businesses, but those businesses that do use E-Verify are provided an affirmative defense that the business has not violated the provisions of the law that prohibit the employment of illegal immigrant. All public employers are required to “actively participate” in E-Verify.”
So that’s it for week one. I spent more time getting things up and running this week than I’ll have to in the future, so the writeup is a bit sparse, but hopefully vaguely informative.
I just rolled the dice to end today’s work on this project, and next week’s dataset is the Job Patterns For Minorities And Women In Private Industry, 2009 EEO-1 State Aggregate Report, which sounds a bit more interesting than this week’s data.