Diving into the data — ChiPy Blog #2

8 min readOct 19, 2018

In my last blog post for this project, I was hoping to have wrapped up data analysis for the Chicago parking ticket data. Alas, I was so fresh, naive, and optimistic. I should know by now — things never turn out the way you plan.

Working with data takes time. You find you have ideas that are dead ends, or that your data is incomplete for your purposes. And when you’re getting the hang of a new coding language at the same time, progress sometimes looks more like what you tried and failed to do than what you have successfully pulled off. With that introduction, I’ll dive right into where my Pythonic efforts have taken me and how each little stop gets me closer to achieving some presentable results.

Diving in

Since the complete dataset of ticket data contains over 10 years of data and over 20 million rows of data, I started my initial exploration by working with a 50,000 row sample data from the year 2015.

The fact that it’s a sample and not the complete dataset is important and will come up later.

I started out by looking at the section of the data that concerned getting on the boot list. According to the city of Chicago website, the following makes someone eligible for a boot:

Accrues three (3) or more unpaid parking, red light, and/or automated speed enforcement tickets in final determination status.
Accrues two (2) unpaid parking, red light, and/or automated speed enforcement tickets in final determination status older than one year.

Pay attention to final determination status. I bolded it because it will also come up later.

Being on the boot list meant having notice_level = “SEIZ” in the dataset. So I fired up my Jupyter notebook and the pandas library and started by filtering the data for rows that had that value.

tickets[tickets.notice_level == "SEIZ"]

This resulted in 9422 rows of instances of vehicles on the boot list.

I then did a count of which violations in the dataset landed people on the boot list.

tickets.violation_description.value_counts()

The top few rows of output showed the violations that most commonly resulted in getting the boot:

EXPIRED PLATES OR TEMPORARY REGISTRATION              8791
STREET CLEANING                                       6273
EXP. METER NON-CENTRAL BUSINESS DISTRICT              5773
NO CITY STICKER VEHICLE UNDER/EQUAL TO 16,000 LBS.    5183
RESIDENTIAL PERMIT PARKING                            4013
PARKING/STANDING PROHIBITED ANYTIME                   3188
REAR AND FRONT PLATE REQUIRED                         2717
EXPIRED METER CENTRAL BUSINESS DISTRICT               2474
NO STANDING/PARKING TIME RESTRICTED                   1793
RUSH HOUR PARKING                                     1593
WITHIN 15' OF FIRE HYDRANT                            1002

Whoa, looks like most people get on the boot list at least partly because of having expired plates or registration. Good start, but from here my questions led me down a rabbit hole. I wanted to know whether the ticketing process always followed the city protocol of:

(1) 2 –3 unpaid violations

(2) A notice that they intend to find your car and boot it

(3) 21 days after the first notice, if the tickets are unpaid, the car is actually booted

(4) If the tickets remain unpaid 24 hours after a boot, then the car is towed.

The problem was…there was no towing data available. So all I had to go on was the 5 levels of notice_level:

“VIOL,” which means a notice of violation was sent
“SEIZ” indicates the vehicle is on the city’s boot list
“DETR” indicates a hearing officer found the vehicle owner was found liable for the citation
“FINL” indicates the unpaid ticket was sent to collections
“DLS” means the city intends to seek a license suspension.

I was curious about instances in which cars were towed directly instead of being put on the boot list first, but without the towing data, there was no way I could follow cars as they moved from “SEIZ” status to towed status.

But alas, still I tried.

What is “final determination” and where is it hiding in this data?

The issue was that I couldn’t determine if the “final determination status” that made someone eligible for the boot list meant the “FINL” or “DETR”. I settled on “FINL” and tried to narrow the data set down to license plate numbers that had one of those two statuses.

#this is creating a list of all the license plate numbers that have a boot warrant
tickets_SEIZ = tickets["license_plate_number"][tickets.notice_level == "SEIZ"]#and all those that have a "FINAL" or sent to collections. 
tickets_DETR = tickets["license_plate_number"][tickets.notice_level == "FINL"]

Then I checked out how many license plates each status had.

len(tickets_SEIZ)
len(tickets_FINL)

SEIZ had about 9000 plates, while FINL had about 4000. That told me that FINL wasn’t the criteria I thought it was. This data sample indicated that you didn’t need to rack up FINL tickets sent to collections to get on the boot list, since the FINL number was so much lower than SEIZ.

To explore further, I wanted to see how many violations each license plate got for either FINL or SEIZ. The following two lines of code return a list of license plates with a count of how many times they appeared.

#this returns the number of FINL violations per license plate number
count_FINL = tickets_FINL.value_counts()#this returns the number of SEIZ violations per license plate number
count_SEIZ = tickets_SEIZ.value_counts()

FINL vs. DETR

Since FINL didn’t seem to be making sense, I ran the whole thing over again, but this time comparing DETR and SEIZ. DETR returned about 6000 license plates, which was more promising than FINL, but still not what I was looking for. If you need 2 — 3 violations to get to SEIZ, I’d expect at least double the amount of DETR to SEIZ records. I wasn’t seeing that here, so I suspected that either both DETR and FINL were not what I needed to track the journey to the boot list OR that the random sample wasn’t random enough for my purposes.

I was ready to give up, but just for fun, I made a dataframe with just the SEIZ and DETR records:

tickets_SD = tickets[(tickets.notice_level == "SEIZ") | (tickets.notice_level == "DETR")]

Then I grouped them to visually compare then in a table (tickets_SD)

tickets_SD = tickets[(tickets.notice_level == "SEIZ") | (tickets.notice_level == "DETR")]

Fun with anonymized data

Another thing I realized is that the anonymized, hashed license plates numbers may not be completely unique, meaning that it would be impossible to follow individual vehicles through their boot list journey. The thing that made me suspect this was this one license plate number that showed up as having 1383 instances of boot list notifications in the data set. Since this was a 50,000 row sample from only one year, 2015, it seemed impossible to get that many notifications in such a short period of time.

I decided to check that one out:

scofflaw=tickets[tickets.license_plate_number=='603e09c12c607a2ecfdc8062d4120edd10b2f5499d76fb4cc5d7a8ec73f9e04d']

Once I took a look at “scofflaw,” I discovered that it probably served as a catch-all for miscellaneous tickets, since taking a look at the data showed that there were many different car makes all listed under the same plate number. Since no other license plate number had that many violations, I could keep that one in mind as a catch-all and exclude it from my future analyses.

What’s the gain from all this?

I learned that trying to follow a handful of vehicles with a high number of violations probably wasn’t going to work out too well with a random sample. What would work better for this would be to grab a complete year or other time span of data and try to see how statuses change over that period of time. Since this data is a random sample from a year, I can’t truly follow things over time since the sample is incomplete.

I also learned that I didn’t pick a good way to track how vehicles end up on the boot list. Both the DETR and FINL status don’t seem to match up to the “final determination” status that gets you a SEIZ status after 2–3 chances. Sadly, it doesn’t seem like comparing the notice_level statuses is the way to go to follow a vehicle over time.

My goals!

Further build my skills in pandas. Since I learned R first, it really felt like a “language barrier” to get the data manipulation tactics I learned how to do in the tidyverse translated into pandas. I coped with the help of my mentor, Ed, Google, StackOverflow, The Quick Python Book (that mentees got at the last Project Night!),the Pandas Cookbook, and Cheatsheet. I’d like to get more to the point where I’m using these as references rather than examples.
Instead of looking at SEIZ in terms of its relation to other notice levels, take a look at SEIZ in relation to the total amount of payments that a person has made after getting boot status, plus the current amount due when they land on the boot list. “Follow the money” to see how it relates to getting a boot. I plan to work on this in the sample data and then explore it in the more complete data set after I find something interesting.
Take a look at the most common violations that land people on the boot list and see how they compare to the cash amount for each violation. Are people getting on the boot list more often because they get one or more tickets with really high fines that they can’t afford to pay? How does the boot list stack up with big ticket items vs. smaller ticket items?
Pull a year from the main dataset and try to follow the boot journey for a complete year. Since the main dataset is impossibly large (trying to load it actually crashed both Ed’s computer and mine), Ed walked me through some code he wrote to read in the file and save it by year or by row. The code for reading this huge file in by year (plus some of my comments on what the lines do) is below.

"""
program to read the parking_tickets.csv file and break it into
smaller files by year.Assumes:
 headings are first line in file.
 year is beginning of second element in observation list.
 observations are already in consecutive date order.
 """
fin = open('data//ticket_sample_data.csv', 'r')line = fin.readline() # read first line in fileheadings = line # save first line as headingsfile_year = 'tbd'
 # creates variable for yet-to-be-created output files. 
#change file_year to the year you want to pull from the file.while line: # loop through and process observations
    line = fin.readline()
    if line:
        line_as_list = line.split(sep=',')
        obs_year = (line_as_list[1][0:4])
        if (obs_year == file_year):
            fout.write(line)
        else:
            
            fout = open('data//' + obs_year + '_p_t.csv', 'w') # create the output file    
            fout.write(headings) # write the headings to the new output file
            fout.write(line) # write the current observation to the file
            file_year = obs_year
fout.close()
fin.close()

I’m excited that a bit of automating things with Python has now slipped into this data analysis project too. Sometimes you need to do a little bit of everything to make the thing work.

Now that I have a better idea of the limitations and scope of the data, I can work with what I have and come out with analyses and findings that I can visualize. I’m aiming to wrap up analysis and start building a visualization of my findings for the web in a few weeks.