Wrapping it up — ChiPy Blog #3

This is my third and final blog post for the ChiPy mentorship program.

Last time I had some goals for this point in the program. I think a lot of them were a bit lofty, but good. While I figured I’d be wrapping up analysis by now, the more I poked around with the data, the more questions came up. I’ve learned the lesson that city-collected data isn’t necessarily collected to be analyzed, and working with data that isn’t organized by best practices means that sometimes you have to get creative and switch up your approach. Also, sometimes less is more: the data may not support the sophisticated analysis you had in mind.

Let’s go over my goals and what happened since last time. For each goal, I’ll talk about my results and what I want to do to close out the analysis before the end of my mentorship.

GOAL 1:

  • I wanted to look at the total amount of payments that a person has made after getting boot status and the current amount due when they land on the boot list. “Follow the money” to see how it relates to getting a boot.
  • RESULTS: I couldn’t pin down when “after getting boot status” began, because ticket_queue_date, the date variable tied to the boot (SEIZ) notice, didn’t reliably show how much time had passed between getting a qualifying ticket and reaching boot status.

Let me explain:

For the SEIZ data, I grouped by notice number (which is a unique ID for each notice of being on the boot list) to see the tickets that were associated with getting a boot list notice. Per city code, I expected about 2–3 tickets to land someone on the boot list.

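# Count the distinct values in each column per boot list notice
tickets_SEIZ_notice = tickets_SEIZ_only[['issue_date', 'ticket_queue_date', 'ticket_number', 'notice_number']].groupby(['notice_number']).nunique()
# Show the notices with the most tickets first
tickets_SEIZ_notice.sort_values(by=['ticket_number'], ascending=False)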
Output of the above code showing how many separate tickets and how many individual dates are associated with each notice number.

But it was hard to tell what the ticket_queue_date variable meant, since those dates didn’t match up well with the notices. I would see a single boot list notice covering 4 tickets, each with a different ticket_queue_date (suggesting 4 different notice dates). I suspect ticket_queue_date was not updated consistently: maybe it was set when the ticket first became a violation notice, and/or overwritten when the ticket qualified someone for a move into boot status. The variable was inconsistent enough that I couldn’t trust it to have been updated in a uniform manner.
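One way to put a number on that inconsistency is to count the distinct ticket_queue_date values per notice and flag the notices that have more than one. This is only a minimal sketch, assuming pandas and the column names used above; the rows are made up for illustration and are not from the real dataset.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the post.
tickets_SEIZ_only = pd.DataFrame({
    "notice_number": [101, 101, 101, 102, 102],
    "ticket_number": [1, 2, 3, 4, 5],
    "ticket_queue_date": pd.to_datetime(
        ["2017-01-05", "2017-02-10", "2017-03-15", "2017-04-01", "2017-04-01"]
    ),
})

# Count the distinct queue dates attached to each notice.
dates_per_notice = (
    tickets_SEIZ_only.groupby("notice_number")["ticket_queue_date"].nunique()
)

# Keep only the notices whose tickets do not share a single queue date.
inconsistent = dates_per_notice[dates_per_notice > 1]
share_inconsistent = len(inconsistent) / len(dates_per_notice)

print(inconsistent)
print(f"{share_inconsistent:.0%} of notices have more than one queue date")
```

If that share turned out to be small, it might be reasonable to drop those notices and analyze the rest; if it is large, the variable probably can’t carry a timing analysis on its own.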

Reader, I did analyses with it anyway, but please take these with a huge grain of salt.

I created ticket_time, a variable that attempted to capture the time between getting the ticket (issue_date) and getting on the boot list. Then I used that in a bunch of descriptive stats to see how the time affected the payments.

Looking at the boot list data to see how long people were on it, how much they owed, and how much they paid.

It seems like the 50th percentile of folks in this random sample of the boot list had 71 days between getting the tickets and getting on the boot list, owed about $146 to the city, and hadn’t paid anything. Interestingly, at the 75th percentile, almost a year had elapsed between getting the ticket and getting on the list, but it looks like some, but not all, of the debt had been paid ($146 total payments vs. $183 currently due).
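For anyone who wants to reproduce this kind of summary, here is a rough sketch. It assumes ticket_time is simply ticket_queue_date minus issue_date in days, which is what the description above implies but not necessarily the exact calculation behind the numbers; `boot_df` and its rows are made up for illustration.

```python
import pandas as pd

# Made-up rows for illustration only; column names follow the post.
boot_df = pd.DataFrame({
    "issue_date": pd.to_datetime(
        ["2017-01-01", "2017-01-10", "2017-02-01", "2017-03-01"]
    ),
    "ticket_queue_date": pd.to_datetime(
        ["2017-01-31", "2017-03-11", "2017-02-15", "2018-03-01"]
    ),
    "current_amount_due": [146.0, 0.0, 183.0, 244.0],
    "total_payments": [0.0, 146.0, 0.0, 50.0],
})

# Assumed definition: whole days between the issue date and the queue date.
boot_df["ticket_time"] = (
    boot_df["ticket_queue_date"] - boot_df["issue_date"]
).dt.days

# describe() reports count, mean, std, min, 25%/50%/75% and max per column.
summary = boot_df[["ticket_time", "current_amount_due", "total_payments"]].describe()
print(summary)

# A queue date earlier than the issue date would show up as a negative gap,
# which is a quick sanity check on the date columns.
n_negative = int((boot_df["ticket_time"] < 0).sum())
print("rows with negative ticket_time:", n_negative)
```

Each percentile in the describe() output is computed per column, so the 50% row does not describe one "median person"; it is the median of each variable separately.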

I did a few more analyses like this with the ticket_time variable I created, but felt a little discouraged because its inconsistencies meant I couldn’t fully trust it. But maybe this is still OK for giving me a bigger picture, or at least a hunch, of what’s happening.

GOAL 2:

  • Take a look at the most common violations that land people on the boot list and see how they compare to the cash amount for each violation. Are people getting on the boot list more often because they get one or more tickets with really high fines that they can’t afford to pay? How does the boot list stack up with big ticket items vs. smaller ticket items?
  • RESULTS:

I did distribution plots of total_payments for the full dataset and compared them to the same plot of total_payments for SEIZ data only.

sns.distplot(df['total_payments'])
Total payments for the whole random sample of tickets

It totally seemed like people who got a boot list notice had paid less in total, which makes sense, because 2–3 unpaid tickets means you will get on the boot list!

Total payments for those who got a boot list notice.
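To back up the eyeball comparison of the two plots with numbers, the same summary statistics can be put side by side. This is a sketch only: `notice_level` is an assumed name for the column that holds the notice type (the post doesn’t say how the SEIZ rows were selected), "OTHER" is a placeholder value, and the rows are made up.

```python
import pandas as pd

# Made-up rows; `notice_level` is an assumed column name for the notice type
# and "OTHER" is a placeholder for every non-SEIZ value.
df = pd.DataFrame({
    "notice_level": ["SEIZ", "OTHER", "SEIZ", "OTHER", "OTHER", "SEIZ"],
    "total_payments": [0.0, 60.0, 0.0, 100.0, 50.0, 30.0],
})

is_seiz = df["notice_level"] == "SEIZ"

# Summary statistics for all tickets next to the boot list subset.
comparison = pd.DataFrame({
    "all_tickets": df["total_payments"].describe(),
    "seiz_only": df.loc[is_seiz, "total_payments"].describe(),
})
print(comparison)

# Share of tickets with nothing paid at all, in each group.
unpaid_share_all = (df["total_payments"] == 0).mean()
unpaid_share_seiz = (df.loc[is_seiz, "total_payments"] == 0).mean()
print(f"unpaid: {unpaid_share_all:.0%} overall vs {unpaid_share_seiz:.0%} on the boot list")
```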

In the future it might be cool to take these two graphs and do an interactive overlay, where the graph changes as you select all tickets, just people on the boot list, just people who filed for bankruptcy, etc.
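A static overlay could be a first step toward that. The sketch below draws one normalized histogram per group on shared bins with matplotlib, so the shapes are comparable even though the groups differ in size. The helper `overlay_payments`, the `notice_level` column, the "OTHER" value, and the randomly generated data are all hypothetical stand-ins, not part of the actual analysis.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Randomly generated stand-in data; `notice_level` is an assumed column name.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "notice_level": rng.choice(["SEIZ", "OTHER"], size=500),
    "total_payments": rng.gamma(shape=2.0, scale=40.0, size=500),
})


def overlay_payments(data, groups, column="total_payments", bins=30):
    """Draw one normalized histogram per named group, all on the same bins."""
    # Compute bin edges once from the full data so every group lines up.
    edges = np.histogram_bin_edges(data[column], bins=bins)
    fig, ax = plt.subplots()
    for label, mask in groups.items():
        ax.hist(data.loc[mask, column], bins=edges, density=True,
                alpha=0.5, label=label)
    ax.set_xlabel(column)
    ax.set_ylabel("density")
    ax.legend()
    return fig, ax


# Each entry maps a legend label to a boolean mask selecting that group.
groups = {
    "all tickets": pd.Series(True, index=df.index),
    "boot list (SEIZ)": df["notice_level"] == "SEIZ",
}
fig, ax = overlay_payments(df, groups)
# Call fig.savefig("overlay.png") or plt.show() to actually view the chart.
```

Adding another group, such as people who filed for bankruptcy, would only mean adding another mask to `groups`, which is roughly the shape an interactive selector would need too.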

I didn’t get to animate anything or make things beyond rudimentary charts, but this was a really good experience. I still have a few weeks left in the mentorship so I hope to be able to visualize this better before it’s all officially over. The lesson is that analyzing data always takes much longer than you think.