ETL + Database

Samuel McCormick
Published in Data Divas · Nov 17, 2017 · 5 min read

As mentioned in the previous sections, there is abundant crime data available for many different cities around the world. We decided to focus on San Francisco as a starting point to develop our models.

We obtained SF crime data through a past Kaggle (my current Chrome homepage) competition, which originally derived it from the SFPD Crime Incident Reporting system. Extracting this data was exceptionally easy: we just clicked the big blue button called “download” to get training and testing data in the appropriate CSV format. You don’t need a master’s in data science for that!

We then started brainstorming to determine factors that could affect the potential occurrence of crime and decided to start with the following data sources: SF eviction data, Open311 data, and weather data (we’re planning to add more as the project continues to develop). These data sets required some cleaning steps, such as creating separate ‘X’ and ‘Y’ columns for the longitude and latitude. This involved writing RegEx code to pull the two values out of a string containing both. We expect to generate something along the lines of a heatmap indicating high potential crime locations, and hence a lot of our work has to do with gathering spatial data and creating visualizations of factors affecting crime using GeoPandas (disclaimer: no pandas were hurt in the making of these visualizations).
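For the curious, here is a minimal sketch of that coordinate-splitting step (the raw column name and the exact string layout are illustrative, not the precise format of every dataset):

```python
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

# Illustrative raw data: a single "Location" column holding "(lat, lon)" strings.
df = pd.DataFrame({"Location": ["(37.762270173991396, -122.41947048851499)"]})

# A regex with two capture groups pulls the two numbers out of the string.
coords = df["Location"].str.extract(r"\(\s*(-?\d+\.\d+)\s*,\s*(-?\d+\.\d+)\s*\)")
df["Y"] = coords[0].astype(float)  # latitude
df["X"] = coords[1].astype(float)  # longitude

# Wrapping the points in a GeoDataFrame lets us plot them with GeoPandas later.
gdf = gpd.GeoDataFrame(df,
                       geometry=[Point(xy) for xy in zip(df["X"], df["Y"])],
                       crs="EPSG:4326")
print(gdf[["X", "Y"]])
```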

Other major factors we expect to work with soon include education levels, poverty levels, and transportation systems for the city/region under consideration. We expect these to be less localized (less dependent on a very specific latitude and longitude) than our current data sources, but still important factors affecting crime.

Having built a framework of what factors we want to include in our model, we ambitiously proceeded to extract a multitude of datasets. In an effort to organize our data, we developed an initial star schema and began loading our data into SQL. As we examined the datasets more closely, we began to notice new challenges our initial schema failed to capture.
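Roughly speaking, that first cut put the date at the center of everything: a fact table of crime incidents plus a dimension table per data source, all meant to join on the date. The sketch below is a simplified stand-in, not our final design; the table and column names, as well as the connection details, are placeholders.

```python
import mysql.connector  # assumed MySQL driver; connection details are placeholders

conn = mysql.connector.connect(host="localhost", user="datadivas",
                               password="***", database="crime")
cur = conn.cursor()

# One dimension table per data source, keyed on the date...
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_weather (
        date          DATE PRIMARY KEY,
        precipitation FLOAT,
        temperature   FLOAT
    )""")
cur.execute("""
    CREATE TABLE IF NOT EXISTS dim_evictions (
        date           DATE PRIMARY KEY,
        eviction_count INT
    )""")

# ...and a central fact table intended to join to every dimension on that same date.
cur.execute("""
    CREATE TABLE IF NOT EXISTS fact_crime (
        date     DATE,
        category VARCHAR(64),
        district VARCHAR(64),
        X DOUBLE,   -- longitude
        Y DOUBLE    -- latitude
    )""")
conn.commit()
```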

We anticipated using the date as a primary key in each of the dimension tables as well as the fact table. As it turned out, the date was defined differently in each of the datasets: not only did the formats vary, but the “Date” feature sometimes included the time, separated by a space (in both 24h and 12h format). In the crime dataset, which lies at the heart of our project, the “Date” feature included both date and time. That level of specificity proved problematic given our desire to merge: for a crime committed at 2015-05-11 22:12:00, it’s highly unlikely we’ll find an eviction, a 311 report, and specific weather data for that exact moment. We ran into a similar issue with the coordinates; there is only so much data you can find for the specific location -122.41947048851499, 37.762270173991396. As a quick fix we decided to create our own IDs for each observation as we inserted them into SQL tables. However, we will still have to decide how to group our data, which is unavoidable given how precise it is. We are currently workshopping the idea of grouping by date (year/month/day) and district, which would give us a record of how many crimes, evictions, reports, rainy moments, etc. occurred on a given day in a given area.
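As a rough sketch of the fix we’re leaning towards (column names are illustrative), pandas can normalize the mixed date formats, we can hand out our own surrogate IDs, and everything can then be rolled up to one row per day and district:

```python
import pandas as pd

# Two rows mimicking the inconsistent "Date" formats we ran into.
crimes = pd.DataFrame({
    "Date": ["2015-05-11 22:12:00", "05/11/2015 10:12 PM"],
    "PdDistrict": ["MISSION", "MISSION"],
})

# Parse each value independently so the mixed 24h/12h formats don't trip us up.
crimes["Date"] = crimes["Date"].apply(pd.to_datetime)

# Our own surrogate ID, since no natural key survives across datasets.
crimes["incident_id"] = range(1, len(crimes) + 1)

# Collapse to the grain we're considering: one row per (day, district).
crimes["day"] = crimes["Date"].dt.date
daily = (crimes.groupby(["day", "PdDistrict"])
               .size()
               .reset_index(name="crime_count"))
print(daily)  # one row: 2015-05-11, MISSION, 2
```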

As we passionately filled SQL tables with rows of information we began to wonder: should ETL become ELT? The traditional setup is intuitive: data is often messy and disorganized, which calls for transformation before loading. However, in our case we are working with datasets from websites specifically created for data analysis. Though the formatting does not always match our particular needs, it is, at the end of the day, systematic. As a consequence, we chose to load the datasets as they were, modifying and reformatting the data only upon retrieval. Though many are far too extensive and contain features we most likely won’t make use of, it’s best to have them all in case our model changes or our framework expands.
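In practice that “load first, reshape later” step is pleasantly boring. Here is a sketch of the pattern; the file names beyond train.csv, the table names, and the connection string are all placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string for our MySQL instance.
engine = create_engine("mysql+mysqlconnector://datadivas:***@localhost/crime")

# Load: each raw CSV goes into its own table, completely untouched.
for path, table in [("train.csv", "raw_crime"),
                    ("evictions.csv", "raw_evictions"),
                    ("cases_311.csv", "raw_311")]:
    pd.read_csv(path).to_sql(table, engine, if_exists="replace", index=False)

# Transform: reformatting happens on the way out, pulling only what a model needs.
features = pd.read_sql(
    "SELECT Dates, Category, PdDistrict, X, Y FROM raw_crime", engine)
```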

Our ETL process will be an iterative one: as we discover new factors that might affect crime, we will likely look for new data to incorporate into our project. The challenge we will inevitably face is how to organize our star schema, and in particular our fact table. This ultimately depends on what specific units we want to look at: each individual incident of crime, all crimes in one day and area, specific types of crime in a month, etc. Seeing as we are in the initial, exploratory stage of the project, we have the luxury of not having to specify that just yet; instead it’s best to examine the data and, as we become more informed, figure out the best way to organize it.


Update

Upon closer examination of some crime prediction source code we found, we were shocked to discover how much of the data cleaning/wrangling and pre-processing was done in SQL. This opened our eyes to the many exciting possibilities that SQL opens up and forced us to rethink how we were organizing our data. Though it still makes sense to load the entire datasets as they are, we will have to create many new tables based on the raw data we have, as well as rethink some of the variables.
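As a small example of what that could look like (building on the placeholder raw_crime table from the sketch above), the day-and-district rollup we previously did in pandas fits comfortably in a single query that writes a brand new table:

```python
import mysql.connector  # assumed MySQL driver; credentials are placeholders

conn = mysql.connector.connect(host="localhost", user="datadivas",
                               password="***", database="crime")
cur = conn.cursor()

# One derived table built entirely inside MySQL: bucket the raw incidents
# by day and district and count them; no pandas involved.
cur.execute("""
    CREATE TABLE IF NOT EXISTS crimes_daily AS
    SELECT DATE(Dates) AS day,
           PdDistrict  AS district,
           COUNT(*)    AS crime_count
    FROM raw_crime
    GROUP BY DATE(Dates), PdDistrict
""")
conn.commit()
```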

Seeing as all of our datasets have a spatial component which is very important to our project as a whole, a big issue was figuring out how to group them together. Initially we decided to go by neighbourhood, but that turned out to be too broad: our project becomes more powerful the more specific it is, that is, the more precisely we can predict the location of a future crime. Consequently, we decided to organize our data by census tract. Thankfully, MySQL turns out to have a plethora of built-in functions that handle spatial analysis, in particular testing spatial relationships (in this case, determining whether a specific set of coordinates belongs to a particular census tract) and spatial indexing.
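Here’s a hedged sketch of that point-in-tract test (the census_tracts table, its boundary geometry column, and the credentials are all assumptions on our part):

```python
import mysql.connector  # assumed MySQL driver; credentials are placeholders

conn = mysql.connector.connect(host="localhost", user="datadivas",
                               password="***", database="crime")
cur = conn.cursor()

# Which census tract does this crime location fall into?
lon, lat = -122.41947048851499, 37.762270173991396
point_wkt = f"POINT({lon} {lat})"  # well-known text, X (longitude) first

cur.execute("""
    SELECT tract_id
    FROM census_tracts
    WHERE ST_Contains(boundary, ST_GeomFromText(%s))
""", (point_wkt,))
print(cur.fetchone())
```

The spatial indexing mentioned above boils down to adding a SPATIAL index on the geometry column, so lookups like this stay fast once every crime, eviction, and report needs a tract assigned to it.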

The next steps are to create tables for each specific category (crimes, evictions, 311 reports, etc.) that contain census tract information.
