10 things I learned doing my first data science project

10 and a half things, really.

Johanan Ottensooser
Datalogue

--

Johanan Ottensooser, CSO of Datalogue, discovers how to data science the hard way.

I’m working on a data science project — using Datalogue tools to make my decision about which neighborhood in NYC to live in.

But this is my first data science project.

And my first time using data science libraries (specifically numpy and pandas).

And my first time messing with data outside of Excel modeling (Professor Juran, sorry for defecting!).

So I made a bunch of mistakes, and I learned a lot.

1. Know what you want to look for, not what you want to find 🔮

Define your problem and your method, not your solution.

My first data science project was trying to find which neighborhood I should live in. I concede, I have a beard and a bike, but it wouldn’t be statistically rigorous for me to choose Williamsburg without collecting data.

[Image: Dramatic reenactment of me leaving my future apartment, filled with happiness because my decision was made with data. Resemblance is strong.]

So I decided what I wanted: the best neighborhood for me given certain constraints.

To make this decision data driven, I decided what defined “best” to me: things you would expect from a bearded cyclist: cafes, art, bars, etc.

Tune in next week to see where the data pointed me to. It’s a bit of a curveball.

So keep an open mind with your data science, and you might just be surprised.

2. Look at the data 👀

Not just in the abstract. Really look at the data.

I spent two hours working through a couple of datasets (one mapping census tracts to neighborhoods, one mapping census tracts to zip codes), figuring out how to integrate them with my (exceptionally simple) dataset-smooshing algorithm. I wanted to join them on their census tract columns (census tracts are the way the census refers to chunks of land) to create a dataset with both zip codes and neighborhoods. Sounds easy, no?

If I’d looked at the data dictionary (Socrata’s TL;DR of the contents of the dataset), I’d have known that one dataset used 2010 census tracts and the other used 2017 census tracts.

Not looking at the data gave me a join with zero hits: the ’10 and ’17 census tracts were completely different, so the join failed completely.

Actually looking at the data first would have saved me two hours and just a smidgen of dignity.
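Here’s a minimal sketch of the failure mode in pandas. The tract IDs and column names are made up; the real datasets live on Socrata:

```python
import pandas as pd

# Hypothetical stand-ins for the two real datasets.
tracts_to_neighborhoods = pd.DataFrame({
    "census_tract": ["36047000100", "36047000200"],  # 2010 tract IDs
    "neighborhood": ["Williamsburg", "Greenpoint"],
})
tracts_to_zips = pd.DataFrame({
    "census_tract": ["36047990100", "36047990200"],  # 2017 tract IDs
    "zip_code": ["11211", "11222"],
})

# Actually look at the data *before* joining.
print(tracts_to_neighborhoods.head())
print(tracts_to_zips.head())

# An inner join on mismatched tract vintages silently returns nothing.
joined = tracts_to_neighborhoods.merge(tracts_to_zips, on="census_tract")
print(len(joined))  # 0: the two-hour mistake
```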

3. The dataset you are looking for usually exists, just search 🔭

The strangest and most beautiful datasets are floating around the internet.

I found every single dataset I needed. Sometimes, like with the tracts example, the data didn’t do what I wanted it to. But, then I found another dataset that did exactly what I wanted, in the way I wanted.

There’s a lot of rich data available.

Places to look:

  • Don’t discount some good Google-fu
  • Socrata is awesome, especially NYC’s open data portal (and Socrata works with Datalogue tools out of the box)
  • WNYC has some great datasets, especially about quality of life
  • More companies than you think have APIs that allow you to grab at least some data; it’s worth a shot

4. Python notebooks mean you don’t need to repeat the slow parts 🐍📓

I know, you feel a little bit like Neo when you are calling your python scripts from terminal (or, if you have Bryan in your team, from your awesomely customized Oh My ZSH skin in iTerm).

But that runs everything at once. And when you are iterating a function through hundreds of thousands of lines, waiting gets boring.

Notebooks aren’t just a pretty way to show / share your code. The way that they run your script can save you literally hours (they only re-run the parts of the script that you ask them to).

I used Jupyter Notebooks. It was awesome.

But … you don’t get as good syntax highlighting or as good treatment of your internally defined functions as you would in an IDE (I use IntelliJ when I Scala). It’s up to you to decide the trade-off.

4A. But don’t forget to re-run the notebook when you change things 🤦‍

Only running one chunk of code can save you a bunch of time. But it can also lead to some embarrassing and time consuming mistakes.

Say you import a dataset, then transform it, then analyze it.

The analysis isn’t working, so you debug and change both the transform and the analysis. But you only re-run the analysis, and you’re confused as to why nothing is working. It’s because you didn’t re-run the transformation.

It wasn’t just me who fell into this trap — one of our engineers was helping me with the slightly awkward syntax for editing pandas data frames, and we spent about 15 minutes trying to figure out where we went wrong before we re-ran the whole script. Ta-dah! Fixed. But, just a touch embarrassing for both of us.

5. Functions of functions can be more readable and more usable than megafunctions ⚙️(⚙)️ > ⚙️⚙️⚙️

When you have functions that do a whole bunch of things, it is often better (and definitely more readable) to make a few functions that each do one thing, and create a master function that calls those functions.

This is easier to understand.

[Image: Me trying to debug an enormous function that I should have made using lego-like pieces.]

It also allows you to use your code in a bit more of a lego-like fashion, reusing blocks where appropriate.

I know this is like programming 101, and I’d heard this before. But this is the first time I had suffered because of it. Trying to work my way around an enormous function to fix a bug became so much easier when I split the function up politely.
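Here’s a minimal sketch of the idea, with hypothetical names and a made-up load, clean, score pipeline:

```python
import pandas as pd

def load_data(path):
    """Read a raw CSV into a dataframe."""
    return pd.read_csv(path)

def clean_data(df):
    """Drop rows that are missing a neighborhood."""
    return df.dropna(subset=["neighborhood"])

def score_data(df):
    """Count features (cafes, bars, ...) per neighborhood."""
    return df.groupby("neighborhood").size()

def rank_neighborhoods(path):
    """The master function: each step is one debuggable lego brick."""
    return score_data(clean_data(load_data(path)))
```

Each brick can be tested on its own, which is exactly what saves you when a bug shows up.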

While we’re at it: comment! Comment a lot! It’s better than not commenting and confusing the hell out of everyone.

```python
# Comments make your code readable.
"""So are docstrings."""
# This is important.
```

6. When you need to use variables, make their names human-understandable 🤷

If you don’t need a variable, don’t use a variable. It adds one more layer of abstraction, and it forces whoever is working with your code to read back to where the variable was defined to understand what it means.

Speaking of which (and I totally haven’t mastered this at all), try to use variables whose names make sense. When you are writing, you might totally get what comp_var_col means. But no-one else will. Write comparison_value_column, and if you can’t be bothered typing, use a fancy text editor with variable autocompletion (IntelliJ does this; so do Atom, Sublime and pretty much all other code-first text editors).
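For instance (with a made-up column name):

```python
# Cryptic: you have to scroll back to work out what this means.
comp_var_col = "cafe_count"

# Human-readable: it documents itself.
comparison_value_column = "cafe_count"
```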

7. Default arguments are your friend 🔀

Speaking of which: sometimes a function only sometimes needs to do something.

Do I need to count entries in a database, or do I need to count entries that include the word empanadas?


The same function can do both! Set compare_boolean = False as the default argument, and have that run through the “vanilla” counting part of the function—counting entries in a dataframe.

If the user enters compare_boolean = True, the if statement takes you through the second part of the function, counting empanadas within the dataframe.

There’s a sketch below, if you want to see what it looks like in action.

That way you can call the same function on either usage! Much cleaner!
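A minimal sketch of the trick; the function and column names are made up:

```python
import pandas as pd

def count_entries(df, column, compare_boolean=False, keyword="empanadas"):
    """Count rows in a dataframe, or only the rows mentioning a keyword."""
    if compare_boolean:
        # Second path: count only entries containing the keyword.
        return df[column].str.contains(keyword, case=False, na=False).sum()
    # The "vanilla" path: count every entry.
    return len(df)

restaurants = pd.DataFrame({"name": ["Empanadas Monumental", "Joe's Pizza"]})
count_entries(restaurants, "name")                        # 2
count_entries(restaurants, "name", compare_boolean=True)  # 1
```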

8. Be careful how you store your data 📚

Python has quite a few ways to store data. And outside of Python, there are more database types than I know what to do with.

But they aren’t the same at all, and they aren’t read by a computer in the same way.

They optimize for different things!

I used lists where I should have used dictionaries. Nicolas Joseph slapped me on the wrist and told me to watch my data types.

Lists are good for cycling through. Dictionaries are better for searching through. The effect on performance can be tremendous.
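A toy illustration of the trade-off:

```python
# 100,000 key/value pairs, stored two ways.
pairs = [("tract-{}".format(i), i) for i in range(100_000)]
as_list = pairs              # a list of (key, value) tuples
as_dict = dict(pairs)        # a dictionary keyed by tract

# List search scans element by element: O(n) per lookup.
value = next(v for k, v in as_list if k == "tract-99999")

# Dictionary search hashes the key: O(1) on average per lookup.
value = as_dict["tract-99999"]
```

Do a few hundred thousand of those lookups and the difference is minutes versus seconds.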

9. Loops of loops suck, really, they suck (or less code ≠ faster code) 🎢

If you ever find yourself looping through a data frame, and within that loop, looping through another one, you’re going to have a bad time.

This causes the number of iterations to multiply.

Going through a dataset bigdata of 70k rows, whilst looping through another dataset littledata of 1,000 rows for each row of bigdata, means 70 million iterations.

Going through bigdata, creating a dictionary of results and matching the results to littledata has only 71k iterations.

This isn’t just theoretical. Getting rid of this turned one of my processes from taking 15 minutes (no joke) to about 5 seconds. Crazy.

No loops in loops!

The slow version looped through a dataset `tally_dataset` and, for each row, searched through `feature_dataset`. This is bad: it took my computer 15 minutes to run.

The fast version looped through `feature_dataset` once, recording the results in a dictionary (I am creative, I called the dictionary `dictionary`), and only after looping through `feature_dataset` completely did it record the results from the dictionary to `tally_dataset`. That’s 17 lines instead of 10, but it ran through the same data in 5 seconds instead of 15 minutes. Orders of magnitude faster.
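A minimal sketch of both versions, with tiny hypothetical dataframes standing in for my real data:

```python
import pandas as pd

feature_dataset = pd.DataFrame(
    {"neighborhood": ["Williamsburg", "Astoria", "Williamsburg"]})
tally_dataset = pd.DataFrame({"neighborhood": ["Williamsburg", "Astoria"]})

def tally_slow(tally, features):
    """Loop in a loop: len(tally) * len(features) iterations."""
    counts = []
    for _, tally_row in tally.iterrows():
        n = 0
        for _, feature_row in features.iterrows():
            if feature_row["neighborhood"] == tally_row["neighborhood"]:
                n += 1
        counts.append(n)
    return tally.assign(count=counts)

def tally_fast(tally, features):
    """One pass to build a dictionary, one pass to read it back."""
    dictionary = {}
    for _, feature_row in features.iterrows():
        key = feature_row["neighborhood"]
        dictionary[key] = dictionary.get(key, 0) + 1
    counts = tally["neighborhood"].map(dictionary).fillna(0).astype(int)
    return tally.assign(count=counts)

print(tally_slow(tally_dataset, feature_dataset))  # same answer, slowly
print(tally_fast(tally_dataset, feature_dataset))  # same answer, quickly
```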

Note, I totally didn’t obey my own rule of naming variables in a human way in the code above.

10. Don’t be scared to ask for help 🙋

Every data project is special, but the problems you face whilst doing it probably aren’t.

A cursory Google search will send you to the right Stack Exchange, the right documentation or the right blog post in seconds. Asking a mate (I ask Nicolas Joseph way too much) can save you hours.

Don’t spend your time solving a problem that someone else has already solved. Ask around, get help, and focus on what makes your project interesting.

Should you start your own data science project?

Yes.

It is fun. It will teach you about data. About programming. And, if you do a project like mine, a little bit about your city.

One word of warning: there’s a cliché that coding is mechanical and cold. I agree with Anil Dash: it isn’t, it’s an emotional rollercoaster.

Tune in next week for my conclusive, decisive and data-driven answer to the age-old question: which neighborhood would Johanan like to live in most, according to his own completely arbitrary criteria and publicly available data? More data adventures with Datalogue to come.

[Image: A completely accurate emoji representation of this project.]
