I get asked a lot by folks looking to break into data science which tools I use and would suggest learning. So I figured the topic deserved a post of its own.
First, an aside:
Tools are a means, not an end. They are there to help you achieve a goal; they are not goals in and of themselves. So don’t use a complicated machine learning model when a simple count will do- this exhibits poor judgement, especially in commercial contexts. But the right tool wielded at the right time will make you faster, and in some cases solve hitherto unsolved problems.
 Having said that, tools I use regularly:

  1. Databases (mostly Postgres). I list this first because the first thing I do when I get access to a new dataset is load it into a database. That way I can easily poke at it (SQL is great for this) and extract counts of the various entities of interest. I also get it structured for reuse- I never know when I (or one of my teammates) might need the same data for a different project. I like Postgres for most things: it’s free, reliable, and rich enough for almost any exploratory question.
  2. Math (mostly Conditional Probability). Once I’ve got counts/frequencies I turn to extracting information. Each problem has its own flavor, so I’ll spend some time building the set of metrics and methods that make sense for the one at hand. I don’t use fancy methods or algorithms unless I absolutely have to, but the way I calculate a correlation might vary from problem to problem, especially if I need specific conditioning. I’ll always do some analysis (of the mathematical variety) to understand how different metrics might perform and how they will be affected by the assumptions I’m making.
  3. Python. When I have a method I’m comfortable with, I’ll want to implement it so that I can (a) run simulations under conditions similar to those I’ll use for the analysis and (b) bring it to bear on the data. Python is my go-to, for a number of reasons, the primary ones being:
    - It has a mature set of analytical/scientific libraries.
    - It allows a clear path to production (R in particular lacks this).
    - It’s very easy to modularize code for use in other projects.
    - It’s free.
    (Quick aside on Python v. R: I know people have strong opinions on this, and I’ve written up mine. I don’t think it’s something folks early in their development should get caught up in.)
  4. Version Control/Centralized Repositories. Once complete, I’ve got an analysis built on structured data. I might need to ship it as a data product. I might need to share it. I might need to run it later on a different machine. It might contain useful modules I can reuse for different things. Git/GitHub is my current choice for managing all of this- putting my code in a central location for change management, sharing, and deploying.
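To make step 1 concrete, here’s a minimal sketch of the load-then-count workflow. I use Python’s built-in sqlite3 module as a stand-in so the example is self-contained; with Postgres you’d do the same thing through a driver like psycopg2. The table and rows are hypothetical.

```python
import sqlite3

# Stand-in for Postgres: sqlite3 ships with Python, so this sketch runs
# anywhere. The events table and its rows are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, action TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(1, "click"), (1, "view"), (2, "click"), (3, "view"), (3, "view")],
)

# Once the data is in a database, SQL makes poking at counts of the
# entities of interest trivial.
for action, n in conn.execute(
    "SELECT action, COUNT(*) FROM events GROUP BY action ORDER BY action"
):
    print(action, n)  # click 2, then view 3
```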
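For step 2, the simplest case of conditional probability is just a ratio of counts. A toy example with made-up numbers:

```python
# Hypothetical counts: estimate P(purchase | visit) as
# count(visit and purchase) / count(visit).
visits = 200              # users who visited the page
visits_with_purchase = 30  # of those, users who then purchased

p_purchase_given_visit = visits_with_purchase / visits
print(p_purchase_given_visit)  # 0.15
```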
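And for step 3, a minimal sketch of what I mean by running simulations: generate data under a known assumption and check that your metric recovers it. The rate and sample size here are illustrative.

```python
import random

random.seed(0)  # reproducible draws

# Assume a known "true" rate, simulate data under it, and check that
# the count-based estimate behaves the way the math says it should.
TRUE_P = 0.3
N = 10_000

draws = [random.random() < TRUE_P for _ in range(N)]
estimate = sum(draws) / N
print(round(estimate, 3))  # should land close to 0.3
```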

That said, I don’t think folks should over-index on my experience- there is a very specific confluence of events that brought me to my preferences. Instead, I advocate the following takeaways:

  1. Learn to take generalized approaches to storing data, so it can be easily accessed and reused. Files aren’t a great medium for storing data you plan on using programmatically. Spending some cycles learning how to structure data for access and reuse pays huge dividends.
  2. Understand the mathematical foundations. It’s really easy to grab various libraries and shove your data into them to get a “result”. But, as software engineers are fond of saying, “Garbage in, Garbage out”. Working through the mathematical foundations of methods you plan to use and how they will apply to your problem, even in simple cases, will take you very far.
  3. Use open source tools. I see no earthly reason anyone should be paying for analytical software in this day and age. The open source tools (Python, R, Julia, and the like) are that good. Moreover, they are improving at an incredibly fast rate, since major technology players are highly invested in their development. You risk falling behind if you’re not using them.
  4. Learn to modularize so code can be reused and shared. Using version control forces you into this mindset, since you don’t want your repos to proliferate. But seeing the commonalities among the things you’re doing and building modular components to take advantage of them is a highly valued skill. Again, a thing that will take you a long way.
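On modularizing: the mindset is just to pull shared computations into small importable functions rather than copy-pasting them between analyses. A hypothetical sketch:

```python
# Hypothetical shared helper: once a calculation shows up in two
# projects, pull it into a module both can import rather than
# copy-pasting it between notebooks.
def conditional_rate(joint_count, condition_count):
    """Estimate P(A | B) from count(A and B) and count(B), guarding against /0."""
    if condition_count == 0:
        return 0.0
    return joint_count / condition_count

print(conditional_rate(30, 200))  # 0.15
```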

I hope this is helpful to those entering our field or looking to upskill to reach the next level in their careers. Feel free to ping me if you’re in this phase and have questions- I’m happy to help where I can :-).
