Automation: Or Ways Not to Waste the Talents of Your Pricey Data Scientists
Four Tools That Free Your Data Scientists to Focus on What Will Add Value to Your Business
Automation is the second key pillar of a well-run, efficient and useful data science (or, interchangeably, analytics) department within any business. Over the years I’ve found it to be one of the most common weaknesses. I spent a few years in consulting, and I was consistently amazed to watch organization after organization solve its data science challenges by simply “throwing more people” at the department instead of looking for a better way to use the people it had. I saw the same thing at the first business I joined after consulting: sharp analytical minds wasted on pressing ctrl-x/ctrl-v to move data around, or on building a dozen scattered pivot tables across Excel sheets (after a few painful vlookups) to join data sets and calculate some basic metric. Not to mention that much of what passed for data science there was dozens of variations on Kindergarten-level counting (being able to count “1..2..3..4..5… 1,001 sales!” doesn’t make you a data scientist; it just means you passed Kindergarten). It won’t surprise you that very little useful data science got done at that place, because so much time was wasted on nonsense. If your organization is spending significant chunks of your data scientists’ time on basic calculations, you are on the wrong path.
There is a better way. And the best part is that it doesn’t require a million dollars of software (I’m looking at you, SAS!) and a dozen data scientists. It requires just one or two data scientists with experience in a scripting language (e.g., Perl, Ruby, Python; I’m very fond of Perl), a SQL package like MySQL (free!), and R or Python (probably the best analytics software available for 99% of business analytics problems). Combine these three and you get a very robust, flexible, powerful and cost-effective automation platform. But what do I mean by automation? In plain English, automation means replacing a data scientist with a computer for the tedious, complicated but necessary data flows (from point A to point B) and translations (from one structure or format to another), plus the basic, standard and routine analysis a data scientist would otherwise waste his or her time doing (and, more likely than not, get wrong, because tasks that can be automated tend to be poorly suited for humans but ideal for computers). Data scientists are paid to think, not to copy data across spreadsheets and apply simple calculations.
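To make the idea concrete, here is a minimal sketch of that kind of automation in Python, using only the standard library. The file contents and column names are hypothetical; the point is that a format translation (semicolon-delimited export to clean CSV) plus a routine calculation (revenue per row) becomes a script that runs the same way every time, instead of a copy-paste exercise.

```python
import csv
import io

# Hypothetical input: a semicolon-delimited sales export from one
# system that needs to become a standard CSV, with a computed
# revenue column, for another system. In practice you would read
# from and write to real files rather than in-memory buffers.
raw = io.StringIO(
    "date;units;unit_price\n"
    "2024-01-05;3;19.99\n"
    "2024-01-06;5;19.99\n"
)

# Translate the format and apply the routine calculation in one pass.
reader = csv.DictReader(raw, delimiter=";")
rows = []
for row in reader:
    row["revenue"] = round(int(row["units"]) * float(row["unit_price"]), 2)
    rows.append(row)

out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["date", "units", "unit_price", "revenue"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

A dozen lines like these, scheduled to run nightly, replace the ctrl-x/ctrl-v and vlookup work entirely, and they never mistype a number.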
I’ve spent 19 years in the workforce, split between highly technical work as an engineer in the defense industry, strategy consulting to private and public organizations, and analytics management in the private sector. Between those almost 20 years and the prior eight years I spent in academia, I’ve been exposed to practically every analytics tool that exists, including Mathematica, Matlab, Octave, Scilab, R, custom C/C++, Java and so on. I’ve come to the conclusion that you really only need four:
- Perl or Python — They handle the problem of moving data across systems and translating between formats easily, efficiently and quickly
- SQL — Nothing beats SQL for storing and accessing data (and with MySQL it’s free)
- R (or Python) — The best (really!) tool for data science
- QlikView/Tableau — The best options today (though both are far from perfect) for visualizing and exploring data and analytics; not the place to actually do the data science, but they shine as interfaces
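As a sketch of how the scripting and SQL pieces of this stack fit together, the snippet below loads rows into a table and lets SQL do the routine aggregation. It uses Python’s built-in sqlite3 as a stand-in for MySQL (the table and query names are invented for illustration; in production you would point a MySQL driver at your actual server, and the SQL would be essentially the same).

```python
import sqlite3

# sqlite3 stands in for MySQL here so the example is self-contained;
# the SQL itself is what matters. Table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 120.0), ("south", 80.0), ("north", 40.0)],
)

# The routine aggregation lives in SQL, not in a pivot table:
# one query replaces the manual counting and summing.
totals = dict(
    conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")
)
# totals now maps each region to its summed amount
conn.close()
```

From there, the aggregated table is what you hand to R (or Python) for the actual analysis, and to QlikView/Tableau for exploration; the storage and the arithmetic never touch a spreadsheet.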
With these four tools, you can automate 80% of your organization’s data science and free your data scientists to focus on much higher-value, complicated analysis. We will tackle each of these in greater detail in future posts.