From literature review to production in 8 weeks

Ben Thomas · Published in The Startup · Jul 19, 2019

One of the most common questions I’ve been asked by recruiters, interviewers and even fellow Data Scientists is ‘how many models have you actually taken to prod?’. It may come across as a bit of a weird question, but I think we’d all be surprised by how many Data Scientists have worked in the field for years without taking anything significant to production. The reason behind the question makes more sense once paired with an understanding of the optimistic nature of the Data Scientist. Over the past few years of working in Data Science alongside fellow Data Scientists, I have come to the conclusion that we are an optimistic breed: we tend to overcommit to work and then under-deliver. Not because of any flaw in Data Scientists themselves, but because we oversimplify the work and are often too eager, jumping straight into modeling without fully understanding the problem at hand or which approach is best to take. However, as interesting a topic as the over-optimistic Data Scientist may be, it is not the reason for this post. This post aims to combat the overcommitting and under-delivering side of Data Science by adding some structure to an 8-week delivery cycle. With that, let’s begin.

Week 1: Literature review

The most important step in tackling a Data Science task is making sure that you are able to define the problem. A literature review does exactly this, bringing you up to speed with why the problem exists and what techniques have been used to address it. The literature review needs to be extremely thorough, to ensure that you have no blind spots and that you have as much insight into possible solutions as you can get. Make sure that you go in depth into the inner workings of the approaches presented, understanding as much of the mathematics and statistics as you can, so that when you make your decision you understand not only the what, but the why too. I cannot stress enough the importance of the literature review and of having several candidate solutions. Undoubtedly some of the solutions will fail (and for various reasons at that), so it is important to always explore more than one solution when carrying out the literature review.

Week 2: EDA

So, what’s next after a literature review? Well, EDA, or Exploratory Data Analysis. Once we have defined the problem and listed a handful of solutions to address it, we need to understand our data. EDA allows us to do exactly this: to holistically understand the strengths and weaknesses within our data. Based on these strengths and weaknesses, some of the solutions explored during the literature review will become more or less relevant. Once again, this step needs to be rigorous; don’t merely understand the data at face value, but understand it at a level of granularity such that you immediately know which solutions are feasible and which are not.
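As a minimal sketch of what that first pass might look like (the columns and values below are entirely made up for illustration, not from any real project), a few lines of pandas already surface shape, missingness and how features differ across the label:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for your real data.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 58, 23],
    "income": [42_000, 55_000, 61_000, np.nan, 120_000, 38_000],
    "churned": [0, 0, 1, 1, 1, 0],
})

print(df.shape)                      # rows x columns
print(df.isna().mean())              # fraction missing per column
print(df.describe())                 # basic distributions
print(df.groupby("churned").mean())  # how features differ across the label
```

The last line is often the most telling: if the per-class means barely differ, simple linear solutions from your shortlist become less attractive.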

Week 3: Feature Extraction/Engineering

Once we have understood the problem as well as our data, we need to start exploring all viable solutions. Here we start off by massaging the data so that it can work well with the solutions we shortlisted during the literature review process. We need to massage the data and extract as many features as possible from it. More often than not, we can only extract so many features from the data, so this extraction process needs to go hand in hand with feature engineering. Try to engineer features which add significant value to your model by further separating the classes or data points from one another.
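To make the distinction concrete, here is a short sketch of extraction versus engineering. Every column name below (`signup_date`, `n_orders`, and so on) is hypothetical, chosen purely to illustrate the idea:

```python
import pandas as pd

# Hypothetical raw data; the columns are illustrative only.
df = pd.DataFrame({
    "signup_date": pd.to_datetime(["2019-01-05", "2019-03-20", "2019-06-01"]),
    "last_seen": pd.to_datetime(["2019-07-01", "2019-07-10", "2019-07-15"]),
    "total_spend": [120.0, 0.0, 450.0],
    "n_orders": [4, 0, 9],
})

# Extraction: features read directly off existing columns.
df["tenure_days"] = (df["last_seen"] - df["signup_date"]).dt.days

# Engineering: combinations intended to separate the classes better.
# .where() turns a zero order count into NaN so we avoid dividing by zero.
df["spend_per_order"] = df["total_spend"] / df["n_orders"].where(df["n_orders"] > 0)
df["is_active"] = (df["n_orders"] > 0).astype(int)

print(df[["tenure_days", "spend_per_order", "is_active"]])
```

A ratio like `spend_per_order` is the kind of engineered feature that can pull apart points which raw `total_spend` alone would leave entangled.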

Week 4: Model Development

Finally, we can start building the models. We still don’t know exactly which models will do better than others, but hopefully we have a shortlist of around two or three based on the work done in the previous three weeks of the cycle. Fit your models to the data and fine-tune them as much as possible. In most cases a week should be enough, because most Data Scientists will not be building models from scratch but rather using a framework such as PyTorch, Keras or TensorFlow. This week should allow you to build a couple of models, make a fair comparison between the approaches, and determine which one performs best.
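One way to keep that comparison fair is to evaluate every shortlisted model with the same cross-validation splits. A sketch using scikit-learn and synthetic data (the two candidates here are placeholders; your literature review drives the actual shortlist):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the real project data.
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Illustrative shortlist of candidate models.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Identical CV folds for each candidate keep the comparison apples-to-apples.
results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} (+/- {scores.std():.3f})")
```

Picking the metric (accuracy here, but often F1, AUC or a business metric) before running the comparison avoids cherry-picking a winner after the fact.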

Week 5–6: Architecture development

Now that we have found and implemented a solution to the problem, we need to start building the architecture around the model to support it in production. The complexity of the architecture will vary from one case to another. Sometimes the challenge might not be too complex, and so a mere two weeks is sufficient; on other occasions the challenge might be so complex that the architecture development process needs to take longer, or even run in parallel with the prior five weeks. If this is the case, it is of the utmost importance to work hand in hand with the architects. Set up an agreed-upon foundation for the architects to start building whilst you carry out the literature review, and only once a model has been chosen should you finalize the finer details surrounding the architecture (such as the format of a feature store).

Week 7: Testing

To summarize what we have done thus far and what we still need to do: we have a model and the relevant architecture surrounding it, so next we need to start testing the model in a dev environment. We want to mimic a production environment as closely as possible so that testing is a fair reflection of what we can expect in production. Teething issues will undoubtedly be experienced, and are in fact expected; the purpose of this week is to sort them out before going live to prod.
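A cheap way to catch those teething issues early is a small smoke-test harness that throws production-shaped inputs, including the awkward ones, at your inference entry point. Everything below is hypothetical: the `predict` stub, its threshold, and the feature name are placeholders for your real serving code:

```python
def predict(features: dict) -> int:
    # Stand-in for loading and calling the real trained model;
    # the feature name and threshold are purely illustrative.
    return 1 if features.get("spend_per_order", 0) > 40 else 0

# Cases chosen to mirror real production traffic, including the
# malformed inputs that typically cause teething issues.
cases = [
    ({"spend_per_order": 50.0}, 1),  # happy path
    ({"spend_per_order": 0.0}, 0),   # boundary value
    ({}, 0),                         # missing feature, as live traffic may send
]

for features, expected in cases:
    assert predict(features) == expected, f"smoke test failed on {features!r}"
print("all smoke tests passed")
```

The missing-feature case is the one most often forgotten in dev and most often hit on day one of prod.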

Week 8: Production

It’s prod time, baby! Although we have tested in a dev environment, further hiccups are still expected. Spend the week setting up a good production support system so that if (when) such issues arise, steps are already in place to kickstart a debugging process. Things like daily checks and automated scripts that flag when a system or subsystem is down are all key aspects to consider when setting up the production system.
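The daily-check idea can be sketched as a tiny script run on a schedule. The two checks below are stubs, standing in for whatever a real deployment would query (an HTTP health endpoint, a row count in the predictions table, and so on):

```python
import datetime
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("daily_check")

def model_endpoint_up() -> bool:
    # In practice: an HTTP health check against the serving endpoint.
    return True

def predictions_written_today() -> bool:
    # In practice: a query counting today's rows in the predictions table.
    return True

# Name -> check function; add one entry per subsystem worth watching.
CHECKS = {
    "model endpoint responding": model_endpoint_up,
    "predictions written today": predictions_written_today,
}

def run_daily_checks() -> list:
    failures = [name for name, check in CHECKS.items() if not check()]
    for name in failures:
        log.error("CHECK FAILED: %s", name)  # wire an alert or pager here
    log.info("%s: %d/%d checks passed",
             datetime.date.today(), len(CHECKS) - len(failures), len(CHECKS))
    return failures

run_daily_checks()
```

Returning the list of failures (rather than just logging) makes it trivial to bolt on alerting later without rewriting the checks themselves.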

With that, you’ve managed to take a problem from start to finish to prod in a mere 8 weeks. Your solution might not be perfect (in fact, it probably won’t be), but it’s in production and it is adding value. So what’s next? Well, now that you have something tangible in production, start to incrementally improve on it. Spend some time reflecting on what is working well and what isn’t, then spend another cycle implementing these changes while refining any shortfalls of your solution. Since you have all the necessary systems in place, you should be able to make rapid (but effective) seamless changes, seamless to the point of probably being able to push a new version to production on a weekly basis.

Although this post proposes taking a Data Science problem to prod in 8 weeks, it is merely one of many possible approaches, one which happened to work for my team and me. This approach won’t work for everyone, guaranteed, but hopefully it might just work for you.
