The Mayans’ Lost Guide To Doing Data Science In Fast Paced Startups

Saurav Agarwal
Published in HealthifyMe Tech
Sep 12, 2020

A machine learning model is useful only if it’s put into production at the right time.

No kidding!

The question is: what are the key practices that go unsaid when working in a fast-paced, highly chaotic environment? Are AI foundations and a computer science degree enough to build scalable models in real life? It’s not as easy as eating cotton candy.

Driven by intellectual curiosity, I decided to review my past experience working with Machine Learning and Deep Learning models. In this article, I’ll draw on one of my recent projects, where the aim was to predict a user’s purchasing probability for our various offerings. The purpose of this project was to help the Performance Marketing and Growth teams at HealthifyMe evaluate the efficacy and efficiency of our sales engines. Below are the insights I took away from it on what one needs to strive towards while solving data problems:

  1. Writing code that’s modular and reusable. Data cleaning and preprocessing take up more than half of a modeling journey. Writing generic, parameterized functions greatly reduces redundant work and makes debugging easier.
    At HealthifyMe, we created many generic public transformers for this, which made further iterations of the model quicker and simpler. And the best outcome: they can be reused for any standard data modeling in the future!
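A generic, parameterized transformer of the kind described above might look like the sketch below. It assumes pandas and follows scikit-learn’s fit/transform convention; the class, column names, and capping rule are invented for illustration, not HealthifyMe’s actual code.

```python
import pandas as pd

class OutlierCapper:
    """Caps numeric columns at learned quantiles; reusable across models."""

    def __init__(self, columns, lower=0.01, upper=0.99):
        self.columns = columns
        self.lower = lower
        self.upper = upper

    def fit(self, X, y=None):
        # Learn per-column caps from the training data only,
        # so the same caps are replayed at inference time.
        self.caps_ = {
            col: (X[col].quantile(self.lower), X[col].quantile(self.upper))
            for col in self.columns
        }
        return self

    def transform(self, X):
        X = X.copy()
        for col, (lo, hi) in self.caps_.items():
            X[col] = X[col].clip(lo, hi)
        return X

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)

df = pd.DataFrame({"sessions": [1, 2, 3, 4, 100]})
capped = OutlierCapper(columns=["sessions"]).fit_transform(df)
```

Because the caps are learned in `fit` and replayed in `transform`, the same object drops into training and serving pipelines alike, which is what makes it reusable across models.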
  2. Writing both unit and integration tests (so what if it’s an AI model?). This is a vital practice, especially when the code is scattered across various modules and goes through constant iteration. A single module might work in isolation but fail when combined with the data pipelines.
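As a sketch of what this looks like in practice, here is a unit test for a single cleaning step plus a small integration-style test that chains it the way a pipeline would. The function `clean_ages` and its data are invented for illustration, assuming pandas.

```python
import pandas as pd

def clean_ages(df):
    """Unit under test: null out impossible ages, fill missing with the median."""
    df = df.copy()
    df.loc[(df["age"] < 0) | (df["age"] > 120), "age"] = None
    df["age"] = df["age"].fillna(df["age"].median())
    return df

def test_clean_ages_unit():
    # The module in isolation: every output age must be plausible.
    out = clean_ages(pd.DataFrame({"age": [25, -3, 200, None]}))
    assert out["age"].between(0, 120).all()

def test_pipeline_integration():
    # The same module chained the way the data pipeline calls it:
    # no rows silently dropped, and the output is model-ready.
    raw = pd.DataFrame({"age": [30, None], "purchased": [1, 0]})
    features = clean_ages(raw)
    assert len(features) == len(raw)
    assert features["age"].notna().all()

test_clean_ages_unit()
test_pipeline_integration()
```

The unit test catches a broken module early; the integration test catches the "works alone, fails in the pipeline" failure mode the point describes.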
  3. Gradually iterating the model. It is best to add one set of features at a time instead of throwing all the features in at once and expecting the model to do the magic. Gradual iterations give a greater understanding of feature importances. Releasing small models iteratively also keeps the stakeholders’ boat moving.
    At HealthifyMe, we decided to launch the initial version of our current model two months before New Year 2020, the peak period for the business, rather than waiting for it to be perfect, which would have come at a great opportunity cost.
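The gradual iteration above can be sketched as a loop that adds one feature group at a time and records the cross-validated score of each cumulative set. The toy data, group names, and model choice (scikit-learn’s LogisticRegression) are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 500
# Toy data: the "engagement" features drive purchases; "device" is pure noise.
X_groups = {
    "engagement": rng.normal(size=(n, 2)),
    "device": rng.normal(size=(n, 2)),
}
y = (X_groups["engagement"][:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

blocks, scores = [], {}
for group, X in X_groups.items():
    blocks.append(X)
    cumulative = np.hstack(blocks)
    # Score each cumulative feature set; a flat or falling curve means the
    # newly added group deserves scrutiny before it ships.
    scores[group] = cross_val_score(
        LogisticRegression(), cumulative, y, cv=5, scoring="roc_auc"
    ).mean()
print(scores)
```

Reading the score after each addition makes feature importance tangible, and every intermediate model is itself shippable, which is what keeps the stakeholders’ boat moving.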
  4. Documenting the tiniest bit of progress. The more attention given to details, the better. Sounds like a boring chore, but it helps tremendously while working on complex projects that involve too many moving pieces and uncertainties.
    For us, this made all interactions with the data engineering team super smooth and efficient. In retrospect, I feel that involving them more in the initial stages of the journey would have helped us further, as it would have given us an earlier understanding of their concerns.
  5. Behaving like an owner. You, and only you, know where things could get really messed up. Take responsibility and make sure that the business stakeholders, data engineers, and all other associated teams are in the loop throughout the process so that the work cycle remains unaffected. Business stakeholders can get frustrated very quickly in the face of ambiguity.
    Our team at HealthifyMe scheduled regular brainstorming sessions with the stakeholders. This was really valuable, as it helped maintain morale and a quick exchange of feedback. It also helped them plan their sprints and timelines better. This was a great lesson in managing the unknowns. Planning is a great life skill to have when your games are not so finite!
  6. Making sure that the retraining mechanism and model evaluation processes are automated (even if it means extending the development phase in your sprints). The complete cycle of pulling data, retraining the model, evaluating the iteration, and pushing to production should not require another data scientist to rework it from scratch.
    We were quick to learn from our mistakes when we realized how much of our engineering bandwidth had been spent just trying to crack the processes involved in the previous model.
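The cycle described in this point can be sketched as a single function that a scheduler (cron, Airflow, or similar) could run unattended. Every callable here is a named placeholder, not a real internal API, and the promotion threshold is an invented example.

```python
def retrain_cycle(pull_data, train, evaluate, deploy, min_auc=0.75):
    """One scheduled run of the pull -> retrain -> evaluate -> push cycle."""
    train_df, holdout_df = pull_data()   # data pulling
    model = train(train_df)              # retraining the model
    auc = evaluate(model, holdout_df)    # evaluating the iteration
    if auc >= min_auc:
        deploy(model)                    # pushing to production
        return "deployed", auc
    # Below the bar: hold the model back and let a human investigate.
    return "held_back", auc

# Wiring with stand-in callables; a passing model is promoted automatically.
status, auc = retrain_cycle(
    pull_data=lambda: ("train_df", "holdout_df"),
    train=lambda df: "fitted_model",
    evaluate=lambda model, df: 0.81,
    deploy=lambda model: None,
)
```

Keeping the promote-or-hold decision inside the cycle is what frees the next data scientist from reverse-engineering the process from scratch.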
  7. A/B testing the previous and new models. This is again a vital step in understanding the delta improvement. It’s good practice to compare each feature to judge the efficacy of the latest model. Building data algorithms is a serious mental investment with a lot of uncertainty in the results, so it is psychologically rewarding for data scientists to witness the potential impact.
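One conventional way to judge that delta is a two-proportion z-test on conversion counts from the two arms. The sketch below uses only the Python standard library, and the counts are made-up numbers.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z(conv_old, n_old, conv_new, n_new):
    """z statistic and two-sided p-value for the new model's conversion rate."""
    p_old, p_new = conv_old / n_old, conv_new / n_new
    pooled = (conv_old + conv_new) / (n_old + n_new)
    se = sqrt(pooled * (1 - pooled) * (1 / n_old + 1 / n_new))
    z = (p_new - p_old) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Made-up arms: old model converts 300/10,000 users, new model 360/10,000.
z, p_value = two_proportion_z(300, 10_000, 360, 10_000)
significant = p_value < 0.05  # True here: the lift is unlikely to be noise
```

A test like this turns "the new model feels better" into a number both the data scientist and the stakeholders can act on.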
  8. Keeping the model updated. Models go stale over time in an environment where business needs change rapidly. Product and strategic changes are frequent and inconsistent as we continuously strive to keep the user experience the top priority.
    To deal with this, we created monitors and alerts to flag when the model’s relevance drops and the approach needs tweaking or a complete revisit.
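One common shape for such a monitor is to compare the live score distribution against the training-time baseline with the Population Stability Index (PSI) and alert past a rule-of-thumb threshold. The bucketing and threshold below are conventional assumptions, not HealthifyMe’s internal values, and the samples are toy data.

```python
from math import log

def psi(expected, actual, buckets=10):
    """Population Stability Index between two score samples.

    A value above 0.2 is a common rule-of-thumb signal of drift.
    """
    lo, hi = min(expected), max(expected)
    edges = [lo + (hi - lo) * i / buckets for i in range(buckets + 1)]

    def frac(sample, a, b, last):
        # Fraction of the sample in [a, b); the last bucket absorbs overflow.
        hits = [x for x in sample if a <= x < b or (last and x >= b)]
        return max(len(hits) / len(sample), 1e-6)  # floor avoids log(0)

    total = 0.0
    for i in range(buckets):
        last = i == buckets - 1
        e = frac(expected, edges[i], edges[i + 1], last)
        a = frac(actual, edges[i], edges[i + 1], last)
        total += (e - a) * log(e / a)
    return total

baseline = [i / 100 for i in range(100)]              # training-time scores
live = [min(1.0, 0.5 + i / 200) for i in range(100)]  # live scores, shifted up
alert = psi(baseline, live) > 0.2  # True: time to revisit the model
```

Wiring a check like this into a scheduled job gives the "relevance is dropping" alert before the business feels it.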
  9. Being super conscious of the timeline. This is extremely important for projects that can paralyze you through over-analysis. Even if all the steps are crystal clear in your head, it’s wise to quote 2X of what you think the actual timeline should be.
    We made the mistake of quoting 0.5X and ended up taking 2X of the time it should have taken in an ideal scenario. A textbook example of mismanaging the expectations of the teams involved. It’s always smartest to give timelines for each iteration to avoid surprises.
  10. Keeping up the partnership with stakeholders. Proactive expectation management plays a pivotal role in an organization’s data projects. This point cannot be stressed enough and needs to be at the forefront at every stage of the work. Efficient stakeholder management should be one of a data scientist’s KPIs, alongside educating stakeholders on how to use the model’s insights to make decisions in different strategic scenarios.

Saurav Agarwal, Senior Data Scientist at HealthifyMe, published this blog in the hope of helping any organization that strives with its product, engineering, and data teams to understand the underlying intricacies of pursuing a data strategy, and effective ways to resolve them.
