5 Essential Management Strategies For A Data Science Project

Vachan Anand
Published in Analytics Vidhya · 8 min read · Jun 19, 2022

After graduating from one of the leading universities in Australia with a master’s degree in Data Science, I was confident that I was skilled enough to ace any data work I did. However, it took joining a consultancy to understand that my journey in the world of data was only just beginning.

Photo by Ferenc Horvath on Unsplash

In this blog, I will share how working with multiple clients has helped me understand the core differences between studying data science at a university and working on projects in the industry. We will look at the following topics to help you get your first data science gig off to a successful start:

  1. Project Discovery
  2. Expectation Management
  3. Project Lifecycle
  4. Building Baselines
  5. Building Minimum Viable Product (MVP)

The blog does not focus on the technical aspects of a data science project but on its management, which is usually not a part of the university curriculum. Without any further ado, let us start exploring the topics above.

Project Discovery

As someone starting a career in data science, it is challenging to resist the urge to dive deep into the data as soon as it is made available. Although some interaction with the data is necessary to understand the problem, the excitement of deep exploration at first sight of the data can crowd out the need to understand the problem itself. Instead, it is essential to be critical of the problem and refine the project scope before starting any development. The article linked below is an excellent example of how an unclear project scope can lead to grave consequences that a better initial discovery could have avoided.

A sloppy problem statement almost killed a man twice!

Photo by Tom Def on Unsplash

Click on the link to read more about it.

It is crucial to follow these steps when starting a data science project:

  1. Understand: Understand the problem that needs to be solved. As data scientists, we not only need to be aware of the technical challenges involved in the project but also the expectations involved in solving the problem. It includes making notes of business and project constraints such as budget, resources, time frames, accessible data sources, et cetera, as they affect the feasibility of the end product.
    Moreover, it is essential to understand the use of the final results while designing the project scope. The business may require additional capabilities such as a BI tool or web application to utilise the project results, increasing the build time.
  2. Refine: Based on the initially available information, such as the data, project constraints and desired outcome, it is crucial to judge whether the project can succeed within those constraints.
    Moreover, machine learning models are built to solve specific issues; therefore, for a generic use case, the project’s scope needs to be refined enough to support the analysis and produce reasonable models. Refining the scope may include researching similar projects to understand the challenges other data scientists have faced in the domain.

Expectation Management

Given the success of Amazon in recommending products and Tesla in driving cars, it is tempting to think of machine learning as a get-rich pill for the industry. More often than not, machine learning solutions are treated as a silver bullet for any and all problems, and hence come the unrealistic expectations associated with them. To understand this challenge, it is worth remembering that machine learning is still a very new area of application. Although some organisations have used it over the past few decades, it was not until the recent democratisation of the field that most organisations could hop into the new era of data and analytics. This makes it challenging for business stakeholders with little to no experience in the domain to comprehensively understand machine learning and data science, and the challenges associated with such projects.

Photo by ThisisEngineering RAEng on Unsplash

Therefore, expectation management is crucial, especially during the project’s initial phases. We can use the following steps to achieve it:

  • As data fanatics, we must support and promote data literacy amongst stakeholders. Data quality affects the time and money associated with completing a project successfully. For instance, a machine learning project may not be viable on a tight schedule if the underlying data is too messy. It is critical for a machine learning model to be built over a well-designed data platform for it to thrive, and it is the data scientist’s responsibility to make the stakeholder aware of these dependencies before starting the project.
  • “Data Science”, as the name suggests, is analogous to other fields of science, i.e. it involves repeated experimentation to build reliable models. Like the science behind developing a new drug, it includes analysing samples, testing hypotheses and R&D as part of model development, thereby increasing the time needed to build reliable systems.
  • The time required to complete a project also depends on the complexity of the problem. Therefore, the stakeholders need to be aware of the reasons for, and the risks and rewards associated with, the time spent on each phase of the project.

Project Lifecycle

A data science project is a blend of multiple components: analysing and visualising data, feature engineering, experimenting with models, training and hyperparameter optimisation. It is relatively easy to get lost in any of these areas because they have no natural stopping point.

For instance, there is no limit to the features we can create, the amount of data we can accumulate, or the analysis we can perform over a given data set. This lack of a finished state makes it challenging to track the project’s overall progress and manage stakeholders.

As technologists, we get carried away by the technology; however, a business usually operates by the law of money.

Unless the organisation’s niche is data science, most organisations do not appreciate pumping money into a project with no foreseeable tangible outcome. Therefore, in the industry, it is crucial to run machine learning projects in short cycles. It involves:

  • Building the machine learning model as a minimum viable product with limited capabilities, as opposed to perfecting it.
  • Delivering the model by integrating it with an application, such as Power BI, to derive early business value.
  • Improving the MVP with additional features, better parameters or a different model.
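As a sketch of the “deliver early value” step, the snippet below (all column names, the scoring rule and the file name are made up for illustration) scores records with a stand-in model and writes the results to a CSV that a BI tool such as Power BI could ingest on a schedule:

```python
import csv

# Stand-in "model": flag customers whose spend dropped more than 30% as churn risks.
# In a real MVP this would be the baseline ML model, not a hand-written rule.
def churn_risk(previous_spend, current_spend):
    return current_spend < 0.7 * previous_spend

customers = [
    {"id": "C001", "previous_spend": 120.0, "current_spend": 40.0},
    {"id": "C002", "previous_spend": 80.0, "current_spend": 95.0},
]

# Write scored records to a CSV the BI layer can pick up.
with open("churn_scores.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["id", "churn_risk"])
    writer.writeheader()
    for c in customers:
        writer.writerow({
            "id": c["id"],
            "churn_risk": churn_risk(c["previous_spend"], c["current_spend"]),
        })
```

The point is not the rule itself but the contract: as long as the model keeps producing this file, the model behind it can be improved in later cycles without touching the BI integration.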

Building Baselines

From a technical perspective, one of the things that has helped me out while building a machine learning model is not spending too much time perfecting a model.

Perfection is the Enemy of Progress — Winston Churchill

Instead, it is crucial for us to build a baseline model, much like a Proof-of-Concept, that serves as a benchmark. Furthermore, the baseline model can be improved over time by switching the models, incorporating additional features, introducing concepts like sampling, or anything else deemed necessary for the project.
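One simple, library-free way to set such a benchmark (the numbers here are toy data for illustration) is to always predict the training mean and then measure how much a candidate model improves on that:

```python
# Toy regression data: training targets and held-out (feature, target) pairs.
train_targets = [10.0, 12.0, 14.0, 16.0]
test = [(1.0, 11.0), (2.0, 13.0), (3.0, 15.0)]

def mae(predict, pairs):
    """Mean absolute error of a prediction function over (x, y) pairs."""
    return sum(abs(predict(x) - y) for x, y in pairs) / len(pairs)

# Baseline: always predict the training mean, ignoring the feature entirely.
mean_target = sum(train_targets) / len(train_targets)  # 13.0
baseline_mae = mae(lambda x: mean_target, test)

# Candidate: a hand-fitted linear rule standing in for a trained model.
candidate_mae = mae(lambda x: 9.0 + 2.0 * x, test)

print(f"baseline MAE:  {baseline_mae:.2f}")
print(f"candidate MAE: {candidate_mae:.2f}")
```

Any model that cannot beat the mean predictor is not worth shipping, which makes the baseline a cheap go/no-go signal for the rest of the project.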

Photo by Aron Visuals on Unsplash

Essentially, a machine learning pipeline can be considered as an interaction of several microservices — for instance, a microservice for generating features and another for cleaning data. Once a baseline model is ready, we can focus on individual components independently.
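A minimal way to picture this decomposition (all function names and fields are illustrative) is a pipeline of small, independently replaceable stages:

```python
# Each stage is a small, self-contained function -- the "microservice" of the text.
def clean(records):
    """Drop records with missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def featurize(records):
    """Derive a ratio feature from the raw fields."""
    return [{"ratio": r["clicks"] / r["views"]} for r in records]

def predict(features):
    """Stand-in model: threshold on the engineered feature."""
    return ["high" if f["ratio"] > 0.1 else "low" for f in features]

def pipeline(records):
    # Stages communicate only through their inputs and outputs, so any one
    # of them can be improved or swapped without touching the others.
    return predict(featurize(clean(records)))

raw = [
    {"clicks": 30, "views": 100},
    {"clicks": 2, "views": 100},
    {"clicks": None, "views": 50},  # dropped by clean()
]
print(pipeline(raw))  # prints ['high', 'low']
```

Once the baseline works end to end, improving the project means replacing one stage at a time, exactly as with independently deployed microservices.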

We can quickly achieve a baseline model using any of the following ways:

  1. Using a third-party out-of-the-box solution reduces the project’s build and delivery time.
    For instance, on a cloud platform such as AWS, we can use SageMaker Pipelines for orchestration, Data Wrangler to analyse and clean the data, and pre-built container images for common models.
  2. Building a custom model. If the niche nature of the problem requires a custom model, baseline models can be built in Jupyter notebooks. However, it is recommended to timebox the tasks so as not to run the project over budget.
  3. Building hybrid solutions. There might be instances where it is worth spending most of the time creating a single component and offloading other tasks to third-party services. For example, we might wish to build a custom machine learning model but not focus on hyperparameter optimisation, given the repetitive and time-consuming nature of that task. In such a case, we can outsource it to a third-party service such as SageMaker’s hyperparameter tuning jobs. This helps us build baseline models quickly, saving project resources and time.

Building Minimum Viable Products (MVP)

One of the critical challenges for a data science project is the underlying infrastructure required to support the analysis and models. As data science projects often process enormous datasets, they usually need expensive hardware and software to support the project’s development and application.

Photo by Amélie Mourichon on Unsplash

Therefore, building an MVP out of the baseline model is recommended. It has the following benefits:

  1. Building an MVP helps the stakeholders estimate the overall cost of running the project in production.
  2. It starts delivering valuable insights to the business, enabling stakeholders’ trust in the project.
  3. Additionally, it helps the solution architects to update the infrastructure, product design and project estimates based on a light version of the project.
  4. Most importantly, it gives the data scientists an idea of how to model the results of a machine learning project so that the business can utilise it effectively.

Conclusion

Data science projects are notorious for being a black box, which leads to challenges such as unrealistic expectations from the business and its stakeholders. However, data scientists can reduce these challenges through effective project management. This blog looked at some of the methodologies any data scientist can incorporate into their next project to mitigate them.

Vachan Anand is a consultant with an interest in Data Science, Data Engineering and Cloud Technology.