Before writing the first line of code in a Data Science Project

Pratim
5 min read · Aug 17, 2020

In data science today, cleaning data and building, tuning, and deploying models are usually seen as the primary aspects of a project. Surprisingly, a few other key areas tend to go unnoticed during the initial phase of a data science project. Here is an attempt to note down some that I have recently encountered and that I feel are important for the successful delivery of a project.

Inception of the project:

Engaging with Business Stakeholders:

Business analysts and senior data science leads are in continuous talks with stakeholders in different verticals such as supply, finance, IT, and HR, to keep themselves updated on business processes and current challenges in those departments. During these discussions, they propose plausible AI/ML solutions to the existing challenges. This is a key function of an organization’s analytics department, because gaining the stakeholders’ confidence and showcasing the capabilities of the team play a huge role in the successful delivery of projects. Once the stakeholders are convinced of the possible solutions, the project gets their sign-off, and hey presto! We now have a viable project in hand.

Use case building:

Define Problem Statement:

This is exactly what the name suggests. Defining the problem statement neatly is an important part of the whole project life cycle. The focus should be on defining it simply, so that even a non-technical person can understand it. This takes good writing skills: the problem should be captured in a few short, crisp statements without missing any important attribute. The problem statement should also be signed off by the stakeholders at the beginning of the project and should typically not change across the project life cycle.

Define Scope:

Defining the scope early and clearly goes a long way toward the successful delivery of the project. We should also pay special attention to what is out of scope and document it. Getting the stakeholders’ sign-off on the scope should not be skipped at this stage.

Assumptions:

We should state the assumptions of the project explicitly. Even something that seems obvious and clearly understood by everyone during meetings should be recorded here. Stakeholders and technical people have vastly different backgrounds, so something very obvious to the business might not be obvious from the data scientist’s perspective, and vice versa. This may lead to undesired results if the assumptions are not clearly spelled out.

Current/Proposed Process:

We need to understand the current workflow of the system: who the key players are, how the system currently behaves, and where the current impediments lie. At this point it is crucial to draw the current architecture of the system we are dealing with, and to show where the proposed AI/ML solution fits into that architecture and how it will impact the business.

World of Data:

Data Availability:

Try to find out how the final live data will be made available to the proposed system. The key point is that we should not think only about the minimum viable product (MVP), which may currently be built on static data shared by the business. We should also start thinking about how the MVP can be scaled and industrialized to the next level so that it helps the business in production systems. Many MVPs fail to scale because this step gets the least attention at an early stage, and scaling later becomes a humongous task.

At this stage we should also get answers to a few more data-related questions and be confident about them:

What are the current data sources?

What are the current data generation processes?

What are the current data storage and management processes?

What are the current data collection processes, and how frequently is data collected?

Which data tables are involved, and what types of relationships exist among them?

Which data fields are involved?

Finally, we should build a data dictionary with an entry for each data field involved, e.g. Name, Description, Data Type, Mandatory/Optional, Remarks, etc. This document is a savior at various points in the project and should be built in close discussion with the business and with their sign-off.
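As a rough illustration, a data dictionary can also be kept in a machine-readable form so that incoming records can be checked against it. The sketch below is only one possible layout, using the columns listed above. The field names (order_id, order_qty, promo_code), the FieldSpec class, and the validate_record helper are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One row of the data dictionary."""
    name: str
    description: str
    data_type: type
    mandatory: bool
    remarks: str = ""


# Hypothetical fields for illustration; real entries come from the business.
DATA_DICTIONARY = [
    FieldSpec("order_id", "Unique identifier of an order", str, True),
    FieldSpec("order_qty", "Units ordered", int, True, "Must be positive"),
    FieldSpec("promo_code", "Promotion applied, if any", str, False),
]


def validate_record(record, dictionary):
    """Return a list of human-readable problems found in one record."""
    problems = []
    for spec in dictionary:
        value = record.get(spec.name)
        if value is None:
            # Optional fields may be absent; mandatory ones may not.
            if spec.mandatory:
                problems.append(f"{spec.name}: mandatory field is missing")
            continue
        if not isinstance(value, spec.data_type):
            problems.append(
                f"{spec.name}: expected {spec.data_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return problems


print(validate_record({"order_id": "A1", "order_qty": 3}, DATA_DICTIONARY))
print(validate_record({"order_qty": "3"}, DATA_DICTIONARY))
```

Keeping the dictionary in one place like this means the same definitions that the business signed off can later drive simple validation of new data.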

Feasibility study:

Once we get the data from the business, we assess the feasibility of the project. More often than not, the shared data does not contain data points related to what the business wants to solve. The feasibility study is therefore a key step in the life cycle, and we conduct it by running a data quality check that covers a few parameters.

A few standard data quality checks:

Completeness: Ensuring that there are no undesired gaps in the data.

Accuracy: The data collected is correct and accurately represents what it should.

Timeliness: How up to date is the data? Does the data depict the problem?

Consistency: The data should follow the same format throughout.

Relevancy: Is the available information useful for the problem statement?
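Some of these checks can be partly automated. The sketch below is a minimal, standard-library-only illustration of completeness, consistency, and timeliness on a tiny made-up sample. The rows, field names, and helper functions are all assumptions for the example, and in practice a library such as pandas would likely be used instead. Accuracy and relevancy usually need domain knowledge and discussion with the business, so they are not covered here.

```python
from datetime import date, datetime

# Made-up sample standing in for data shared by the business.
rows = [
    {"order_id": "A1", "order_date": "2020-08-01", "order_qty": 5},
    {"order_id": "A2", "order_date": "2020-08-03", "order_qty": None},
    {"order_id": "A3", "order_date": "03/08/2020", "order_qty": 2},
]


def completeness(rows, field):
    """Fraction of rows where the field is present and not None."""
    filled = sum(1 for r in rows if r.get(field) is not None)
    return filled / len(rows)


def is_iso_date(text):
    """True if text parses as a YYYY-MM-DD date."""
    try:
        datetime.strptime(text, "%Y-%m-%d")
        return True
    except (TypeError, ValueError):
        return False


def consistency(rows, field, predicate):
    """Fraction of non-missing values that match the expected format."""
    values = [r[field] for r in rows if r.get(field) is not None]
    if not values:
        return 1.0
    return sum(1 for v in values if predicate(v)) / len(values)


def staleness_days(rows, field, as_of):
    """Days between the newest parseable date and a reference date."""
    dates = [
        datetime.strptime(r[field], "%Y-%m-%d").date()
        for r in rows
        if is_iso_date(r.get(field))
    ]
    if not dates:
        return None  # nothing parseable, timeliness cannot be assessed
    return (as_of - max(dates)).days


print("completeness of order_qty:", completeness(rows, "order_qty"))
print("consistency of order_date:", consistency(rows, "order_date", is_iso_date))
print("staleness in days:", staleness_days(rows, "order_date", date(2020, 8, 17)))
```

Here one missing quantity lowers completeness, and one date in a different format lowers consistency. Numbers like these give the feasibility discussion with the business something concrete to start from.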

Success Criteria

Defining agreed success criteria is a must during the initial discussions with the project stakeholders.

Current performance level:

We should try to obtain the current performance level of the process from the business and aim to improve on it with the proposed AI/ML solution. The business might not have such numbers, or might not be comfortable sharing them, but this should still be called out during business discussions.

Expected performance level:

Here we define, and explain to the business, what the KPI will be and the expected value that will determine the success of the project. Some common KPIs are MAPE, accuracy, precision, and log loss. An example of a success criterion is as follows:

Accepted success criterion from the business for the current MVP prediction model: precision should be greater than 0.85.
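Such a criterion is easy to turn into an explicit check. The sketch below computes precision for a binary classifier and compares it with the 0.85 threshold from the example. The labels are made up for illustration, and the function names are hypothetical rather than taken from any particular library.

```python
PRECISION_THRESHOLD = 0.85  # the agreed value from the example criterion


def precision(y_true, y_pred, positive=1):
    """Precision = true positives / all predicted positives."""
    true_pos = sum(
        1 for t, p in zip(y_true, y_pred) if p == positive and t == positive
    )
    predicted_pos = sum(1 for p in y_pred if p == positive)
    if predicted_pos == 0:
        raise ValueError("no positive predictions; precision is undefined")
    return true_pos / predicted_pos


def meets_success_criterion(y_true, y_pred, threshold=PRECISION_THRESHOLD):
    """True when precision is strictly greater than the agreed threshold."""
    return precision(y_true, y_pred) > threshold


# Made-up labels: 7 positive predictions, 6 of them correct.
y_true = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]

print("precision:", round(precision(y_true, y_pred), 3))
print("criterion met:", meets_success_criterion(y_true, y_pred))
```

Writing the criterion down this precisely, including whether the comparison is strict, helps avoid disagreement later about whether the MVP passed.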

I also remember a saying from one of my colleagues: “A project can be successful from a data scientist’s perspective but may not be the same from the business side.” So defining the problem statement crisply and the success criteria clearly plays a crucial role in the successful delivery of a data science project.

Solution Approach:

Now the core work starts for the data scientists, who experiment with various solutions to meet the business success criteria.

Summary

In conclusion, during the initial stages of building a minimum viable product there are crucial non-coding areas, as discussed above, where data scientists should spend some quality time to ensure the successful delivery of a project.
