Learnings from handling a Data Science Project

This blog post outlines my experience of handling my first data science capstone project for a highly reputed client. The opportunity came through my university, where I am pursuing a Master's in Data Science and Innovation (MDSI). As part of the Innovation Lab (iLab) subject, we engage with an industry partner for a period of three months to address their problem statements and data challenges, which gives us the opportunity to play the role of a data scientist. In this post I cover what I hope to achieve through the professional showcase component of iLab, how I intend to go about getting it done, and what I have already done as part of this experience.

My Learning Goals

  • Understand the problem before attempting to solve it; at the moment, the most significant challenge is setting clear objectives.
  • Learn to code and build data models in a Pythonic way.
  • Perform data processing and wrangling.
  • Improve my data visualization and data storytelling skills.
  • Communicate and collaborate effectively.
  • Stay aligned with stakeholder values and the behaviour of the business.

Client Objective

The project assigned to me has two objectives: first, to glean insights and build a strategic sales dashboard in a visualization tool of my choice to show product performance across several factors (I chose Tableau); second, to predict product revenue for the next five years.

Setting the Scope

This project equipped me with project management skills: not simply accepting or nodding along to everything the client asks for, but being very conscious of time and of what can be delivered. It is all about refining the scope to fit a realistic space, and it gave me a pathway to exhibit my leadership skills.

Framework/Methodology

The CRISP-DM methodology provides a structured approach to planning a data mining project. I selected and adopted this framework for my project as I am familiar with the approach.

Data Preparation

Based on my past learnings and thoughts shared by executives, this is the most vital part of any data science project: the outcomes of the data model depend heavily on this stage, and nearly 80% of the time is spent on data preprocessing. Excel and R have been great companions for my cleaning activities all this while. Tableau also released its Prep tool this year, which eases data preprocessing to a certain level with drag-and-drop features, and it helped me speed up my delivery considerably.
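To give a flavour of this step, here is a minimal pandas sketch of the kind of cleaning I mean. The file name and the column names ("Order Date", "Product", "Revenue") are hypothetical stand-ins rather than the client's actual schema.

```python
import pandas as pd

# Hypothetical sales extract; file and column names are illustrative.
df = pd.read_csv("sales_extract.csv")

# Parse dates and standardise the product names.
df["Order Date"] = pd.to_datetime(df["Order Date"], errors="coerce")
df["Product"] = df["Product"].str.strip().str.title()

# Drop rows whose date could not be parsed; treat missing revenue as zero.
df = df.dropna(subset=["Order Date"])
df["Revenue"] = df["Revenue"].fillna(0)

# Aggregate to monthly revenue per product, ready for modelling.
monthly = (df.set_index("Order Date")
             .groupby("Product")["Revenue"]
             .resample("MS")
             .sum())
```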

Time Series Modelling

Time series provides the option to predict future values. Based on historical values, time series can be used to forecast trends in economics, weather, and several other fields. One of the methods available in Python to model and predict future points of a time series is known as SARIMAX, which stands for Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors. Here, we will primarily focus on the ARIMA component, which is used to fit time-series data to better understand and forecast future points in the time series.
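As a rough illustration of the statsmodels API (the series name and the (p, d, q)(P, D, Q, s) orders below are placeholders, not the parameters I finally used):

```python
import statsmodels.api as sm

# "monthly_revenue" is a hypothetical pandas Series of monthly revenue
# indexed by date; the orders shown are illustrative starting values.
model = sm.tsa.statespace.SARIMAX(
    monthly_revenue,
    order=(1, 1, 1),               # non-seasonal ARIMA(p, d, q) part
    seasonal_order=(1, 1, 1, 12),  # seasonal part with a 12-month cycle
    enforce_stationarity=False,
    enforce_invertibility=False,
)
results = model.fit()
print(results.summary().tables[1])  # table of fitted coefficients
```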

When fitting seasonal ARIMA models (and any other models, for that matter), it is important to run model diagnostics to ensure that none of the assumptions made by the model have been violated. The plot_diagnostics method allows us to quickly generate model diagnostics and investigate any unusual behavior.
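Assuming `results` is the fitted model from the sketch above, a single call produces the standard four diagnostic panels (standardized residuals, histogram plus KDE, normal Q-Q plot, and correlogram):

```python
import matplotlib.pyplot as plt

# Residuals should look like white noise if the model fits well.
results.plot_diagnostics(figsize=(15, 12))
plt.show()
```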

I started by comparing my predicted values to the actual values of the time series, which helps us understand the accuracy of our forecasts. I made extensive use of the pandas and statsmodels packages, ran model diagnostics, and produced forecasts of product revenue using the time series.
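A sketch of that comparison, again assuming the hypothetical `monthly_revenue` series and fitted `results` from above; the cut-off date is illustrative:

```python
import pandas as pd

# One-step-ahead, in-sample predictions from an illustrative cut-off date.
pred = results.get_prediction(start=pd.to_datetime("2017-01-01"), dynamic=False)
forecast = pred.predicted_mean
actual = monthly_revenue["2017-01-01":]

# Mean squared error between the forecasts and the actuals.
mse = ((forecast - actual) ** 2).mean()
print(f"Mean squared error of the forecasts: {mse:.2f}")
```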

Here are a few other things I tried:

  • Trying more combinations of parameters to see whether they improved the goodness of fit of the model.
  • Using the AIC measure to find the best model (see the sketch after this list).
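The parameter search itself can be done with a small grid over the candidate orders, keeping the fit with the lowest AIC. This is a hedged sketch of that idea, not my exact search space:

```python
import itertools
import statsmodels.api as sm

# Small illustrative grid over (p, d, q) and seasonal (P, D, Q, 12).
p = d = q = range(0, 2)
pdq = list(itertools.product(p, d, q))
seasonal_pdq = [(P, D, Q, 12) for (P, D, Q) in pdq]

best_aic, best_orders = float("inf"), None
for order in pdq:
    for seasonal_order in seasonal_pdq:
        try:
            fit = sm.tsa.statespace.SARIMAX(
                monthly_revenue,  # hypothetical series from earlier
                order=order,
                seasonal_order=seasonal_order,
                enforce_stationarity=False,
                enforce_invertibility=False,
            ).fit(disp=False)
        except Exception:
            continue  # some combinations simply fail to converge
        if fit.aic < best_aic:
            best_aic, best_orders = fit.aic, (order, seasonal_order)

print("Best model:", best_orders, "with AIC:", best_aic)
```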

Challenges Faced

Currently, the project requires predicting product-wise revenue for the next five years, taking into consideration some external factors as well as the historical data, so that customer segmentation can be done accordingly. I initially considered a random forest for the prediction, but after some research and model evaluation, it turned out that time series modeling offered better advantages. One of my key learnings is that a minimum of 20 data points is required to perform time series modeling. The main challenges were in preprocessing, choosing the right data model, handling missing values, playing around with pandas DataFrames, and extracting the output of the prediction in Python.
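For the last of those challenges, extracting the prediction output, the pattern that worked for me (sketched here with the earlier hypothetical `results` object; the column and file names are my own choices) is to pull the mean forecast and its confidence interval into one DataFrame:

```python
# Five years of monthly forecasts (60 steps ahead).
forecast = results.get_forecast(steps=60)

out = forecast.conf_int()                   # lower/upper revenue bounds
out["predicted_revenue"] = forecast.predicted_mean
out.to_csv("revenue_forecast.csv")          # hand-off for the dashboard
```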

Accomplishments

1) A time series model was built in Python to predict the monthly average revenue over a five-year period. This will help the Sales team project their monthly and yearly growth, product-wise.

2) Data exploration on the dataset gave the client insight into industry-wise and state-wise product usage. Insights, findings, and product performance were illustrated in a dashboard and presented to the client team. In their feedback during the presentation, the client mentioned that this was something new for them, as they had never looked at usage at the industry and state level before.

Current Profile

I am a little over a year into the data science field. At the moment, my job does not entail a lot of data science; it is more data analysis and reporting. That being said, in the next couple of months I aim to plunge fully into data science as a career. I have to build my portfolio and showcase my technical competencies.

Strategy for landing the sexiest job of the 21st century

1. Writing blogs

The best way to promote or showcase yourself is by having a blog presence, and I have started my first public blog with this in mind. Another good thing about the MDSI course is that it requires you to blog regularly on its internal CIC Around platform, where feedback from students and staff helps you develop your story further.

2. Participate in Data Science Competition

Kaggle is a favorite one-stop destination for datasets and kernels for many data science enthusiasts, and it runs several challenges with prize money. I have participated in three Kaggle competitions so far: Lending Club, US Flight Delay, and US Permanent Visa predictions. I wrote my kernels in R, but I have always admired the many kernels written in Python; now, post iLab, I can write kernels in Python seamlessly.

3. Engage with data science community

I have been a very heavy consumer of blogs, technical forums like Stack Overflow, and meetups. But it is time for me to start contributing back to this community by presenting at these meetups, which is another way of getting into the limelight. Be a regular seeder, not only a leecher!

4. GitHub Repository

Project code should be maintained in a GitHub repository and well documented. The best developers around the world have public GitHub repositories too. If you have been coding, this gives a technical recruiter an idea of how you approach a problem, and it is an excellent platform to store and share your source files.

Wrapping up

Now it’s time to start building a portfolio! I should now have a good idea of what skills to demonstrate in my portfolio, and how to go about building it.