Author's Note: This piece was originally published by the author in November 2020 on translatingnerd.com

You want to impress future employers by creating a dope end-to-end machine learning project, fully kitted out with a web scraping data collection strategy, a deep-dive exploratory phase, followed by a sick feature engineering strategy, coupled with a stacked-ensemble method as the engine, and polished off with a sleek front-end microservice fully deployed on the cloud. You have a plan, you have the boot camp/degree program under your belt, and you have $100 of AWS credits saved from that last Udacity course. You fire up your laptop, spin up Jupyter Notebook in local mode, and log into AWS. You then draw a blank. A blinking cursor in your notebook next to “[Ln] 1:”. The coding log-jam is akin to writer’s block: the proverbial trap of following too many MOOCs, a deep hole of despair. What do you do?

Entering the wilderness

Photo: https://unsplash.com/@the_bracketeer

When venturing out into the wild beyond official coursework, boot camp code-alongs, and the tutorials of Massive Open Online Courses (MOOCs), taking that first step into uncharted territory to create your first end-to-end project can be scary. Much of the time, it is difficult to know where to start. When re-skilling, up-tooling, or revamping our way into data science, we tend to get distracted by the latest and greatest in algorithmic development. End-to-end machine learning projects rarely leverage the most complicated algorithm in academia. In fact, many of the machine learning systems being built at major companies around the world are slight deviations from the tried-and-true approach of what we see as “standard data science” pipelines. So why put pressure on yourself to apply the latest cutting-edge ML algorithm when you are just starting out?

Knowing how to leverage data science tutorials is your first step

All about tutorials

The approach that I like to take when learning something new and wanting to try it on my own use case normally follows a four-step pattern:

  1. Find a tutorial
  2. Follow said tutorial
  3. Re-follow tutorial with your own data
  4. Customize your pipeline with your own bling

Excellent places to start for tutorials include YouTube video walk-throughs, the blogs that often accompany them, and vendor-curated examples such as AWS’s SageMaker tutorials.

Complete the tutorial

Follow along with the tutorial. YouTube is often a great place to really get to know the flow and how to interact with the tech stack and dataset you are using. I always like to find video tutorials that have accompanying blogs as well. Much of the time, the author will guide you through the data science pipeline while referring to the documentation they created in the blog. AWS does a fantastic job of this, especially with the curated videos that follow their SageMaker examples.

Change the data


Once you feel comfortable with the data science approach being taught and are able to understand all the code and dataset particulars, it is time to bring in your own data. Your data should mirror the data that the tutorial is using. For example, if the tutorial takes a regression-based approach but you are bringing in churn-prediction data (a yes/no outcome), then you should rethink your target variable strategy. Try to find data that fits the algorithm family you are working with: classification problems go with classification tutorials, regression with regression, and the same for unsupervised learning problems.
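To make that concrete, here is a minimal sketch in Python of what swapping in your own data might look like, assuming a churn-style CSV with a binary “churned” column. The file name, column names, and model choice are placeholders for whatever your tutorial and dataset actually use:

```python
# Minimal sketch: swap your own data into a classification tutorial.
# "my_churn_data.csv" and "churned" are placeholder names.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("my_churn_data.csv")

# Churn is a yes/no outcome, so it belongs with a classification tutorial,
# not a regression one. Sanity-check the target before going further.
assert df["churned"].nunique() == 2, "Expected a binary target for classification"

# Keep only numeric features for this sketch; categorical columns would
# need encoding first.
X = df.drop(columns=["churned"]).select_dtypes(include="number")
y = df["churned"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Reuse whatever model the tutorial used; the point is that only the data changed.
model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)
print(f"Hold-out accuracy: {model.score(X_test, y_test):.3f}")
```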

Plus it up and go into the wild


You are now at the point where you can begin adding a custom flavor to the pipeline. You have already succeeded in bringing in your own data; now it is time to put the pieces together into a true end-to-end project. If the tutorial you are following only covers the stretch from EDA (exploratory data analysis) to the evaluation criteria for the machine learning algorithm’s predictions, or maybe you are learning the front-end component of deploying Flask or Django on EC2, then this is the perfect opportunity to spice things up!
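If the serving side is the piece you want to add, a bare-bones Flask microservice is one place to start. The sketch below assumes you have already pickled a trained model to “model.pkl”; the endpoint name and payload shape are illustrative, not part of any particular tutorial:

```python
# Bare-bones Flask prediction service of the kind you might run on EC2.
# Assumes a scikit-learn-style model saved to "model.pkl".
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

with open("model.pkl", "rb") as f:
    model = pickle.load(f)


@app.route("/predict", methods=["POST"])
def predict():
    # Expect a JSON body like {"features": [[5.1, 3.5, 1.4, 0.2]]}
    payload = request.get_json()
    preds = model.predict(payload["features"])
    return jsonify({"predictions": preds.tolist()})


if __name__ == "__main__":
    # For a real deployment you would sit this behind gunicorn/nginx;
    # the built-in server is fine for kicking the tires.
    app.run(host="0.0.0.0", port=5000)
```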


Try to think about what end-to-end really means. Where does the data come from? How is it collected? Can you track down an API to bring in the data on a schedule? If no API exists, can you scrape it? Can you automate that scrape with a cron job? Once the data is in, can you write functions that perform the EDA sections for you and automate the output? Can you do a deeper dive and create a story around the EDA that you are digging into? Once you have finished your EDA, what feature creation can you do? Can you bring in another dataset and stitch those sources together? In other words, can you add a data engineering component?
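As one illustration of what that automation could look like, here is a small Python script you might point a cron job at. The API URL, file paths, and schedule are all assumptions for the sketch, not a prescription:

```python
# Sketch: pull data on a schedule and write a repeatable EDA summary.
#
# Example crontab entry to run it every morning at 6am:
#   0 6 * * * /usr/bin/python3 /home/ubuntu/pipeline/pull_and_profile.py
from datetime import date

import pandas as pd
import requests


def pull_data(url: str) -> pd.DataFrame:
    """Fetch today's records from a (hypothetical) JSON API."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return pd.DataFrame(response.json())


def profile(df: pd.DataFrame, out_path: str) -> None:
    """Write a quick EDA summary instead of eyeballing it by hand each run."""
    with open(out_path, "w") as f:
        f.write(f"Rows: {len(df)}, Columns: {df.shape[1]}\n\n")
        f.write("Missing values per column:\n")
        f.write(df.isna().sum().to_string())
        f.write("\n\nNumeric summary:\n")
        f.write(df.describe().to_string())


if __name__ == "__main__":
    df = pull_data("https://example.com/api/records")   # placeholder URL
    df.to_csv(f"data/raw_{date.today()}.csv", index=False)
    profile(df, f"reports/eda_{date.today()}.txt")
```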

As you can see, no matter the tutorial that you are following, there are always areas for improvement. There are always ways that you can get your feet wet and then really take off with your own touch. Once you have created this pipeline, think of the story that you want to tell about it. Is this something that I can talk about in a future interview? Is this something that I need to communicate to non-technical members of my team back at work to show that I am ready for sponsorship to the next level? How can I tell the story of what I have done? Can I write a blog piece about this and share it with the world?

One final thought

Throughout the data science process, we learn from others. As always, properly credit those who came before you and were influential in your work. If a tutorial helped you through a particularly challenging section, include a link to that tutorial in your notebook. Just as we try to move our careers and curiosities forward, it is paramount that we give a bow to those who conducted the science before us.


Nicholas Beaudoin

Nicholas is an accomplished data scientist with 10 years in federal and commercial consulting practice. He specializes in ML operations (MLOps).