3 Areas Missing From Data Science Courses You Should Know


By Richard Freeman, PhD

With a PhD in machine learning and 18 years of applied experience, I get asked by aspiring data scientists what they can do to upskill. Given that I act as a mentor, advisor, and CTO, with knowledge of big data and machine learning engineering, AWS cloud computing, and natural language processing (NLP), I wanted to share my learnings and insights with a wider audience.

The first thing I see is that they focus only on courses or tutorials with toy data problems and ML / NLP / Deep Learning (DL) models. That is a good start, but insufficient, as there are three important areas that you won’t typically find in data science courses or in some university degrees. The reason might be that they are harder to teach and need industry experience. Here I’m not talking about clean data science tutorials or research problems, but the typical real-world business data science problems most organisations have.

Data Preparation

You will often hear that data science is the sexiest job of the 21st century, but you hear less about the 80% of a data scientist’s time spent on data preparation: the conveniently and often forgotten side, as it’s hard to market and hype, yet it is an essential foundation that has to be done.

What is data preparation? Well, it could be anything from scraping, cleaning, normalising, joining, filling in missing data, vectorising, pivoting, or labelling the data sets. This is important because the whole point of using ML over rule-based algorithms is that you train and test on the data. Without clean data, even the most advanced DL models will not perform.

One issue is that the datasets out there are generally already clean and come with a train/test split, which gives the illusion that you can simply apply your ML/NLP/DL models directly with minimal preparation. In the real world the data is very dirty and needs to be joined and enriched, or composite features need to be created.

What can I do?

Become an expert at data preparation with Pandas, and PySpark if you want to scale. Why hand the power to fix the data, and a dependency, over to someone else?
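To make this concrete, here is a minimal Pandas sketch of the kind of preparation described above: cleaning text fields, parsing dates, filling missing values, joining two sources, and building a composite feature. The file names and columns are hypothetical.

```python
import pandas as pd

# Hypothetical raw exports; the files and column names are illustrative only.
customers = pd.read_csv("customers.csv")  # id, name, country, signup_date
orders = pd.read_csv("orders.csv")        # customer_id, amount, order_date

# Normalise messy text fields.
customers["name"] = customers["name"].str.strip().str.title()
customers["country"] = customers["country"].str.upper().replace({"UK": "GB"})

# Parse dates, coercing bad values to NaT instead of failing.
customers["signup_date"] = pd.to_datetime(customers["signup_date"], errors="coerce")

# Fill missing numeric values with a sensible default (the median here).
orders["amount"] = orders["amount"].fillna(orders["amount"].median())

# Join the two sources and create a composite feature.
df = orders.merge(customers, left_on="customer_id", right_on="id", how="left")
df["days_since_signup"] = (pd.to_datetime(df["order_date"]) - df["signup_date"]).dt.days

# Drop duplicates and rows that are still unusable after cleaning.
df = df.drop_duplicates().dropna(subset=["signup_date"])
```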

If you look at what data analysts do, they clean the data in Excel, SQL, or ETL tooling to create analytics dashboards; it’s a similar task to the one you want to do, but ideally in code and at scale. Sometimes you can, for example, use rule-based methods or NLP to clean names, addresses, or dates. This will need to be done whether you want to produce analytics reports for clients or a train/test/validation set for a data science model.

Don’t underestimate the power of SQL in your toolkit: with it you can easily join tables yourself and not rely on others to query and extract the data. I often find it quicker to query Amazon Redshift directly with SQL than to write Python code for the same job.
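As a rough illustration of pushing the work down into SQL, here is a join and aggregation done in the database rather than in Python glue code. An in-memory SQLite database stands in for a warehouse like Redshift, and the tables are made up.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for a real warehouse; tables are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER, country TEXT);
    CREATE TABLE donations (user_id INTEGER, amount REAL);
    INSERT INTO users VALUES (1, 'GB'), (2, 'FR');
    INSERT INTO donations VALUES (1, 10.0), (1, 25.0), (2, 5.0);
""")

# One SQL join and aggregation replaces several lines of Python.
query = """
    SELECT u.country, COUNT(*) AS n_donations, SUM(d.amount) AS total
    FROM donations d
    JOIN users u ON u.id = d.user_id
    GROUP BY u.country
"""
print(pd.read_sql_query(query, conn))
```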

Domain Knowledge

Although many of a data scientist’s skills are portable across industry sectors, you will still need to acquire industry-specific knowledge to be effective.

Imagine, for example, that you want someone to create a data science model that predicts stock market prices, and that you put down a year’s worth of salary as the investment, so that the model decides which stocks to buy, sell, or short. Now imagine that you asked a junior data scientist with no experience in investment banking or financial services to do this. How confident would you be that the model will pick the best stocks, given that your annual salary is at risk?

This is why, without domain knowledge, your models will most likely not be accurate or make sense. How will you be able to validate and measure them? Do you really know what your customers and users want? Sometimes the business will already have rule-based algorithms that perform better and are much more explainable, and these can serve as a baseline for your inferences.
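To make the baseline idea concrete, here is a small sketch for a made-up churn problem: a one-line rule from a domain expert whose accuracy any ML model should beat before it earns its keep. The data and the 30-day threshold are invented purely for illustration.

```python
import pandas as pd

# Invented churn data; columns and values are illustrative only.
df = pd.DataFrame({
    "days_since_last_login": [2, 45, 90, 5, 10],
    "churned":               [0,  1,  1, 0,  1],
})

# A domain expert's rule: anyone inactive for 30+ days is predicted to churn.
rule_pred = (df["days_since_last_login"] >= 30).astype(int)

# An ML model must beat this explainable baseline to justify its complexity.
accuracy = (rule_pred == df["churned"]).mean()
print(f"Rule-based baseline accuracy: {accuracy:.0%}")  # 80% on this toy data
```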

You will also see that, unless you are in a research organisation, the main thing the business focuses on is return on investment (ROI): in my view this basically means saving time, saving money, or making money. Map these back to the data-science-powered inferences you are making. I explain some of this, and when you should and should NOT use AI, based on my experience.

Technology and data science are among the novel enablers of those ROI metrics. Unless you have a research focus, most companies won’t mind whether you are using the latest deep learning or reinforcement learning; they will favour a pre-trained or simpler model that goes to market quicker and generates an ROI, especially as costs can be high when you use large-scale GPU servers and cloud computing for training.

What can I do?

In my view you should be passionate about the area or domain you work in; this will also help you become an expert. When you select an industry, dive into the domain knowledge. Whether it is FinTech, InsurTech, AdTech, AgroTech, MedTech, or PropTech, look at existing data models, customer requirements, terminology, and the application of analytics and data science.

Mine is healthcare, as I think it has a massive positive impact on humanity, and the use cases are wide open and challenging.

Ways to acquire these skills include working with a domain expert, reading lots of domain-specific content, and gaining experience dealing with industry-specific data.

Software Engineering

Now that you have the data preparation and domain knowledge, the last pillar is software engineering.

Well done, you can execute sample scikit-learn and Keras code locally on toy data in a Jupyter notebook! This is useful for insights, exploratory data analysis, and presentations, but the true ROI comes when you move your models into a pipeline, product, or service used directly by customers or users. For example, when a user visits your website, you make a real-time recommendation of the most relevant content, which leads to higher user retention, click-through rate, or purchases. To do this properly you need to write, test, and deploy your code and model like good developers do. Your code will have to scale.
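As one possible shape for this, here is a minimal sketch of serving a trained model behind an HTTP endpoint with Flask. It is only a starting point: the model file and feature format are hypothetical, and a production service would add input validation, authentication, monitoring, and model versioning.

```python
import pickle

from flask import Flask, jsonify, request

app = Flask(__name__)

# "model.pkl" is a hypothetical fitted estimator, e.g. from scikit-learn.
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()  # expects e.g. {"features": [0.2, 1.5, 3.0]}
    prediction = model.predict([payload["features"]])
    return jsonify({"prediction": prediction.tolist()})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```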

For example, when I was leading the data science team at JustGiving, we had models making real-time predictions for 26M users; the code was deployed as a service on a highly available and scalable architecture with ultra-low latency.

When you operationalise ML models, many elements come into play, like APIs, model versioning, model accuracy monitoring, feature stores, data flows in and out, etc. Also, if you are after a minimum viable product (MVP), you will most likely use pre-trained models and/or existing inference APIs, which will require development skills.

What can I do?

If you have only done data science courses, and especially if you are not from a computer science background, you will have had limited exposure to development or software engineering. Work with good developers and upskill yourself with developer skills, including testing, design patterns, Docker, cloud platforms like AWS, and CI/CD. A lot of the core skills are related to data and machine learning engineering; for me, those are the people in short supply, NOT data scientists. The reason, I think, is that the barrier to entry is high for someone not from a developer or computer science background. Either way, the better your code is, the quicker you can prepare the data or solve actual use cases.
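Testing is often the easiest developer habit to adopt first. Below is a sketch of a unit test for a small, hypothetical cleaning helper; save it as test_cleaning.py and run it with pytest.

```python
import pandas as pd

def clean_country(series: pd.Series) -> pd.Series:
    """Upper-case country codes and map legacy values (illustrative helper)."""
    return series.str.strip().str.upper().replace({"UK": "GB"})

def test_clean_country_normalises_codes():
    raw = pd.Series([" uk ", "fr", "GB"])
    expected = pd.Series(["GB", "FR", "GB"])
    pd.testing.assert_series_equal(clean_country(raw), expected)
```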

Summary

In my view the data science field is so popular because it seems open to all and sexy, but when you dig deeper and work in the real world, you will see that you need extensive data preparation skills, domain knowledge, and software engineering skills, which are often forgotten but needed. Working on these three pillars will make you a better data scientist, especially as the field matures, use cases become more complex, and you work in teams. To learn more about my views, have a read of my other posts, such as Recommendations for Working in Data Science, AI and Big Data Based on my Personal Experience. Good luck, and message me if you have questions or comments.

Published in TDS Archive

An archive of data science, data analytics, data engineering, machine learning, and artificial intelligence writing from the former Towards Data Science Medium publication.


Written by Dr Richard Freeman

Author, Advisor, Co-founder & CTO @ Vamstar, a Series A-funded startup. Tech4good enthusiast.
