How I Went from Engineer to Data Scientist

Joshua Phuong Le · Published in MITB For All · 10 min read · Mar 20, 2024

I. INTRODUCTION

Recently I had some catch-up sessions with some of my old colleagues. Some expressed interest in picking up data analytics skills, either to supplement their work or to start transitioning into this field. Others were simply curious how I moved into data science from an unrelated field. I faced similar challenges and questions when starting from ground zero: I was unsure what to focus on, because job advertisements, forums and articles mention so many skills, ranging from databases and programming languages to business intelligence and machine learning.

With a background in Chemical Engineering from my undergraduate years, I did not have any data-related skills, nor did I do any programming work early in my career. What I did gain from those years, however, was a problem-solving mindset: always think about the problem, break it down into solvable chunks, and tackle them in the right order of priority.

Now, just shy of three years into the transition to data science and after many rounds of learning and re-learning, I hope to share some personal experience of applying the engineer’s problem-solving mindset to a sequence of challenges that I think any beginner will naturally encounter as they progress in the field (I will use the word “natural” repeatedly for this reason). With this, I hope to provide some insights that may help you navigate the many buzzwords.

Disclaimer: I don’t aim to solve any specific data science problem in this article. Rather, I hope to highlight the essential groups of skills that a beginner in data science should focus on to accelerate their learning. I also intentionally omit topics that are not urgent for beginners, such as big data processing and MLOps; they are more advanced and should be picked up once the foundational skills are well developed. If you need more technical advice on the different data science skillsets, please read the excellent article from James below.

To summarize the next section, here are the key points:

Problem 1: How to build up the data analysis foundation with the least learning curve?
Solution 1: Use UI-based software like Excel to train your “sense of data”.

Problem 2: How to make the data analysis steps above reproducible?
Solution 2: Pick up a programming language (like Python) in an interactive environment.

Problem 3: How to handle increasingly complex code?
Solution 3: Apply good code writing and management practices — OOP, execution environment, Git.

Problem 4: How to track and manage ML experiments and their artifacts?
Solution 4: Use MLflow tracking.

Problem 5: How to scale and collaborate ML workloads?
Solution 5: Migrate to a cloud platform.

II. THE PROBLEM-SOLVING ORIENTED JOURNEY

1. Developing a “Sense” of Data with UI-Based Software

The first problem for me was how to build foundational knowledge of handling tabular data with as gentle a learning curve as possible. I wanted to focus on the nature of the skillset, not the tools.


With this objective, the first tool that I, and I believe most of us, was exposed to for any form of data analysis was Excel. In fact, my first data analyst internship was heavy on Excel, plus another UI-based tool from SAS called JMP Pro. Personally, I really valued these tools because they greatly helped me develop my sense of tabular datasets: how to manipulate them, and how to slice, aggregate and pivot different columns and groups to meet business requirements. JMP Pro even had common data science tools like clustering and regression. These tools were easy to pick up and filled the skill gap while I was not yet comfortable with programming languages like Python, SQL or R.

I always kept in mind the limitations of these tools, such as the limited range of available extensions and the difficulty of repeatability and automation (you can script them via VBA or SAS’s own scripting language, but learning those would have taken me time that was more wisely spent picking up more popular languages like Python and SQL).

In addition, thanks to this better sense of data, I could easily reproduce the manipulation steps above once I moved to proper programming languages.

Hence, I would encourage everyone not to shy away from these tools when you are just starting your career in data. Do well with them, deliver business impact, and carry the transferable skills with you when you transition to programming languages.

2. Write Python Code in Notebooks for Exploratory Data Analysis

The next problem naturally followed from the first one above: how to reproduce my analyses with Python code, avoiding the need to click through the same steps in the UI.

The tool that I believe most Python learners will use from day one is the Jupyter Notebook. It is really useful to data analysts because it is a read–eval–print loop (REPL) environment, taking separate commands and returning their outputs within the same notebook file. By this nature, it supports exploratory data analysis very well, as you want to see how the data or the visualizations change after each cleaning, processing and plotting step.
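As a rough sketch of what this cell-by-cell rhythm looks like (the file name and column names below are hypothetical), each step’s output is inspected before moving on, much like the UI-based manipulations from earlier:

```python
import pandas as pd

# Load the raw data (hypothetical file and column names)
df = pd.read_csv("sales.csv")
df.head()  # inspect the first rows in one cell

# Clean: drop duplicates and rows with missing revenue
df = df.drop_duplicates().dropna(subset=["revenue"])
df.describe()  # check the distributions in the next cell

# Aggregate: the same slice/pivot logic you would do in a UI tool
df.pivot_table(index="region", columns="product", values="revenue", aggfunc="sum")
```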

To many analysts, using this tool well could be sufficient for their day-to-day tasks (so long as they clean up the code for readability and maintainability). Perhaps you just need to rerun the notebooks or modify the data-reading steps to get the next batch of results for your business.

Note that I would strongly encourage the use of modern IDEs like VS Code and PyCharm, even just to run notebooks, thanks to their quality-of-life features such as code navigation, assisted formatting, Copilot integration, cloud integration extensions, etc.

3. Apply Better Code Writing and Management Practices

a. Object Oriented Programming (OOP) and Codebase Structure

Everything was fine for me until I wanted to write more complex code and reuse it. The next problem was that the notebooks from #2 above grew out of control: there was too much repeated code, and it was hard to maintain.

Many steps can be reused across different objects such as dataframes, lists, etc. So instead of repeating the code, it is smarter to put it into functions and classes in external modules, so that your main notebook or application file can simply import them and apply them to the data.

Because of this motivation, I needed to pick up OOP (to a reasonable level of competency). I also needed to learn how modules and packages work in order to house these functions and classes, and then, naturally, how to organize the codebase into a reasonable structure, as I previously wrote about (link below).

My usual workflow is to use the notebook to test any new functionality, move it to an external module afterwards, and test it again by importing it back into the notebook to ensure it behaves as expected. Steadily, I refactored all the code from notebook experiments into structured codebases, which is usually the format required by automation tools and agents.
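As a small, hypothetical illustration (the module, class and column names are made up), a cleaning step that started life in notebook cells might end up in a module like this:

```python
# cleaning.py -- a hypothetical module refactored out of notebook cells
import pandas as pd


class SalesCleaner:
    """Reusable cleaning logic that used to be copy-pasted across notebooks."""

    def __init__(self, required_columns: list[str]):
        self.required_columns = required_columns

    def clean(self, df: pd.DataFrame) -> pd.DataFrame:
        # Drop duplicates and rows missing any required column
        return df.drop_duplicates().dropna(subset=self.required_columns)
```

Back in the notebook, `from cleaning import SalesCleaner` followed by `SalesCleaner(["revenue"]).clean(raw_df)` confirms that the refactored version behaves the same as the original cells.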

b. Handling Execution Environment

Then, as I started dealing with my own packages, I realized that the versions of the other libraries and packages I depended on were also important for reproducing results without conflicts. This is essentially the idea behind controlling the execution environment of your code, which is achieved with Conda environments or virtual environments. I also wrote another article in the past on this topic below.
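As a minimal sketch of that workflow using Python’s built-in venv (the package list here is just an example):

```bash
# Create and activate an isolated environment for the project
python -m venv .venv
source .venv/bin/activate

# Install only what the project needs, then pin the versions
pip install pandas scikit-learn mlflow
pip freeze > requirements.txt

# Later, or on another machine, reproduce the same environment
pip install -r requirements.txt
```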

Once you do this, you will realize that writing a pure Python “data application” is within your reach. You can have a “main” Python file (or notebook) that pieces the different data manipulation and analysis steps together, where each step is executed by calling your custom modules/packages and other libraries.
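A minimal sketch of such a “main” file, reusing the hypothetical cleaning module above plus some made-up ingestion and reporting helpers:

```python
# main.py -- hypothetical "data application" that pieces the steps together
import pandas as pd

from cleaning import SalesCleaner  # hypothetical module from the earlier sketch


def load_data(path: str) -> pd.DataFrame:
    # Step 1: ingestion
    return pd.read_csv(path)


def summarize(df: pd.DataFrame) -> pd.DataFrame:
    # Step 2: aggregation for reporting
    return df.groupby("region")["revenue"].sum().reset_index()


if __name__ == "__main__":
    raw = load_data("sales.csv")
    clean = SalesCleaner(["revenue"]).clean(raw)
    summarize(clean).to_csv("report.csv", index=False)
```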

This idea of piecing logical steps together becomes very useful later when you want to use the “machine learning pipeline” concept in cloud ML platforms like Azure Machine Learning. Similarly, the idea of controlling your execution environment is crucial when you move your workloads to the cloud.

c. Use Code Version Control


After doing all the hard work above, my next problem was how to store this nicely refactored code somewhere safe for future use and reference. Git was the only answer. I learnt how to use GitHub to perform basic actions like commit, push and pull, through which I could sleep well at night knowing that my code was stored in a safe place. Moreover, I could start reading other people’s code in open-source projects and tutorials and accelerate my learning.
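For reference, the basic cycle I mean is nothing more than this (the repository URL is a placeholder):

```bash
git init                          # start tracking the project
git add .
git commit -m "Refactor notebook code into modules"
git remote add origin https://github.com/<your-username>/<your-repo>.git
git push -u origin main           # back the code up on GitHub
git pull                          # pull in changes made elsewhere
```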

Once you have picked up the skills in this section, you will have enough “housekeeping” and organizational foundations to help businesses with more mature software development standards.

4. Using Model Tracking Server for Model Experiments

Then comes the favorite and “shiniest” part of a data scientist’s job: developing ML models.

During the early stages of ML model development, I underestimated the challenge of model comparison and reproducibility, and I overestimated my ability to keep track of it all in my head. I could not have been more wrong!

There could be dozens or hundreds of experiment trials, each with different model types, hyperparameters, objective functions and even input datasets. The combinations are endless. Without a good experiment and model management tool, you can easily get lost in your own hard work.

Thus, I would encourage everyone to pick up MLflow as soon as they pick up any ML framework such as scikit-learn, XGBoost or a deep learning library. Learn how to manage experiments by logging your parameters, metrics and input data. Then learn how to manage models from these different frameworks through their MLflow “flavors”. You can definitely start with a local MLflow tracking server, and when you move your workload to the cloud, the migration is usually seamless. You can then enjoy the various model deployment methods on these platforms, which usually work best when the models are produced through the MLflow process.
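A minimal sketch of what that logging looks like with MLflow and scikit-learn (the experiment name and the synthetic data are made up for illustration):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic data standing in for a real dataset
X, y = make_classification(n_samples=1000, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

mlflow.set_experiment("churn-baseline")  # hypothetical experiment name

with mlflow.start_run():
    params = {"n_estimators": 200, "max_depth": 5}
    model = RandomForestClassifier(**params).fit(X_train, y_train)

    mlflow.log_params(params)  # hyperparameters of this trial
    mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))
    mlflow.sklearn.log_model(model, "model")  # the scikit-learn "flavor"
```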

5. Moving to a Cloud Platform for ML Workloads

After achieving a reasonably controlled and streamlined data science workflow from the previous sections, the next problem for me was how to automate, scale and collaborate on data science workloads.


Again, in this day and age, the natural next step was to move my ML workloads to the cloud (I use Azure extensively due to my day job). The greatest benefit, in my opinion, is the ability to easily provision resources to meet your demands, such as storage, scalable compute (including GPU and Spark), integration with Git providers, and auxiliary utilities like secret management and container services. In addition, you only pay for what you use, and you can easily stop or remove unwanted resources.

Moreover, an MLflow tracking server is provisioned out of the box in cloud ML platforms like Azure Machine Learning, Amazon SageMaker and Databricks. This means you can use (almost) the same MLflow code you previously ran against your local tracking server to track and manage experiments and their artifacts. Moving to the cloud also makes it easier to share and collaborate on your data science tasks.
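Conceptually, the migration can be as small as pointing the same logging code at the managed tracking server; the URI below is a placeholder, and how you obtain it (portal or SDK) depends on the platform:

```python
import mlflow

# Placeholder URI: copy the real one from your cloud ML workspace
mlflow.set_tracking_uri("azureml://<region>.api.azureml.ms/mlflow/v1.0/<workspace-path>")

mlflow.set_experiment("churn-baseline")
with mlflow.start_run():
    mlflow.log_metric("accuracy", 0.93)  # exactly the same API as with the local server
```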

Last but not least, these cloud ML platforms offer various model deployment options that abstract away much of the hardcore software development and network administration work from data scientists. You can easily deploy your model to a REST endpoint for live inference, or to a batch endpoint for asynchronous inference, or even combine several models and other data manipulation steps into a complete ML pipeline. All of this can be achieved by a single data scientist, which solves the problem of scarce developer resources in many companies.

III. ENDING

The short time since I transitioned into the world of data science has been nothing short of amazing. My journey above is of course nowhere near complete, as there are so many other things I want to learn to broaden and deepen my experience. However, I believe that with the foundation above, anyone can accelerate their learning and contributions. Last but not least, the most important elements of a successful transition are the curiosity to learn new things to solve your immediate problems, and the patience to build up your skills from the simplest ones.

Disclaimer: All opinions and interpretations are that of the writer, and not of MITB. I declare that I have full rights to use the contents published here, and nothing is plagiarized. I declare that this article is written by me and not with any generative AI tool such as ChatGPT. I declare that no data privacy policy is breached, and that any data associated with the contents here are obtained legitimately to the best of my knowledge. I agree not to make any changes without first seeking the editors’ approval. Any violations may lead to this article being retracted from the publication.


Joshua Phuong Le
MITB For All

I’m a data scientist having fun writing about my learning journey. Connect with me at https://www.linkedin.com/in/joshua3112/