How I switched from Software Engineering to Data Science …

A few useful practical tips

Rajesh Srivastava
8 min read · Mar 16, 2023
Machine Learning Pipelines (image by author)

Transitioning from software engineering to data science can be both challenging and rewarding.

While I deeply appreciate the deterministic nature of software engineering tasks and still get involved in them frequently, data science tasks carry a higher risk of failure due to a range of external factors in addition to individual skill levels (take, for example, the 2021 Zillow Offers debacle caused by its AI-driven pricing algorithm). Ultimately, the decision to switch positions is a subjective one and depends entirely on an individual’s preferences and career goals.

Making the transition from software engineering to data science may require additional education or training in subjects such as statistics, mathematics, and programming languages like Python or R.

It’s understandable to feel intimidated by topics like probability, linear algebra, calculus, and statistics, particularly for those who haven’t studied them in years or had the opportunity to apply them in a practical setting. However, with dedication and effort, individuals can acquire the necessary skills and knowledge to succeed in data science.

For those with a strong software engineering foundation, the transition can be smoother as they already have the skills to develop complex algorithms, manage large datasets, and create scalable solutions.

Through my years of experience working in full-stack software architecture and development, I’ve discovered that many skills required in software engineering are also incredibly valuable in a full-stack data science role.

A few years ago, when I began my journey in Data Science, I decided to enroll in a popular Machine Learning course offered on Coursera, which was taught by Andrew Ng. As I worked through the course, I quickly realized how amazing this field truly was.

The way the course was structured and taught really inspired me, especially as a beginner. One of the most encouraging aspects of the course was Andrew’s approach to teaching, which he often summed up with the phrase,

“Don’t worry about it if you don’t understand it.” 😀

This phrase helped me to stay motivated and not get discouraged when faced with difficult concepts or challenges. It also reinforced the idea that Data Science is a continuous learning process, and that it’s okay to make mistakes and learn from them. Overall, the course helped me to develop a passion for Data Science, and I’m grateful for the knowledge and skills that I gained from it.

I used to think he simply didn’t want to intimidate new learners with the burden of mathematical concepts, which I assumed were must-know topics for entering a Data Science role. But I learned the reason for his emphasis the hard way when I got stuck on a few complex concepts for a long time and couldn’t make much progress initially. In this and subsequent articles, I’ll share my insights on how to overcome that effectively.

To structure my learning and implementation process, I broke it down into the generic Data Science steps below and tackled them one at a time. I also set a time-bound goal of 10–12 months to stay focused and avoid procrastination. What helped me the most was reading blogs and Kaggle use cases, watching YouTube videos, then practicing and repeating the same cycle. Persistence and perseverance are key.

The steps that I followed in my Data Science journey are condensed into the following topics. While I’d love to cover all of them in detail in a single article, doing so would make it too crowded and difficult to follow.

Therefore, I plan on writing separate articles for each topic that will include technical details, resources for learning, and tips and tricks to help you along the way. Stay tuned for more insights and information!

1. Programming Fundamentals

1.1 Learn core Python well and become proficient in at least these libraries: Pandas, NumPy, Scikit-learn, SciPy, Matplotlib, Seaborn, and PyTorch (for deep learning).

1.2 Learn PySpark (very important when handling Big Data and in real-time use cases) and SQL.

In my opinion, Software Engineers have an advantage in learning these topics. With the exception of Data Science libraries, most other topics are related to generic software engineering. This makes it one of the easiest and quickest steps to move on to the next level.
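To give a flavour of the day-to-day fluency worth building, here is a minimal Pandas/NumPy sketch; the toy sales data is invented purely for illustration:

```python
import numpy as np
import pandas as pd

# A toy dataset: daily unit sales for two hypothetical stores
df = pd.DataFrame({
    "store": ["A", "A", "B", "B"],
    "units": [10, 12, 7, 9],
    "price": [2.5, 2.5, 3.0, 3.0],
})

# Vectorised column arithmetic (NumPy under the hood, no Python loop)
df["revenue"] = df["units"] * df["price"]

# Split-apply-combine: total revenue per store
totals = df.groupby("store")["revenue"].sum()
print(totals["A"])  # 55.0
```

Patterns like vectorised arithmetic and groupby aggregation come up constantly in real projects, so they are worth practicing until they feel automatic.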

2. Maths & Statistics

2.1 Mostly pre-university-level Linear Algebra, Probability, and some Calculus

2.2 Statistics

2.2.1 Hypothesis testing (p-value)

2.2.2 Normal Distribution, Binomial Distribution, Bernoulli Distribution, Poisson Distribution

2.2.3 Probability Density Function (PDF), Cumulative Distribution Function (CDF)

2.2.4 Standard Deviation & Variance, Percentiles, Quartiles, and Inter Quartile Range (IQR)

2.2.5 Measure of Central Tendency (Mean, Median, Mode)

2.2.6 Covariance & Correlation

2.2.7 Central Limit Theorem

2.2.8 Conditional Probability & Bayes Theorem

The topics covered in this step are crucial as they teach you to think like a Data Scientist. They form the foundation of Data Science and Machine Learning tasks. Personally, I found it easier to understand the topics than to apply them in practical steps. It takes time to get into the groove, but once you do, there will be no stopping you.
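To make a few of the statistics topics concrete, here is a minimal SciPy sketch covering hypothesis testing (2.2.1) and some descriptive measures (2.2.4, 2.2.5); the sample is synthetic and the population mean of 5.0 is an assumption chosen for the example:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Synthetic sample drawn from a normal distribution with true mean 5.0
sample = rng.normal(loc=5.0, scale=1.0, size=200)

# 2.2.1 Hypothesis testing: H0 says the population mean is 5.0;
# a large p-value means we fail to reject H0
t_stat, p_value = stats.ttest_1samp(sample, popmean=5.0)

# 2.2.4 / 2.2.5: dispersion and central tendency
mean, median = sample.mean(), np.median(sample)
std, var = sample.std(ddof=1), sample.var(ddof=1)
q1, q3 = np.percentile(sample, [25, 75])
iqr = q3 - q1
```

Running the same test on data drawn from a different mean would drive the p-value toward zero, which is exactly the intuition behind rejecting a null hypothesis.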

3. Data Engineering Skills

3.1 Handling Streaming and Batch data

3.2 Understanding of ETL tools based on their uses (like Glue, Athena, Step Functions, etc in the case of AWS cloud as infrastructure)

3.3 Handling Data ingestion, Data cleaning, Data pre-processing, and Big Data

3.4 Essentials of Data lakes (e.g. S3, ADLS) and Data warehouses (e.g. Snowflake, Amazon Redshift)

In general, software engineers probably have a basic understanding of the topics mentioned above. However, it’s crucial to understand that working with data is a major component of a Data Scientist’s job: together with Exploratory Data Analysis, it often consumes up to 60–70% of a project’s timeline.

In reality, it’s common to encounter messy data that lacks clear sources or data dictionaries. As a result, Data Scientists spend a lot of their time trying to make sense of these complex and sometimes confusing data sets.
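As a small taste of 3.3, here is a minimal Pandas sketch of chunked batch ingestion followed by typical cleaning steps; the messy CSV is invented to mimic the kind of undocumented export described above:

```python
import io

import pandas as pd

# A hypothetical messy CSV export, arriving without a data dictionary
raw = io.StringIO(
    "id,amount,region\n"
    "1,100,EU\n"
    "2,,EU\n"      # missing amount
    "3,250,us\n"   # inconsistent casing
    "1,100,EU\n"   # duplicate row
)

# Batch ingestion in chunks keeps memory bounded for large files
chunks = pd.read_csv(raw, chunksize=2)
df = pd.concat(chunks, ignore_index=True)

# Typical cleaning: deduplicate, normalise categories, impute missing values
df = df.drop_duplicates()
df["region"] = df["region"].str.upper()
df["amount"] = df["amount"].fillna(df["amount"].median())
```

In a real pipeline the same steps would run inside an ETL job (e.g. Glue or Spark), but the logic of deduplication, normalisation, and imputation is identical.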

4. Exploratory Data Analysis

4.1 Imputing Missing Data, Dealing with Imbalanced Data

4.2 Dimensionality Reduction (PCA, t-SNE, LDA, etc)

4.3 Visualization — Box plot, Scatter plot, Histogram, Heatmap, Bar and Pie charts, etc.

4.4 Handling Outliers, Binning, Transforming, Encoding, Scaling, Normalisation, and Shuffling

4.5 Feature engineering (including a few of the steps above) and feature scaling

As discussed earlier, this step is a critical component of any Data Science project, alongside Data Engineering. It serves as the foundation for the project and is essential to complete before moving on to the ML Modelling phase.
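Here is a minimal scikit-learn sketch of the preprocessing steps in 4.1 and 4.4 (imputation, scaling, and encoding) wired into one pipeline; the feature table is invented for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# A hypothetical feature table with a missing value and a categorical column
X = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Paris", "Delhi", "Paris", "Tokyo"],
})

numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # 4.1 imputing missing data
    ("scale", StandardScaler()),                   # 4.4 scaling
])
preprocess = ColumnTransformer([
    ("num", numeric, ["age"]),
    ("cat", OneHotEncoder(), ["city"]),            # 4.4 encoding
])

# One scaled numeric column plus three one-hot city columns
Xt = preprocess.fit_transform(X)
```

Building these steps as a single pipeline (rather than ad-hoc scripts) pays off later, because the exact same transformations can be reapplied at inference time.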

5. Model development

5.1 The intuition of different ML Algorithms and their Implementation

5.1.1 Supervised — Linear Regression, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Ensemble learning (Bagging, Boosting (popular: XGBoost, LightGBM, etc.), Stacking), etc

5.1.2 Unsupervised — K-means clustering, k-nearest neighbors (KNN), Hierarchical clustering, Principal Component Analysis (PCA), Anomaly detection, Neural Networks, etc

5.1.3 Reinforcement Learning — Conceptual understanding

5.2 Hyperparameter tuning

5.3 Bias & Variance trade-off

5.4 Model evaluation KPIs like Confusion Matrix, Precision, Recall, F1 score, ROC-AUC curve, RMSE, MSE, MAPE, R-squared, etc

5.5 Model Experimentation and auditability.

This part of Data Science is often the most intriguing to beginners and draws a lot of interest. To effectively work in this field, it’s important to have a solid understanding of the basic machine learning algorithms and the intuition behind them.

After you get the basics, dive into the math behind these algorithms for a deeper understanding.

It’s also crucial to familiarize yourself with popular Python Data Science libraries and learn how to implement them effectively. In future articles, we will explore these topics in more detail.
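As a minimal sketch of 5.1 and 5.4 together, the following trains a Random Forest and computes the evaluation KPIs; the dataset is synthetically generated, not a real use case:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic binary-classification data stands in for a real business problem
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 5.1.1: an ensemble (bagging) model
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# 5.4: evaluation KPIs
cm = confusion_matrix(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
```

Swapping in a different estimator is a one-line change, which is why building intuition about when each algorithm fits matters more than memorising any single API.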

6. Model Deployment

6.1 Model Deployment in production

6.2 MLOps (ML pipelines and CI/CD pipelines)

6.3 Live & Batch Inference

Deploying Machine Learning models in a production environment is a crucial skill for full-stack Data Science roles. With companies increasingly using Machine Learning to tackle key business challenges, even smaller organizations are seeking this expertise.

This area is where Software Engineers typically excel, and it’s becoming increasingly important as MLOps (Machine Learning Operations), similar to DevOps, gains popularity. MLOps separates Core Data Scientists from Full Stack Data Scientists, as it requires a different set of skills and knowledge.
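A minimal sketch of 6.1 and 6.3: serialising a trained model as an artifact with joblib, then exposing hypothetical `predict_live` and `predict_batch` helpers. In production these would sit behind an API endpoint or a scheduled job; the names and structure here are illustrative only:

```python
import os
import tempfile

import joblib
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# 6.1: train a model and serialise it as a deployable artifact
X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000).fit(X, y)

artifact_path = os.path.join(tempfile.gettempdir(), "model.joblib")
joblib.dump(model, artifact_path)

# A serving process loads the artifact once at startup
served = joblib.load(artifact_path)

def predict_live(features):
    """6.3 Live inference: score a single incoming request."""
    return int(served.predict([features])[0])

def predict_batch(rows):
    """6.3 Batch inference: score many rows in one call."""
    return served.predict(rows).tolist()
```

The key design point is the separation: training produces an immutable artifact, and the serving code only ever loads and scores, which is what makes CI/CD for models tractable.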

7. Model Monitoring

7.1 Monitoring for Data Drift, Concept Drift, Model Performance Drift, and Feature & Bias Drift

7.2 A/B Testing

As previously mentioned, this step is also an essential part of MLOps and is crucial for monitoring the model and data lineage in real time. This helps to detect any model or data drift, and if necessary, the model can be retrained effectively, or a notification can be sent to the relevant team for further resolution.

By continuously monitoring the model and data in real-time, it’s possible to ensure that the model remains accurate and effective and that any changes or issues are identified and addressed promptly.
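As a minimal sketch of 7.1, the following detects data drift in a single feature with a two-sample Kolmogorov–Smirnov test; the distributions and the `has_drifted` helper are invented for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)

# Feature distribution captured at training time vs. the live stream,
# where the mean has shifted by 1.5 standard deviations
train_feature = rng.normal(loc=0.0, scale=1.0, size=1000)
live_feature = rng.normal(loc=1.5, scale=1.0, size=1000)

def has_drifted(reference, live, alpha=0.05):
    """Flag data drift when a two-sample KS test rejects
    the hypothesis that both samples share a distribution."""
    return bool(ks_2samp(reference, live).pvalue < alpha)

print(has_drifted(train_feature, train_feature))  # False
print(has_drifted(train_feature, live_feature))   # True
```

In a real monitoring pipeline a check like this would run per feature on a schedule, with a drift flag triggering retraining or an alert to the owning team, as described above.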

8. Business Acumen & Domain Knowledge

One of the key differences between Software Engineers and Data Scientists is their ability to understand the domain of use cases and communicate effectively with different stakeholders.

A good Data Scientist must not only possess strong technical skills but also be able to comprehend the business context of their work and explain their findings to non-technical stakeholders.

This skill is what sets exceptional Data Scientists apart from average ones. Effective communication ensures that the insights derived from data analysis can be put into practice and create value for the organization.

Conclusion

The content of Full Stack Data Science roles may seem difficult to understand when you first encounter them. However, if you approach them systematically, you can make the learning process much more manageable. It’s important to remember that some of the steps in this process involve Data Engineering, Model Deployment, and Model Monitoring, which are typical components of Full Stack Data Science roles.

Therefore, it’s highly recommended that you focus on MLOps, since it is a strength of Software Engineers and is becoming more common in Full Stack Data Science roles. By doing so, you will be well-equipped to succeed in these roles and stay ahead of the curve in this rapidly evolving field.

Happy Learning !!

If you found this article helpful, please consider clapping and sharing it with others who might benefit from it. Your support can help this content reach the relevant audience and provide them with valuable insights and knowledge🙂

P.S. — The experience I am sharing here and in subsequent articles is based solely on my own journey. I started with naïve use cases and later worked on multiple business-critical end-to-end machine-learning solutions with larger clients. Here, I am covering the data science/ML tasks that address most customers’ pain points, where deep learning and large language models may not be required. Deep learning, reinforcement learning, and the now-famous large language models deserve separate discussions in later articles.
