Software Engineering for AI/ML/Data Science Projects

Sep 19, 2020

Preface

Artificial Intelligence (AI), Machine Learning (ML), and Data Science (DS) are the buzzwords of the Information Technology (IT) industry. The majority of enterprises are moving from the Proof of Concept (POC) stage to the production and monetization of AI/ML/DS solutions. Because of the nature of the work, team composition, skill requirements, and core AI/ML/DS development differ somewhat from traditional software development. Data exploration, data engineering, experimentation, and specialized tools such as Jupyter Notebook add to the complexity. When we move from experiment to production and monetization, it is essential to look at AI/ML projects from a Software Engineering perspective. This line of thinking is not new in the industry. Pioneers such as Google and Microsoft have already published research findings and points of view on it. This article series is an attempt to discuss the Software Engineering needs of AI/ML in detail.

Because of increased awareness, hype, and business importance, the AI/ML component gets the most attention in any IT project. Yet once the highly scientific and painstaking process of model building is complete, the model becomes a small component of the IT system. A whole set of other system components makes up the more comprehensive solution. This is where Software Engineering marries AI/ML.

Notebooks to Deployable Code

Data Scientists love to build models in notebooks. Most of the time, a notebook and its associated artifacts become a puddle of CSV files and DataFrames. And yes, we must not forget the glue code and the numerous matplotlib/seaborn plots. The third and most exciting puddle is pickles: a deployment-ready notebook served with pickles!

Different IT roles will see this from different perspectives, but the concerns they raise are similar: missing coding standards, hardcoded values, a lack of reusable functions, zero documentation, and a lack of traceability. As an enterprise marches towards maturity in the AI/ML/DS space, it is essential to establish a Software Engineering practice. The question is which persona should be responsible for it: the Data Scientist, the Data Engineer, or the Machine Learning Engineer? And what are the high-level focus areas and checkpoints that reduce or eliminate the puddles?
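As a minimal sketch of what moving from notebook style to deployable code can look like, the example below takes a step that a notebook would typically write inline with hardcoded values and turns it into a documented, reusable function. The names `ScalingConfig` and `min_max_scale` and the sample numbers are hypothetical, chosen only for illustration; they do not come from any particular project.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class ScalingConfig:
    """Hypothetical settings that a notebook cell would usually hardcode."""
    lower: float = 0.0
    upper: float = 1.0


def min_max_scale(
    values: Sequence[float], config: ScalingConfig = ScalingConfig()
) -> List[float]:
    """Rescale values linearly into [config.lower, config.upper].

    Raises ValueError on empty input. A constant column has no spread,
    so every value is mapped to config.lower.
    """
    if not values:
        raise ValueError("values must not be empty")
    smallest, largest = min(values), max(values)
    span = largest - smallest
    if span == 0:
        return [config.lower for _ in values]
    width = config.upper - config.lower
    return [config.lower + (v - smallest) / span * width for v in values]


if __name__ == "__main__":
    # Same logic, two configurations, no edits to the function body.
    print(min_max_scale([10.0, 15.0, 20.0]))
    print(min_max_scale([10.0, 15.0, 20.0], ScalingConfig(lower=-1.0, upper=1.0)))
```

The point of the sketch is not the scaling itself but the shape of the code: the settings live in one place, the function states its behavior in a docstring, and the edge cases are explicit, which makes the step easy to review, reuse, and test.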

Before we get to the personas, we need to define the areas in which discipline must be established. The high-level focus areas include:

· Clean and Maintainable Code

· Test-Driven Development

· Version Controlled Artifacts

· Data Engineering and Pipelines

· API Design and Deployment Best Practices

· Do’s and Don’ts of Open Source

· Continuous Integration/Delivery

· Intellectual Property and AI/ML

· Regulatory Challenges
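To make one of these focus areas concrete, here is a small sketch of the test-driven style applied to a data preparation step. The function `drop_incomplete_rows`, the column names, and the values are hypothetical and serve only as an illustration. In a test-first workflow, the two test functions would be written before the function body; a runner such as pytest would pick them up by their `test_` prefix.

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[float]]


def drop_incomplete_rows(rows: List[Row], required: List[str]) -> List[Row]:
    """Keep only rows where every required column is present and not None."""
    return [
        row for row in rows
        if all(row.get(column) is not None for column in required)
    ]


def test_removes_rows_with_missing_values() -> None:
    rows = [
        {"age": 31.0, "income": 52000.0},
        {"age": None, "income": 48000.0},   # missing value
        {"income": 61000.0},                # missing column
    ]
    assert drop_incomplete_rows(rows, ["age", "income"]) == [rows[0]]


def test_keeps_all_rows_when_nothing_is_required() -> None:
    rows = [{"age": None}]
    assert drop_incomplete_rows(rows, []) == rows
```

Because the tests pin down the expected behavior, the cleaning logic can later be moved out of a notebook or rewritten with another library without silently changing its results.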

Prior Work

One of the most notable papers in this area is “Hidden Technical Debt in Machine Learning Systems” [1] by Google engineers, published at the NIPS conference in 2015. Another remarkable work is the 2019 paper by Microsoft engineers, “Software engineering for machine learning: a case study” [2]. The paper “The emerging role of data scientists on software development teams” [3] is also a notable contribution to the topic. Most of these works were part of academic and limited enterprise discussions. To offer sustained guidance to interested enterprises, Gartner published a professional advice white paper, “Preparing and Architecting for Machine Learning” [4].

In recent years, academic institutions have started to focus on teaching software engineering practices to Data Science and Machine Learning students. Some books aimed at working professionals have been published, and others are in production. These books concentrate on known patterns and anti-patterns found in the practitioner community. Machine Learning Design Patterns by Valliappa Lakshmanan et al. [5] is an excellent reference.

What is Cooking?

In the subsequent articles, we will discuss the best practices for each of the focus areas, starting with the art of writing clean code. We will approach the series from both a practitioner's point of view and an architecture point of view.

References

[1] Hidden Technical Debt in Machine Learning Systems, https://dl.acm.org/doi/10.5555/2969442.2969519

[2] Software engineering for machine learning: a case study, https://dl.acm.org/doi/10.1109/ICSE-SEIP.2019.00042

[3] The emerging role of data scientists on software development teams, https://dl.acm.org/doi/10.1145/2884781.2884783

[4] Preparing and Architecting for Machine Learning, https://www.gartner.com/en/documents/3573617/preparing-and-architecting-for-machine-learning

[5] Machine Learning Design Patterns, https://learning.oreilly.com/library/view/machine-learning-design/9781098115777/

Written by Jaganadh Gopinadhan