Machine learning is a technical process, but it starts and ends with people. The first step to structuring your machine learning project is to consider the people you need to make it happen.
Data scientists are the “x factor” in a machine learning project, representing the main difference between an ML project and other kinds of software development. They create (or select) and train algorithms, building the models for machine learning. Data scientists may specialize in certain kinds of problems or data sets. Some are more research-driven and academically minded, while others are more results- and task-focused. While both kinds can contribute, it’s preferable for team leaders to have a results orientation rather than a tendency to linger on questions.
Data scientists can grab data, throw together an algorithm, and show that it works. But when they hack together a demo, they may take shortcuts. They may create their solution under idealized assumptions about the data inputs and algorithmic outputs. Integrating and scaling a data scientist’s work into production requires building pipelines for data to go to the right place. In other words, it requires . . .
Engineers are the people who typically put the “In Practice” into “Machine Learning in Practice.” They ensure the technology is built in a robust way so that it can be packaged and deployed into production in mission-critical systems. Engineers help with software development, best practices, and data wrangling/pre-processing. They might set up the infrastructure, get the data pipeline in place, and ensure the data scientists have everything they need to focus on the models.
A machine learning algorithm often ends up as a single function in a working piece of software. Software engineers think about how to maintain the software over time, making the algorithm robust for the real world. Data scientists are often not classically trained programmers, so they may not follow best practices when it comes to writing reusable code.
Engineers also help ensure the entire effort is not beholden to the one expert who worked on the project. They make sure the software logs at an appropriate level, that it can be monitored, that there is proper documentation, and that other software best practices are followed. This frees data scientists to build the best algorithm possible. Without engineers, data scientists may be stuck just doing cool demos.
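As a rough illustration of what "a single function" with logging around it might look like, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the logger name, the `with_logging` helper, and the `toy_model` stand-in are hypothetical and do not come from any particular project.

```python
import logging
import time
from typing import Callable, Sequence

# Hypothetical logger name; a real service would configure handlers elsewhere.
logger = logging.getLogger("model_service")

Predictor = Callable[[Sequence[float]], float]


def with_logging(predict: Predictor) -> Predictor:
    """Wrap a prediction function so every call is timed and logged."""

    def wrapped(features: Sequence[float]) -> float:
        start = time.perf_counter()
        try:
            result = predict(features)
        except Exception:
            # Record the failure with a traceback, then let the caller handle it.
            logger.exception("prediction failed for %d features", len(features))
            raise
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(
            "prediction=%s features=%d elapsed_ms=%.2f",
            result, len(features), elapsed_ms,
        )
        return result

    return wrapped


def toy_model(features: Sequence[float]) -> float:
    """Stand-in for a data scientist's model: the mean of the inputs."""
    return sum(features) / len(features)


logged_model = with_logging(toy_model)
```

Calling `logged_model([1.0, 2.0, 3.0])` returns the same value as `toy_model` would, while leaving a log record that monitoring tools can pick up. The model itself stays untouched, which is one way the engineering concerns can be kept separate from the data science.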
A “unicorn” developer is one who can both lead the data science and implement software best practices. It’s rare to find one person who can handle all of these responsibilities.
Note that while it is preferable to involve engineers from the beginning of a project, the full engineering need may be difficult to know during early planning stages. Engineers are most necessary during the later deployment stage (see below), at which point the project may have evolved significantly.
The data engineer role is focused on wrangling and pre-processing data to prepare it for machine learning.
Text data, for example, may come in as a PDF, a .docx file, a .txt file, or a string from a database. A data engineer may need to convert the text to the proper format for a given programming language. After the text is loaded, more pre-processing might be required due to language, length, word frequency, and word variation.
It’s an iterative process: if an algorithm isn’t working, a data engineer will try finessing the data in a different way. After a project goes into production, data engineers may work together with software engineers and data scientists to assess and optimize the data feeding back into the model.
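The kind of text pre-processing described above might be sketched as follows. This is a minimal example using only the Python standard library, and it assumes the text has already been extracted into a plain string (reading PDF or .docx files would need additional libraries, not shown here). The function names and the `min_count` and `max_tokens` parameters are hypothetical, chosen to mirror the concerns mentioned: word variation, word frequency, and length.

```python
import re
import unicodedata
from collections import Counter
from typing import List


def normalize_text(raw: str, lowercase: bool = True) -> str:
    """Reduce superficial variation: Unicode form, case, and whitespace."""
    text = unicodedata.normalize("NFKC", raw)
    if lowercase:
        text = text.lower()
    # Collapse runs of spaces, tabs, and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text: str) -> List[str]:
    """Split text into word tokens, dropping punctuation."""
    return re.findall(r"\w+", text)


def preprocess(raw: str, min_count: int = 1, max_tokens: int = 1000) -> List[str]:
    """Normalize, tokenize, truncate by length, and drop rare words."""
    tokens = tokenize(normalize_text(raw))[:max_tokens]
    counts = Counter(tokens)
    # Keep only words that occur at least min_count times in this document.
    return [token for token in tokens if counts[token] >= min_count]
```

Because the choices are exposed as parameters, a data engineer can rerun the same pipeline with, say, a different `min_count` when a model is not performing well, which is one way the iteration described above can play out in practice.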
The need for this role will vary with projects and team size. In cases where a lot of pre-processing is required and senior data scientists are better focused elsewhere, more junior data experts could take on this role. In other cases, data scientists may take on the data prep themselves.
Project Roles During Project Phases
Data Processing Phase
Data Engineers wrangle the data while Data Scientists offer guidance (and do data engineering as needed).
Algorithm Development Phase
Data Scientists create algorithms while Software Engineers offer help and guidance.
Solution Deployment Phase
Software Engineers implement the solution while Data Scientists tweak algorithms as needed.
Beyond the core technical team, more roles are crucial for achieving business impact in a machine learning project:
The Business Owner is responsible for focusing the team on the core business problem to be solved and for providing business support via the budget. Business owners may also need to communicate the unique attributes of machine learning projects to the leaders above them, including the potential ebb and flow of project timelines. The business owner’s biggest challenge is often understanding what is actually possible with machine learning. Making promises without a firm understanding can hurt a project.
The Technical Lead is responsible for the overall architecture of the solution. They know how to put all the necessary data sources and infrastructure elements together to complete a project. While this person should have some technical understanding, they may not be a machine learning authority.
The Project Manager structures and guides the project, interfaces with stakeholders, adheres to standard procedures, keeps projects on time and on budget, reuses lessons and technologies from past projects, and appropriately documents the endeavor and business results.
A given ML project may see different individuals in each of these positions. But it’s also possible for one person to hold multiple roles (a technical team member may take on project management tasks, a data scientist may do data engineering). Also, a machine learning team may have members from different companies: business owners will likely be internal employees, but data scientists and engineers may come from external partners.
James Kotecki is the Director of Marketing & Communications at Infinia ML, a team of data scientists, engineers, and business experts putting machine learning to work.