Quality assurance in data science
Quality assurance and testing in machine learning systems
Testing is one of those deceptive activities that takes more effort in the short term but is a huge time-saver in the long term. A unit or integration test that checks a particular step in your data preparation pipeline can save an enormous amount of time and cost on a machine learning system.
Since most machine learning systems are intertwined with applications, testing and validation at the model level alone is not enough. Every data scientist is aware of the model validation process; however, other aspects of the system require validation and testing as well.
Testing can be introduced into each part of the data science process. Test harnesses and automated test suites can cover some parts; a few others may remain manual in nature. The different types of testing are explained below.
Data quality assurance
As the title suggests, this step ensures that the data values fed to the model, whether during validation, training, or otherwise, are of adequate quality. Input data should be validated against the expected schema and matched with the correct version. Wherever data transformations are involved (LSA, one-hot encoding, SVD) or default values are set, automated tests can verify that the resulting values are correct. These tests can also ensure that missing values are handled correctly.
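Such checks can be as simple as plain assertions. The sketch below validates one input record against a hypothetical expected schema and fills missing values with defaults; the column names and types are illustrative, not from any real pipeline:

```python
# Minimal sketch of an automated data-quality check.
# EXPECTED_SCHEMA and the column names are hypothetical examples.
EXPECTED_SCHEMA = {"age": int, "income": float, "region": str}

def validate_row(row, schema=EXPECTED_SCHEMA, defaults=None):
    """Validate one record against the expected schema, filling
    missing values with defaults where provided."""
    defaults = defaults or {}
    cleaned = {}
    for col, col_type in schema.items():
        value = row.get(col)
        if value is None:  # missing value: fall back to a default if one is set
            if col not in defaults:
                raise ValueError(f"missing value for '{col}' and no default set")
            value = defaults[col]
        if not isinstance(value, col_type):
            raise TypeError(f"'{col}' expected {col_type.__name__}, "
                            f"got {type(value).__name__}")
        cleaned[col] = value
    return cleaned

# A record with a missing 'income' is repaired from the defaults
row = validate_row({"age": 42, "region": "EU"}, defaults={"income": 0.0})
```

Tests like this can run in the pipeline before any training or scoring job touches the data.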
In the MLOps world, a team can choose among various ways to deploy models: canary, composite, real-time, or A/B-test deployment methodologies. These deployment pipelines have built-in testing processes to gauge the efficacy of the models. These can be automated unit tests or manual tests that execute parts of the training data (the test set) against the models, checking model response times, accuracy of responses, and other performance parameters. Additionally, the model should be tested on data sets containing outlier examples it may not have been trained on, and it should handle such scenarios gracefully.
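A deployment-gate test of this kind might look like the following sketch, where `predict` is a placeholder for a call to the deployed model, and the latency budget is an assumed value:

```python
import time

def predict(features):
    # Stand-in for a request to the deployed model endpoint (hypothetical)
    return 1.0 if sum(features) > 0 else 0.0

def test_latency_and_outliers(budget_s=0.1):
    """The model must respond within a latency budget and return a
    well-formed score even for inputs far outside the training data."""
    outlier = [1e9, -1e9, 0.0]  # deliberately extreme feature values
    start = time.perf_counter()
    score = predict(outlier)
    elapsed = time.perf_counter() - start
    assert elapsed < budget_s, "model exceeded its latency budget"
    assert 0.0 <= score <= 1.0, "model returned an out-of-range score"
    return True
```

In a real pipeline the same shape of test would hit the staging endpoint rather than an in-process function.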
While ML model performance is non-deterministic, data scientists should collect and monitor a set of metrics to evaluate a model's performance, such as error rate, accuracy, AUC/ROC, the confusion matrix, precision, and recall. These metrics should be saved and reported consistently, monthly or deployment by deployment. This becomes even more important if the team deploys models using a canary or A/B-testing methodology. Performance thresholds should be established and used over time to benchmark models and deployments.
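The benchmarking step can be automated by deriving the metrics from confusion-matrix counts and comparing them to the agreed thresholds. The counts and threshold values below are illustrative only:

```python
# Sketch of benchmarking a deployment against established thresholds.
def classification_metrics(tp, fp, fn, tn):
    """Derive precision, recall, and accuracy from confusion-matrix counts."""
    return {
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
    }

# Hypothetical minimum acceptable values agreed by the team
THRESHOLDS = {"precision": 0.80, "recall": 0.70, "accuracy": 0.85}

def passes_benchmark(metrics, thresholds=THRESHOLDS):
    return all(metrics[name] >= floor for name, floor in thresholds.items())

metrics = classification_metrics(tp=90, fp=10, fn=20, tn=80)
```

Storing one such metrics dict per deployment gives the month-over-month record the text describes.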
Testing model bias and fairness
Models can be fed data that is biased. Machine learning finds patterns in data; 'AI bias' means it may find the wrong patterns: a system for spotting skin cancer might be paying more attention to whether the photo was taken in a doctor's office. ML doesn't 'understand' anything; it just looks for patterns in numbers, and if the sample data isn't representative, the output won't be either. Meanwhile, the mechanics of ML can make this hard to spot.
While we might get good performance on the overall test and validation datasets, it is also important to check performance on slices of the data, i.e. across the values of a given feature (e.g. race, gender, or region). A tool like Facets can help the team visualize those slices and the distribution of values across the features in your datasets.
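A per-slice evaluation can also be done directly in code. The sketch below groups accuracy by the values of one feature (a hypothetical 'region' column) so that a gap between slices becomes visible even when the overall accuracy looks fine:

```python
from collections import defaultdict

def accuracy_by_slice(records, feature):
    """records: dicts with 'label', 'prediction', and the slicing feature.
    Returns accuracy per distinct value of that feature."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for r in records:
        totals[r[feature]] += 1
        hits[r[feature]] += int(r["label"] == r["prediction"])
    return {value: hits[value] / totals[value] for value in totals}

# Illustrative toy data: overall accuracy is 75%, but region B lags region A
data = [
    {"region": "A", "label": 1, "prediction": 1},
    {"region": "A", "label": 0, "prediction": 0},
    {"region": "B", "label": 1, "prediction": 0},
    {"region": "B", "label": 1, "prediction": 1},
]
slices = accuracy_by_slice(data, "region")
```

A fairness check can then assert that no slice falls more than an agreed margin below the overall score.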
System integration testing
This testing should be done after all the testing mentioned above. We can take a similar approach to testing the integration between different services, using automated tests to validate that the expected model interface is compatible with the consuming application. These tests become important when the ML application communicates with an external service. Such services are typically maintained by a different team, may sit behind slow and unreliable networks, and may be unreliable themselves. A failure in any of these automated tests means you need to update your tests, and probably your code, to account for the external service changes.
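An interface check of this kind is often written as a contract test. In the sketch below, `fake_model_service` stands in for the external service and the response fields are assumptions for illustration; the real contract would mirror the consuming application's expectations:

```python
def fake_model_service(payload):
    # Stand-in for an HTTP call to the external model service (hypothetical)
    return {"prediction": 0.87, "model_version": "1.4.2"}

def check_contract(response):
    """Fail loudly if the service response drifts from the agreed interface."""
    assert isinstance(response, dict), "response must be a JSON object"
    assert isinstance(response.get("prediction"), float), "missing float score"
    assert 0.0 <= response["prediction"] <= 1.0, "score out of range"
    assert isinstance(response.get("model_version"), str), "missing version tag"
    return True

ok = check_contract(fake_model_service({"features": [1.0, 2.0]}))
```

Running this test in CI means an interface change by the owning team is caught before it reaches the consuming application.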
Another type of testing is relevant when your production model and your development model are written in different languages. After deploying to production, both models (and their respective versions) should be scored against the same validation dataset and their results compared. This can be done as part of post-deploy verification, as mentioned below, or as a separate automated test.
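Such a parity check can be sketched as follows. Both models here are toy stand-ins (the development model and a re-implementation that differs by a tiny numerical amount); the tolerance is an assumption the team would set themselves:

```python
def dev_model(x):
    # Stand-in for the development-time (e.g. Python) model
    return 2.0 * x + 1.0

def prod_model(x):
    # Stand-in for the production re-implementation in another language,
    # which may differ slightly due to numerics or serialization
    return 2.0 * x + 1.0000001

def models_agree(inputs, tol=1e-3):
    """Both implementations must produce predictions within `tol`
    of each other on the shared validation inputs."""
    return all(abs(dev_model(x) - prod_model(x)) <= tol for x in inputs)

agree = models_agree([0.0, 1.5, -3.2, 10.0])
```

In practice the inputs would be the shared validation dataset rather than a hand-picked list.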
Post deploy verification
After deployment, we want to run a suite of automated test scripts that exercise the product with a pre-determined set of inputs and check that it gives reasonable outputs. This is usually done right after deployment to give a go/no-go decision on whether the deployment was successful. It also ensures there are no regressions when the newer product/system is deployed. As mentioned above, system integration testing can partially feed into post-deploy verification.
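A minimal smoke-test sketch of that go/no-go step, where `deployed_predict` stands in for a request to the freshly deployed endpoint and the cases and expected ranges are hypothetical:

```python
def deployed_predict(features):
    # Stand-in for a call to the newly deployed system (hypothetical)
    return sum(features) / len(features)

# Pre-determined inputs paired with the range a reasonable output must fall in
SMOKE_CASES = [
    ([0.2, 0.4, 0.6], (0.0, 1.0)),
    ([1.0, 1.0, 1.0], (0.5, 1.5)),
]

def smoke_test(cases=SMOKE_CASES):
    """Return 'go' only if every canned input yields an output in range."""
    for features, (lo, hi) in cases:
        out = deployed_predict(features)
        if not (lo <= out <= hi):
            return "no-go"
    return "go"

decision = smoke_test()
```

A "no-go" result would typically trigger an automatic rollback to the previous model version.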
Other types of tests could be added as well. However, it is important to include some manual stages in the deployment pipeline that display information about the model and let humans decide whether it should be promoted. This information can be tied to specific model and code versions, making the process robust and rollback-friendly.
This allows you to model a machine learning governance process and introduce checks for model bias and fairness, or gather explainability information that helps humans understand how the model behaves. Coupled with the right data science versioning process, this can build a robust machine learning development pipeline for a data science team.
Subscribe to our Acing Data Science newsletter for more such content.