Declare your biases

Rama Akkiraju
Feb 20, 2018 · 6 min read


Photo by Joshua Earle on Unsplash

This month there has been a lot of discussion on biases in machine learning models in the AI world, spurred by this New York Times article on bias in AI systems. Every time the topic of machine learning models and biases comes up, everyone invariably points out how important it is to ensure that the training data is representative and unbiased. While that is an important point, I'm afraid it is not very actionable for practitioners who are building machine learning models unless we provide prescriptive guidance on how to ensure that the training data is 'representative' and 'unbiased'. How can a data scientist building a machine learning model ensure that the training data she is working with is unbiased? What does it mean to be 'unbiased' anyway? Unbiased within what scope? Representative of what scope? Who defined that scope for her? How is that scope measured? She needs guidance and tools to answer these questions.

I’d like to make two points on this topic in this article:

First, I’d argue that there is no such thing as an ‘unbiased’ machine learning model. So, instead of striving for unbiased machine learning models, a machine learning model should state its biases openly.

Second, minimizing biases has to start with creating test datasets rather than with training datasets.

Let me elaborate.

State your biases: If one were to attempt collecting an unbiased training dataset in a particular domain to build a robust machine learning model, one would ideally have to collect enough representative samples of data from the domain the model is trying to learn. How does one go about modeling that domain and mapping out its contours so as to sample enough data from the space the domain represents? Physicists, mathematicians, and statisticians like to explain the phenomena in our world through models. Models are good tools for explaining what is happening around us in general terms, if not at every specific occurrence of a phenomenon. Distributions such as the Gaussian, lognormal, exponential, Laplace, and Gamma are often used to represent occurrences in the real world. They are approximations, but they serve well to help us reason about things. So we have tools at our disposal to figure out which distribution might be a good approximation of the domain we are trying to model, and once a distribution is identified, we can use tools again to check whether we have enough samples to represent that model of the world reasonably well.

However, here lies the problem. More often than not, domains are not evenly distributed, meaning not all phenomena occur at the same frequency. Some phenomena are harder to observe than others simply because they occur less often. A corollary is that certain types of data are much more difficult to collect than others, because there simply aren’t enough of them to go around. During the data acquisition process, organizations also have to deal with budget and time constraints. Rarely do organizations have unlimited budgets and time to collect samples representative enough to build comprehensive datasets that avoid biases completely. One can, at best, mitigate biases with careful planning (I will discuss this in the next point).

Therefore, I’d argue that it is more practical for a machine learning model to declare its biases than to pretend that it is unbiased, or that it can ever be fully unbiased. How can we do this? Well, one way is to be open about the scope, coverage, type, and sources of the data that a model is trained on. I know this gets into revealing too much of one’s secret sauce to the whole world. Organizations don’t like to do this, usually for good reasons. However, here is an analogy that might help us rationalize it. When a new drug is released to the market, the Food and Drug Administration (FDA) (or the analogous body in another country) mandates that the ingredients used in making that medicine be declared on the drug label. Drug companies would rather not do it if they could help it, but it helps patients understand what they are getting. Maybe machine learning models ought to be treated like these new medicines released to market. If we tell users what a model is trained on, and its innate biases, they know what to expect and won’t hold the model accountable for something it was never trained on!
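To make this concrete, here is one minimal sketch of what such a declaration might look like in code. The schema, field names, and example values are my own illustration, not an established standard:

```python
from dataclasses import dataclass, field
from typing import List

# A hypothetical "drug label" for a machine learning model. The schema and
# field names below are illustrative only; no standard is implied.
@dataclass
class ModelDeclaration:
    name: str
    task: str                    # what the model is trained to do
    data_sources: List[str]      # where the training data came from
    data_scope: str              # domain, time range, languages covered
    known_gaps: List[str] = field(default_factory=list)    # under-sampled regions of the domain
    known_biases: List[str] = field(default_factory=list)  # biases the builders are aware of

declaration = ModelDeclaration(
    name="sentiment-classifier-v1",
    task="binary sentiment classification of product reviews",
    data_sources=["public product reviews, 2015-2017"],
    data_scope="English-language consumer electronics reviews",
    known_gaps=["non-English text", "reviews of services rather than products"],
    known_biases=["skews toward reviewers who post frequently"],
)
```

Publishing something like this alongside a model tells users exactly what it was trained on, without necessarily revealing the training data itself.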

Start with test datasets: In software engineering, after many years of trial, error, and iteration, best practices evolved for building robust software with minimal defects. One such practice states that test cases must be written before any code. Once the expected behavior of the software is defined through detailed requirements, business analysts write acceptance test cases and developers write corresponding unit and system test cases, all before the software is designed and developed to meet those requirements and pass those tests. Passing the test cases is how one measures whether the software meets the requirements. Test-case coverage is an important metric in building good-quality software, and when building large commercial software, teams of testers are employed to write test cases and exercise the software from all angles.

It seems that in building machine learning models we have somehow forgotten these basic principles of software development. The onus is usually on the data scientists building the models to ensure that they train on ‘unbiased’ data. Whatever happened to writing test cases first? Whatever happened to creating a test team? A test dataset in the machine learning world can be thought of as a test case in traditional software development. Just as test teams are an integral part of a software development organization, teams that create test datasets should be an integral part of organizations developing AI systems. Similar to ‘test coverage’ metrics in software engineering, we need to define, measure, and monitor ‘test coverage’ for machine learning models. Leaving this to the data scientists who build the models is not enough; a good software project never relies on developer test cases alone to release commercial software. It must be tested by independent testers.
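As a sketch of what treating test datasets as test cases could look like in practice, here is a hypothetical pytest-style suite. The `my_project` helpers (`load_model`, `load_test_slice`), the slice names, and the accuracy floor are all assumed placeholders, not a real API:

```python
import pytest

# Hypothetical helpers: load_model() and load_test_slice() stand in for
# whatever a project uses to fetch the trained model and the independent
# test datasets. They are placeholders, not a real library.
from my_project import load_model, load_test_slice

# Each slice is a test dataset curated by an independent test team to
# probe one region of the model's intended scope.
SLICES = ["formal_text", "informal_text", "short_inputs", "rare_entities"]

@pytest.mark.parametrize("slice_name", SLICES)
def test_model_meets_accuracy_floor(slice_name):
    model = load_model()
    examples, labels = load_test_slice(slice_name)
    predictions = [model.predict(x) for x in examples]
    accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
    # The 0.85 floor is an arbitrary example threshold; a real test team
    # would set it per slice based on the requirements.
    assert accuracy >= 0.85, f"accuracy {accuracy:.2f} below floor on '{slice_name}'"
```

The point is that the slices and thresholds are owned by testers, not by the data scientists who trained the model, just as in traditional software development.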

Once the test datasets are created, it is the job of data managers (please refer to my previous article on new roles in machine learning systems for a definition of who a data manager is) to collect training data that has the desired ‘coverage’ to train a machine learning model. At this point we need good metrics, algorithms, and tools to measure various aspects of the training and test datasets: how close they are to each other, what the gaps are, and in which areas those gaps lie. Unsupervised machine learning algorithms themselves (such as topic modeling and clustering) can be put to use here to measure the distance between train and test datasets along various dimensions and to understand where gaps in the training data need to be bridged. Based on the gaps found, data managers can iterate until they reach a desired coverage threshold, or a threshold on the distance between test and train datasets in terms of coverage. Clearly, these are ideas and concepts at this time; we need more research on these topics to build the methods, processes, and tools that would institutionalize this kind of disciplined approach to building train and test datasets.
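For instance, here is one rough sketch of such a measurement in Python, using k-means clustering (scikit-learn) and Jensen-Shannon distance (SciPy) to compare how the train and test sets spread over a shared set of clusters. The choice of k, the feature vectors, and the distance measure are all illustrative assumptions, not a prescribed method:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import jensenshannon

def coverage_gap(train_vectors: np.ndarray, test_vectors: np.ndarray, k: int = 20) -> float:
    """Cluster the combined data, then compare how the two datasets
    distribute over the clusters. Returns a Jensen-Shannon distance in
    [0, 1]; 0 means identical cluster proportions. The value of k and
    the use of Jensen-Shannon distance are illustrative choices."""
    combined = np.vstack([train_vectors, test_vectors])
    clusters = KMeans(n_clusters=k, n_init=10, random_state=0).fit(combined)
    train_labels = clusters.predict(train_vectors)
    test_labels = clusters.predict(test_vectors)
    # Normalized histogram of examples per cluster, for each dataset.
    train_dist = np.bincount(train_labels, minlength=k) / len(train_labels)
    test_dist = np.bincount(test_labels, minlength=k) / len(test_labels)
    return jensenshannon(train_dist, test_dist, base=2)
```

Clusters that carry test mass but little train mass would point a data manager at the regions of the domain where training data still needs to be collected.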

But wait a minute! I still haven’t addressed the original problem of creating test datasets that cover the scope of the machine learning model. I have simply argued for distributing the problem of creating test datasets across multiple people (which is a good start anyway; diversity often ensures good coverage). We still need a good ‘coverage’ metric to measure the distance between the scope covered by the collected test datasets and that of the ‘requirements’. Okay, so now we are on to a different point: how do we represent requirements for a machine learning model? These are all critical questions we must ask and answer for ourselves. I don’t have answers to all of them yet, but I’m certainly thinking about them, and I know several of you are as well.

One thing is clear to me, though! We need to be more methodical about building machine learning models than we currently are. There is a lot to learn from software engineering practices and from quality management in manufacturing. We don’t need to reinvent those wheels; we need to find suitable interpretations so we can build on those best practices.
