Me Asking What the Data Means (and Asking You to Ask Too)


In writing this post, I’ve created several drafts going in several directions. I wanted to rant about my perception of machine learning culture and how the excitement around, and (objectively fantastic) accessibility of, sophisticated deep learning models are pulling us away from the real problems we’re capable of solving. I wanted to write about being inclusive in the way that I feel is right: by promoting data science fundamentals to newcomers and moving away from esoteric talk of the flashiest models, which can make those not working on model development feel like their work isn’t intellectually valuable. I wanted to propose a structure that incentivizes the sharing of knowledge to solve the plethora of data challenges everywhere, including in developing countries. However, all of this would require me to actually know the machine learning community beyond the researchers I follow on Twitter and the chats I have with labmates, and it would require me to have answers to the problem of distributing data experts to the places they’re severely needed. I don’t have those answers, and I’d be lying if I said I knew everyone’s intentions and goals in the community at large. So instead, I’ll share an anecdote that will hopefully creep back into your mind the next time you crack open a fresh dataset.

In the field with the AI Research Lab

I first arrived at the Artificial Intelligence Research Lab at Makerere University in June 2017, with most of my machine learning experience coming from classes or standard Silicon Valley-type internships. The AI Lab had been working on identifying crop diseases from images of leaves taken in the field by farmers. These images were to be analyzed by a machine learning model that would diagnose disease type and severity. In my first days, to familiarize myself with the data, Dr. Ernest Mwebaze gave me access to a dataset roughly separated into “good” and “bad” folders: image quality varies widely, and filtering out the bad images could save a lot of time and money. This problem appeared to sit squarely in the domain of the convolutional neural network (CNN) models that have revolutionized computer vision as computational resources have grown ever faster, and a CNN was definitely my first response. After slowing down from lackluster results, I backtracked, inspected the data, and found myself in a much more obscure world of problems.

As far as I know, data isn’t really the product of spontaneous events blessing us with magical comma-separated value files containing the answers that our models are eager to partition with scalpel precision. Data is produced by something or someone. Data is the sprinkling of information left for us to piece together some semblance of meaning, and despite the “elegance” of our latest algorithms, our computers are still as brutish as ever. One mantra worth chanting is “my model is only as good as my data.” In the case of our crop images, farmers without much photography experience were tasked with providing consistent images that would optimize our ability to distinguish a healthy crop from a disease-afflicted one. I’m being facetious when I mention a farmer’s photography skills, but from a data collection standpoint, clarity and consistency are extremely important. A dataset is being created to get closer to the question of a crop’s health, and along the way there are many obstacles.

With each step down the pipeline from data production to decision, abstractions are made, and in our case, information is lost. I’ve provided an example of a cassava leaf that exhibits great detail and can stand as a representative example of the type of image I’d like to train a model on. Again, I can’t stress enough how important it is to keep in mind what the information actually means with respect to the goal. Overall, the goal is to record and map the health of crops in a region. In this situation, we are asking a single leaf to represent the health of an entire plant. That already allows for some choice on the part of the farmer. How did the farmer choose this leaf? Why from that specific plant, and is it actually representative? Assumptions are being made. From our perspective, we need to pick our battles: a clear, geotagged image of a leaf is already a great win, and as a data scientist, I make note of the potential pitfalls of unrepresentative leaf selection.

“Good” image providing clear surface area with the subject as the focal point

While I think it’s important to keep in mind that I am in no way an agriculture expert, I also think I have some authority to say that the following example of a “bad” image does not tell me much about the health of the plant, or at the very least, it’s not consistent with the centered, focal-leaf model established by most of the photos resembling the “good” one. Master’s student Daniel Mutembesa at Makerere University leads the project collecting this data, and he has put real thought into the process, but there are still a lot of free variables to consider on the data analysis side. The water gets a lot muddier between our beautiful “good” picture and the vague “bad” picture, which forced me to think about my own definitions and the language I used around them.

Generally “bad” image.

What we now have is a binary classification problem between an inclusion group (“good” images) and an exclusion group (“bad” images), and as we can see from our two quick examples of bad, exclusion from the “good” group doesn’t imply homogeneity within the “bad” group. The two bad images are bad for different reasons. What’s worse is that there are images that toe this line much more closely than I’d like. What do we do with an image that has the proper framing but is horribly lit or blurry? What if it’s shot at an inconsistent angle or contains multiple leaves? It can be argued that a robust neural network will take care of these issues, but this is just step one. I’m merely building a model to say whether these images are in or out; I haven’t even begun the diagnosis.

Blurry Image: wouldn’t cut it for someone making a diagnosis.

At this point, clarity was badly needed. A good image is an image with a central foreground object (a leaf) that, if presented to an expert, would provide enough insight to make a visual diagnosis. To me, this meant an image roughly focused on a single leaf, presenting enough clear surface area to the camera that the algorithm could, hopefully, achieve some consistency: minimizing variance between images and isolating the variable of disease rather than worrying about everything else changing in the frame. I don’t want images of a leaf with and without a disease to look vastly different save for the condition and color of the leaves. As for bad images, I split these into several categories of their own. Some were blurry, some were what I called busy, and some were just pictures of a man smiling in front of his crops.
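The “blurry” category, at least, lends itself to a cheap automated check. One common heuristic (not necessarily what we used at the lab; this is an illustrative sketch, and the threshold here is a made-up placeholder that would need tuning on labeled examples) is the variance of the image’s Laplacian: sharp images have strong edges and a high-variance response, while blurry or flat images don’t.

```python
import numpy as np

# 3x3 Laplacian kernel: responds strongly to edges, weakly to smooth regions.
LAPLACIAN = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]], dtype=float)

def laplacian_variance(gray: np.ndarray) -> float:
    """Convolve a 2-D grayscale image with the Laplacian kernel (valid
    region only) and return the variance of the response."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(3):
        for j in range(3):
            out += LAPLACIAN[i, j] * gray[i:i + h - 2, j:j + w - 2]
    return float(out.var())

def looks_blurry(gray: np.ndarray, threshold: float = 100.0) -> bool:
    # threshold is illustrative only; tune it against hand-labeled images.
    return laplacian_variance(gray) < threshold

# Sanity check on synthetic images: a checkerboard is all edges,
# a flat gray image has none.
sharp = (np.indices((64, 64)).sum(axis=0) % 2) * 255.0
flat = np.full((64, 64), 128.0)
print(looks_blurry(sharp), looks_blurry(flat))  # False True
```

In practice such a score would only triage the obvious cases, leaving the borderline images (proper framing, bad lighting) to human judgment, which is exactly where the line-toeing examples live.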

I notice myself talking about this neural network model as if I know how it’s going to behave, and I probably don’t, but I do know the model isn’t magic. I can’t just feed it two datasets with binary labels and expect it to read my mind and generalize to distinguish images by their quality for later diagnosis. I need to spoon-feed it this information as much as I possibly can. From the perspective of the model, it’s being handed two large sets of images and told to figure out what makes these groups different, then apply that logic to a new set of images. That is it. It’s up to me, the data scientist, to make that decision as easy as possible.
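To make that point concrete, here is a deliberately toy sketch of what the model actually receives: feature values and binary labels, nothing more. This is not the CNN described above; it is logistic regression on a single simulated feature (think of it as mean brightness), with all numbers fabricated for illustration. The model has no idea what “good” means beyond whatever separates the two labeled piles.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated feature per image: "good" images drawn brighter than "bad"
# ones purely by assumption, so the two groups are separable.
good = rng.normal(loc=0.7, scale=0.1, size=200)
bad = rng.normal(loc=0.3, scale=0.1, size=200)
x = np.concatenate([good, bad])
y = np.concatenate([np.ones(200), np.zeros(200)])  # binary labels only

# Gradient descent on the logistic loss. All the "model" ever sees is
# (feature, label) pairs; it must infer the boundary from those alone.
w, b = 0.0, 0.0
for _ in range(2000):
    p = 1.0 / (1.0 + np.exp(-(w * x + b)))
    w -= 1.0 * np.mean((p - y) * x)
    b -= 1.0 * np.mean(p - y)

pred = (1.0 / (1.0 + np.exp(-(w * x + b)))) > 0.5
print(f"training accuracy: {np.mean(pred == y):.2f}")
```

If the feature doesn’t actually carry the distinction you care about, no amount of optimization recovers it, which is why the sorting and defining work upstream matters so much.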

This maybe sounds like a laborious example of me looking at a bunch of photos and figuring out what makes some better than others, and trust me, it was laborious. I spent at least two days flipping through thousands of images, sorting them with my personal decision policy. Maybe that wasn’t the best way, but the model is only as good as the data. It’s up to the data scientists to define their goals and think about the flow of information signals and how to diminish noise in real world problems. The stakes here were relatively low because this system was primarily built for filtering and assisting future models, but that’s not always the case.

While working at the Government Laboratory at Universidad de Adolfo Ibañez in Chile, I had the chance to talk to Professor Rhema Vaithianathan, the lead researcher implementing the new system to flag children at risk of abuse in Pittsburgh, Pennsylvania. The basic idea is that there currently exists a system to connect a limited resource, family social workers, with the families that will benefit most from their services, mitigating child abuse. The factors considered in this decision range from health data to family data across multiple government sectors. I encourage you to read Dan Hurley’s piece in The New York Times Magazine on the subject for much more thorough context.

This model may also be built on machine learning techniques, but the stakes are quite high. Defining goals and outcomes with respect to variables is the difference between creating a system that, for example, severely targets teenage mothers as potential abusers because some implementers decided teen pregnancy was the thing to be avoided, and one that strictly aims to reduce the number of children returning to the legal system with a case of abuse. Professor Vaithianathan stressed that, when working on problems that involve entire groups of people, it is extremely important to define the circumstance to be avoided or the outcome to be optimized, and that this definition needs to be as far from controversial as possible. This sounds trivial, but I can’t echo the sentiment loudly enough. While I was in my corner of the office debating with myself whether a leaf was good or bad, people were using the exact same tools to define which behaviors should be targeted and mitigated at the scale of an entire county.

As a data scientist, there is a lot of power and longevity in decisions made at the time of implementation. The boundary between the data scientist skilled at developing an extremely accurate model, with respect to the quantitative, canonical “Accuracy” value their model outputs, and the data scientist skilled at solving problems is nonexistent. They are the same person. These decisions are worth slowing down for. What the data actually represents is worth having a conversation about, and it’s worth admitting where you might be wrong. I’ve been wrong more than my fair share of times, but I’m still around asking questions, sometimes staring at leaves hoping answers will reveal themselves like a Magic Eye autostereogram.

AI Research Lab Kampala

The AI & Data Science research group at Makerere University specialises in the application of artificial intelligence and data science - including, for example, methods from machine learning, computer vision and predictive analytics - to problems in the developing world.

Written by Dominiquo Santistevan

I like Data Science and how it's used wrt other things.

