Data doesn’t lie
How high school students can use data science to drive solutions
Most high school students are exposed only to basic computational logic and have little understanding of the power of data science. Data has permeated every aspect of our lives, and, analyzed correctly and in context, it can be an extremely powerful tool, providing evidence, insights, and predictions about aspects of the world around us that are not easily visible. Most people think of data science as a way to drive business decisions and increase profits.
However, I recently started thinking about the power of data science as an effective tool to drive decision making and remove systemic biases, in the context of the recent pandemic and the protests happening all across the country.
Not long ago, I got the opportunity to work on a small data analysis project through an organization called AI4ALL. As part of their alumni program, I attended a two-day workshop at the Salesforce Tower in San Francisco. The goal of the workshop was to expose high school students to the possibilities of data analysis and teach them how to gain valuable insights from a dataset. In addition, AI4ALL also wanted to show the team at Salesforce what younger students can achieve on their own. I was selected to present my previous research project at the workshop, and I was excited because, for the first time, my parents allowed me to travel solo on BART to San Francisco.
We were introduced to two of Salesforce’s most well-known data visualization and modeling products, Tableau and Einstein Analytics Studio, to use in our data analysis project. After exploring both for a while, I developed a pretty good sense of what they could do. However, I had yet to identify a data analysis problem. On the train back to San Jose after the first day, I kept thinking, “What is a relevant and recent problem in the world right now?” That was when I noticed a few people wearing masks, and then it hit me.
Why don’t I track the spread of the coronavirus?
At the time of the workshop (late February), COVID-19 was still a relatively new problem for the world, and the US had only a handful of cases. There wasn’t much data available online to build a fully developed coronavirus simulation, but by using the Johns Hopkins live coronavirus dataset, we were able to create a simple projection of the spread of the virus. We took a time-series dataset detailing the number of confirmed cases, recoveries, and deaths, and built a basic simulation in Tableau predicting the spread of the virus.
However, the Tableau model was not very robust: its predictions stretched out only a few days, and the margin of error was extremely high. So I created a new supervised ML model using Einstein Analytics Studio, which was more solutions-driven and suggested the best ways to minimize the spread of the outbreak. Building the model in Einstein Analytics Studio took less than 30 minutes because the process was fully automated. However, that automation also limited our ability to understand how the model worked, and in essence, the model remained a “black box.”
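Our actual projection lived inside Tableau and Einstein Analytics Studio, but the core idea behind a simple case-count projection can be sketched in a few lines of Python: fit an exponential growth curve to a short time series and extrapolate. The numbers below are made up for illustration; they are not the Johns Hopkins data we used.

```python
import math

# Hypothetical daily confirmed-case counts for one region (illustrative
# numbers only, not the actual workshop dataset).
cases = [15, 22, 34, 51, 76, 113, 170]

# Fit exponential growth, cases ~ a * exp(b * t), via least-squares
# regression on log(cases) -- the simplest possible epidemic projection.
n = len(cases)
ts = list(range(n))
logs = [math.log(c) for c in cases]
t_mean = sum(ts) / n
log_mean = sum(logs) / n
b = sum((t - t_mean) * (y - log_mean) for t, y in zip(ts, logs)) \
    / sum((t - t_mean) ** 2 for t in ts)
a = math.exp(log_mean - b * t_mean)

def project(day):
    """Project the case count `day` days after the start of the series."""
    return a * math.exp(b * day)
```

With roughly 50% daily growth in this toy series, the fitted curve more than doubles every two days, which is why even a crude exponential fit signals trouble quickly. A real model, like the ones the tools built for us, would also account for recoveries, deaths, and the wide error bounds we saw.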
Based on the dataset we used, our projection showed that the United States would face an exponential increase in the number of cases and deaths within the next few weeks. Initially, we were somewhat dismissive of the simulation and thought this projection was merely a remote possibility, especially because we did not have enough data to ground our findings. However, looking back now, I realize how scarily accurate our projection had been.
We also had a “paired lunch” session during which each AI4ALL alum was partnered with a Salesforce employee so that they could form a mentee-mentor relationship that could continue after the workshop. Personally, this was the most exciting session of the workshop, since I connected with many interesting people and learned about the divergent paths that led them to a tech career at Salesforce. And of course, the food and the swag were really good.
I presented a “Disaster Relief NLP” project that I had created with some high school, undergraduate, and graduate students when I first attended the AI4ALL summer program in 2019 at UC Berkeley. This involved a simple Natural Language Processing (NLP) machine learning model that would help determine where to send aid during a natural disaster. By looking at text messages sent to disaster relief hotlines and performing a binary classification to separate “AID” from “NOT AID” messages, we were able to sort through thousands of messages in a short amount of time. This was extremely important because when a natural disaster strikes, every second matters, and by sorting these messages efficiently with our model, we cut back on the time it would have taken a human to sort them.
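To give a flavor of what binary text classification involves, here is a minimal Naive Bayes classifier in plain Python. The messages and labels below are made up for illustration; the real project used an actual disaster-response dataset and a more capable NLP model.

```python
import math
from collections import Counter

# Tiny, made-up set of labeled hotline messages (illustrative only).
train = [
    ("we need water and food urgently", "AID"),
    ("trapped in house please send help", "AID"),
    ("medical supplies required at the shelter", "AID"),
    ("stay safe everyone thinking of you", "NOT AID"),
    ("what a terrible storm last night", "NOT AID"),
    ("hope the weather clears up soon", "NOT AID"),
]

# Count words per class for a multinomial Naive Bayes classifier.
word_counts = {"AID": Counter(), "NOT AID": Counter()}
class_counts = Counter()
for text, label in train:
    class_counts[label] += 1
    word_counts[label].update(text.split())

vocab = {w for counts in word_counts.values() for w in counts}

def classify(message):
    """Return the class with the highest posterior log-probability."""
    words = message.lower().split()
    best_label, best_score = None, float("-inf")
    for label, counts in word_counts.items():
        total = sum(counts.values())
        # log prior + sum of log likelihoods, with add-one smoothing
        score = math.log(class_counts[label] / len(train))
        for w in words:
            score += math.log((counts[w] + 1) / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

For example, `classify("please send food and water")` leans toward “AID” because those words appear in the aid-related training messages. The same idea, scaled up to thousands of labeled messages and richer features, is what lets a model triage a flood of hotline texts far faster than a human could.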
Working on this project as a high school student made me realize how easily data analysis could be incorporated to drive life-changing decisions and impact society on a massive scale. It can provide the hard evidence required to drive policy decisions, prevent outbreaks of diseases, send aid where required, and even help with public education campaigns.
However, I also realized that data analysis has its limitations, and that three factors are critical to making it meaningful:
(i) Quality of the datasets
(ii) Context around the dataset
(iii) Actionability of the analysis
The quality of a dataset may not always be reliable because of gaps in the methods used to collect the data. Recently, I was introduced to a book called Invisible Women: Data Bias in a World Designed for Men, written by Caroline Criado Perez. The book investigates how the lack of data collection specific to women and their needs is often the cause of systemic gender biases. In one of her more striking examples, she highlighted how relief efforts to provide housing in Gujarat, India, to a community stricken by disaster completely overlooked the need to build kitchens in the units, because women and their needs were discounted while assessing the damages to the community. Perez methodically analyzed and exposed data bias against women with examples from domains ranging from transportation and language to design, healthcare, education, and disaster relief. It was an eye-opening and powerful read because it made me aware of how easy it is to overlook our unconscious biases.
As Perez says, “Failing to collect data on women and their lives means that we continue to naturalize sex and gender discrimination, while at the same time somehow not seeing any of the discrimination.” This also reinforced my belief that diversity is incredibly important in tech if we want trustworthy AI systems, because they ultimately rely on valid datasets and models. For example, bias in facial recognition software can make criminal investigations murky. To design a world that works for everyone, we need not only women in the room but representation from all sections of society.
Reliable data analysis places a lot of importance on the context in which the data is collected, and it often relies on subject matter expertise in fields beyond the technical. For example, analyzing a COVID dataset may require an understanding of biology, geography, healthcare systems, economics, and so on. Only a multidisciplinary approach allows students to learn skills beyond technical computation and bring an analytical mind to various disciplines, so that we can ask the right questions and, subsequently, identify reliable insights.
Finally, data analysis should give us the ability to take meaningful actions and define measurable outcomes. The workshop at Salesforce showed me that the right modeling and visualization tools, by transforming data into usable illustrations, charts, graphs, and spreadsheets, can convince people and motivate them toward action.