A Data Scientist Shares Three Lessons from Connecting with End Users
I launched my career as a data scientist not when I started working with big data or took up machine learning but in my first post-Ph.D. job as a data specialist in a school district. Some of the time I spent doing the kind of analyses I had been trained to do as a doctoral student. And some of the time I spent doing completely new and different tasks, like setting up an online assessment system. But all of the time I spent there was valuable because all of it helped me understand the end users who are now front and center in my work as a Research and Data Scientist at Panorama Education. Having a deep understanding of the people at the other end of our platform helps me ground my work in what is relevant. And in turn, I avoid launching projects that just don’t align with educators’ wants and needs. Below I present the biggest lessons I’ve learned through truly knowing end users and how they shape my current data science work.
Lesson 1: Ask Questions of the Data that the End User Wants Answered
Understanding end users comes into play from the get-go, starting with the research questions I focus on. When deciding which research questions to address with our national data set, I ask myself what the Assistant Superintendent at the district where I worked would want to know from our data. For example, although looking at national trends in education is important, looking across the entire country may not inform schools and districts about what action they should take for their own students. I think back to analyses I did when working for the school district on the district’s own data and how they informed an important policy decision. The Assistant Superintendent talked excitedly about how she was going to be able to go to the Board of Education and say, “These findings are based on our students!” That is why in my current work I strive to understand national trends after taking into consideration the differences between districts and schools. Contextual factors, such as the proportion of students who are English language learners or special education students, can make “national findings” seem like they do not speak to an individual school or district. By keeping the perspective I developed during my job in a school district, I’m working to answer research questions that would be useful and actionable to someone who decides on policies and interventions in our schools.
Lesson 2: Focus on Interpretability
While working alongside educators, I learned (the hard way) to focus on interpretability. Data people like myself want to look at the data from multiple angles and then report everything we find. And initially, that’s what my colleague and I did with a big project for the school district. We were being responsible researchers by running our analyses twice, once on a dataset with more students but more limited measures and once on a dataset with fewer students but a richer array of measures, to harness all the potential insights from the data available. But did I really need to stand in front of a bunch of educators and present two separate sets of findings? The Q & A session helped me understand that presenting both sets of findings introduced unnecessary complications. What I heard from the audience was, “How can your findings tell us one thing but also tell us something different?” And they were right. It’s fine to present complicated findings to a lay audience, but it’s not OK to make them more complicated than they need to be. In a world of machine learning and highly sophisticated models, a focus on interpretability can be tricky, to say the least. Fortunately, data scientists are recognizing that a little bit of added accuracy may not be worth a ton of additional complexity. Data scientists are also creating tools to help people “peek” into our models, such as the iml package in R (short for interpretable machine learning), which shows which features in a model were most important and how they influence predictions.
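One common way to see which features matter most to a model is permutation importance: shuffle one feature's values and measure how much the model's error grows. The sketch below illustrates that general idea in plain Python. It is not the iml package's API or the analysis described above, and the data, the feature names (prior_score, attendance_rate), and the stand-in model predict_score are all hypothetical.

```python
import random


def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def permutation_importance(predict, rows, targets, feature, rng, n_repeats=20):
    """Average increase in MSE when one feature's values are shuffled across rows."""
    baseline = mean_squared_error(targets, [predict(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[feature] for r in rows]
        rng.shuffle(shuffled)
        # Copy each row, swapping in the shuffled value for this one feature only.
        permuted_rows = [dict(r, **{feature: v}) for r, v in zip(rows, shuffled)]
        permuted_error = mean_squared_error(
            targets, [predict(r) for r in permuted_rows]
        )
        increases.append(permuted_error - baseline)
    return sum(increases) / n_repeats


def predict_score(row):
    # Hypothetical stand-in for an already-fitted model (coefficients are made up).
    return 5.0 + 0.9 * row["prior_score"] + 10.0 * row["attendance_rate"]


# Synthetic students: targets are the stand-in model's output plus random noise.
rng = random.Random(0)
rows, targets = [], []
for _ in range(200):
    row = {
        "prior_score": rng.uniform(40, 100),
        "attendance_rate": rng.uniform(0.7, 1.0),
    }
    rows.append(row)
    targets.append(predict_score(row) + rng.gauss(0, 3))

importances = {
    feature: permutation_importance(predict_score, rows, targets, feature, rng)
    for feature in ("prior_score", "attendance_rate")
}
for feature, increase in sorted(importances.items(), key=lambda kv: -kv[1]):
    print(f"{feature}: MSE rises by {increase:.1f} when shuffled")
```

A ranked list like this gives a lay audience a single, readable answer to "what is the model paying attention to?" without asking them to read coefficients or model internals.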
Lesson 3: Think about How Things Work in the Real World
And, related to the above, I use everything I know about educators when building features for our products. I recently used machine learning and more traditional methods (like linear regression) to predict students’ future assessment scores based on their past scores. The aim was to find the best method (taking into account interpretability, of course!) to accurately predict assessment scores for students receiving extra help in the form of reading interventions. Educators want to know whether these students are progressing enough to earn a specified score on the assessment by the end of the intervention. In other words, are students likely to meet the goal of the intervention? If educators have an accurate prediction early that indicates a lower-than-expected score, then they can adjust what support they offer those students midway through the intervention. Although I may improve the accuracy of my models by including exogenous variables, such as which day of the week the assessment is given, this level of specificity doesn’t gel with educators’ on-the-ground experience. Why would I build into my model something that adjusts the predicted score up or down based just on the day of the week when I know that assessment scores aren’t measured automatically? We get assessment scores when one group of human beings (educators) administers the assessment to another group of human beings (students). That only happens if the following conditions are met:
- There is no school holiday.
- The educators giving the assessment are not absent from class.
- The students taking the assessment are not absent from class.
- There is no major disruption that prevents students from taking the assessment, such as a school assembly, a fire drill, the assessed student getting called to the office, or support services such as speech therapy happening when the assessment was scheduled (the list goes on and on).
- Importantly for our current times, a pandemic does not shut down schools and effectively cancel assessments for the spring of the academic year.
You get the point. Assessment administration can be unpredictable, so I wouldn’t want to build a level of predictability into my model that is at odds with what happens in the real world. Sure, the more complicated model would appear to perform better based on assessment scores from the past, but I chose a model that would work for predicting assessment scores in the future so that educators can make informed decisions today.
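The gap between looking better on past scores and working on future scores can be sketched with synthetic data. The example below is only an illustration under stated assumptions, not the model or data described above: scores are simulated so that the day of the week has no real effect, and the names (simulate, predict_simple, predict_with_day, day_offset) are hypothetical. It fits a simple line from past score to future score, then adds a per-day adjustment learned from the same training data, and compares errors on the training set and on a separate holdout set.

```python
import random

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]
rng = random.Random(42)


def simulate(n):
    """Synthetic (past_score, day, future_score) records; day has no true effect."""
    records = []
    for _ in range(n):
        past = rng.uniform(40, 90)
        day = rng.choice(DAYS)
        future = 8.0 + 0.95 * past + rng.gauss(0, 6)
        records.append((past, day, future))
    return records


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    return mean_y - slope * mean_x, slope


train, holdout = simulate(40), simulate(400)
intercept, slope = fit_line([p for p, _, _ in train], [f for _, _, f in train])


def predict_simple(past, day):
    return intercept + slope * past


# Per-day adjustment: the average training residual for each day of the week.
residuals_by_day = {}
for past, day, future in train:
    residuals_by_day.setdefault(day, []).append(future - predict_simple(past, day))
day_offset = {day: sum(r) / len(r) for day, r in residuals_by_day.items()}


def predict_with_day(past, day):
    return predict_simple(past, day) + day_offset.get(day, 0.0)


def evaluate(predict, records):
    errors = [(future - predict(past, day)) ** 2 for past, day, future in records]
    return sum(errors) / len(errors)


results = {
    name: {"train": evaluate(fn, train), "holdout": evaluate(fn, holdout)}
    for name, fn in (("simple", predict_simple), ("with_day", predict_with_day))
}
for name, scores in results.items():
    print(f"{name}: train MSE {scores['train']:.1f}, holdout MSE {scores['holdout']:.1f}")
```

By construction, the day-adjusted model can never have a higher training error than the simple one, because the offsets are tuned to the training residuals. Whether that advantage carries over to the holdout scores is the question that matters, and when the day of the week has no real effect, as assumed here, the adjustments are fitting noise.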
Conclusion: How I Stay Connected Now
So I hope you’re all convinced that you should stop doing data science for a while to embed yourself with the same types of people who make up your end users. No, you’re not? Well, that’s OK because more than five years after leaving my job in a school district, I still find ways to really understand end users. Like the “ask-a-teacher” Slack channel Panorama has, where former teachers will answer questions from people like me. It’s a go-to when I need an inside opinion from someone who has lived that experience. Or by joining calls with our actual clients, who are facing an ever-changing education landscape, as we solicit their feedback on new features. Or through key insights that client-facing teammates share from all their great work directly with the educators on the ground. And in this way, I can grow my technical skills as a data scientist and keep my connection to the people who ensure that my findings matter.