Transcript: Women Who Code — Silicon Valley Full Interview with Dr. Chantal D. Larose

Tenured Associate Professor of Statistics & Data Science, Department of Mathematical Sciences at Eastern Connecticut State University

Dianne Jardinez
WomenWhoCode Silicon Valley
16 min read · Nov 25, 2020


Dr. Chantal D. Larose, Associate Professor of Statistics & Data Science at Eastern Connecticut State University

Dr. Chantal Larose’s expertise spans predictive analytics, data science, missing data analysis, and R programming. She loves to find the story behind the data. As a Tenured Associate Professor of Statistics and Data Science in the Department of Mathematical Sciences at Eastern Connecticut State University, her objective is to inspire students to love statistical analysis as much as she does, whether those students are math whizzes or math-phobes. She earned her Ph.D. in Statistics from the University of Connecticut in Storrs, Connecticut, with the dissertation “Model-Based Clustering of Incomplete Data.”

I had the pleasure of interviewing Dr. Chantal Larose to learn about her passion for statistics and how it shaped her career, take a deep dive into incomplete data and multiple imputation, and discuss her unique perspective in her field and how to be successful in it. The main content of our interview is below. On behalf of Women Who Code — Silicon Valley, I thank Dr. Chantal Larose and appreciate her taking the time to be a part of our #ShoutoutSaturday series.

Passion for Statistics

What was the turning point for you in finding your passion for statistics, especially finding your love for statistics as an Undergrad?

Yes. Well, looking back, hindsight is 20/20, I was always a little statistically inclined. For example, as a kid I would open up my M&Ms and end up making little bar charts of the M&M colors, like here, all the red ones, and here, all the brown ones, and all the different colors, and then proceed to devour all of them. That kind of behavior, looking back from where I ended up, makes a little bit of sense.

And then as I was getting older, my dad was a statistics professor. And on occasion, in the course of conversation, he would bring up and explain a concept or two of interest, very casually, you know, no pressure, it’s not like you’re going to learn about standard deviation today; it was very accessible. I actually ended up learning about standard deviation over lunch using tater tots. It was great. If they’re all bunched together on the plate, all very close together, it’s a low standard deviation, and if they’re all spread out all over the place, it’s a high standard deviation. A lot of food going on in my early academic career.

So when it came time for college, I knew I wanted some kind of STEM career. And I knew I wanted to be a professor. But I didn’t know what subject I wanted to specialize in yet. I had taken physics classes, I had taken up to CP Calculus in high school. And I wasn’t ready to say, “Okay, this is the particular field I’m going to spend the next X number of years in school for.” So I went for my Undergrad in statistics, specifically thinking, I’m not sure what field I want to end up in. It could be stats, it could be math, it could be physics, it could be engineering, but I like STEM. And since statistics is used in a lot of different places, I could pivot. I knew once I made up my mind and finished the Undergrad degree, I could pivot to a particular field, once I figured out what I really liked. And then I just grew into statistics. As I was going through the Undergrad degree, if I have to point to one particular class that I remember really enjoying, it would be my regression course, called Analysis of Experiments. It covered all the different regression models, multicollinearity, and data transformations. And the exam booklets were enormous because there were just pages of regression output, sums of squares, and residual plots, and during the exam, you had to figure out what output to use.

Quote by Dr. Larose
“Since statistics is used in a lot of different places…I knew once I made up my mind and finished the undergrad degree, I could pivot to a particular field…and then I just grew into statistics.”

So you were given more than you needed. And you were given some questions, and you had to figure out, okay, where is the stuff I need to answer the question? What do these numbers mean? And then how can I use that to answer whatever I’m asked in the exam? And it felt really rewarding. You got to glean these stories from all these different data sets, and you got to pull these interpretations out of a mess of data and output, and it was all very practical and very story-driven. And it turns out, that in particular is what I love doing. It’s what I love to teach, what I love to do in my research, and what I love to put in my textbooks: the how, the why, and the what-does-it-mean of statistics and data science. That was my favorite thing about my latest book, Data Science Using Python and R: being able to explain the code line by line, step by step, and then walk the reader through why we are using this code, here are the results, and here is what the results mean.

Incomplete Data with Video Games

It sounds like you have a very methodical approach to data. Can you go deeper into incomplete data? What’s interesting about analyzing it? And does that align with your methodology when you “attack” data?

Ah, I do all right. Short answer: yes, it aligns. And just before I talk about incomplete data, you mentioned that it’s a very methodical approach to data analysis. That’s definitely an important part. I feel a lot of people say that their subject is like a mix of art and science, which is true for a lot of people in STEM. And it’s also true for data science, because you need the methodology, you need the step by step: okay, you need to understand the problem, you need to pre-process the data, you need to establish the baseline. But there’s so much room for creativity and exploration, and balancing those two is a lot of fun.

Quote by Dr. Larose
“I feel a lot of people say that their subject is like a mix of art and science, which is true for a lot of people in STEM.”

But to your original question, I love incomplete data. It excites me; it’s really interesting. So let’s start with a motivating example, very methodical here, very step by step. I love video games. So that’s going to be part of our context. Say you give out a survey asking people, on average, how many hours per day do you spend playing video games? That’s question one. Question two: how old are you? You get all the answers back, and say 10% of people have skipped over the video game question. So 10% of the people you gave your survey to have only filled in their age, and not the video game question.

So what do you do? Well, if you want to model video game hours by age, you’re in a little bit of a bind. Because if you go to do that regression, most statistical programs, by default, will exclude the incomplete records from the analysis. So we have 10% of our video game values missing. If we do the regression anyway, without addressing it, not only are those missing values for video games going to be gone, but their associated age values, which we still have, will also be gone. So removing the incomplete records is inefficient, but it can also be dangerous for your final result. How dangerous depends on what kind of missing data you’re dealing with. Maybe that random 10% just forgot about the question; there’s no system to it, no pattern. In that case, throwing out the incomplete records is not going to introduce any bias to the results, but it’s still inefficient. It’s going to shrink the sample size, and it’s going to be a waste of time and money.
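To make that default behavior concrete, here is a minimal sketch in Python with pandas; the survey values below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical survey: everyone gave their age, but three
# respondents skipped the daily video game hours question.
survey = pd.DataFrame({
    "age":   [18, 22, 25, 31, 40, 47, 52, 60, 65, 70],
    "hours": [4.0, 3.5, 3.0, np.nan, 2.0, 1.5, np.nan, 1.0, 0.5, np.nan],
})

# Listwise deletion, the default in most regression routines:
# any row with a missing value is excluded entirely.
complete = survey.dropna()

# The three observed ages in the incomplete rows are discarded
# along with the missing hours, shrinking the sample.
print(len(survey), len(complete))  # 10 7
```

Fitting a regression of hours on age with a statistics library would quietly use only those seven complete rows unless the missing data is addressed first.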

Quote by Dr. Larose
“Removing the incomplete records is inefficient. But it can also be dangerous for your final result. How dangerous depends on what kind of missing data you’re dealing with.”

But what about some more complex cases? For example, what if older people tended not to answer that question? Now, all of a sudden, the missingness isn’t random. You’re missing data based on an observed variable: in this simple example, the older you are, the more likely you are to skip the question. So that’s one kind of pattern to the missingness.

Another kind of possible pattern is that people with a very high daily average of playing video games skipped that question because they didn’t want to admit how much they play video games. You still have missing data, but it’s not based on any other observed value. It’s based on what the missing value itself would be. So people who spend tons of time playing video games per day do not answer the video game question.

Now we have three kinds of missingness in the data. And if there’s any kind of pattern to the missingness, whether it’s based on an observed variable, like age, or based on the variable itself that is missing, like when you feel you play video games too much and so you don’t answer the video game question, then when you go to take your data and create a regression model of video game hours on age, you’re going to get bias in your results. For example, if older people tended not to answer the video game question, you’re going to lose those older records, so your complete data is going to look much younger than your data would if you had kept all of the records.
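A quick simulation shows the bias she describes when missingness depends on an observed variable like age. The population and the skip rates below are assumptions chosen just to illustrate the effect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated respondents aged 18-70; assume older people are
# much more likely to skip the video game hours question.
ages = rng.integers(18, 71, size=10_000)
p_skip = np.where(ages >= 45, 0.40, 0.05)  # assumed skip rates
skipped = rng.random(10_000) < p_skip

# Complete-case analysis keeps only the people who answered.
complete_case_mean = ages[~skipped].mean()
full_mean = ages.mean()

# The complete cases look noticeably younger than the full sample.
print(round(full_mean, 1), round(complete_case_mean, 1))
```

Dropping the incomplete records here systematically understates the average age, exactly the "much younger" distortion described above.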

The Multiple Imputation Approach

So missing data is a problem. It’s a problem in survey data. But it’s also a problem when you’re using sensors or machines to collect data, if they happen to malfunction, or in some other application like that. And there are many smart people who have developed ways to address missing data. There’s a whole bunch of them. I’m going to skip right to my favorite, which is called multiple imputation. Multiple imputation uses statistical models to generate a bunch of different potential values for each missing value. So if you’ve got one hole in your data, you can generate five potential data points, all of which are different, to fill in that one hole.

Quote by Dr. Larose
“Missing data is a problem…And there are many smart people who have developed ways to address missing data. There’s a whole bunch of them.”

And so what you can do is, once you’ve generated these five potential values, you put them into your data. First, we’re going to put the first potential value in the data set; this is going to give us a complete dataset. Okay, great. Do that again with the second imputed value and you’re going to get another complete data set, but it’s going to be different from the first one, and so on. With multiple imputation, you’re going to end up with multiple complete datasets, which have all of your original data, including, for example, the older people in your data set, and they’re going to differ from each other based on what simulated value you used to plug the hole in your data. You end up with, say, five different data sets. And now you can do your regression model. In fact, you can do your regression model five different times with five different data sets, and you’re going to get five different models. They may be very different, they may be not so different, but they will be different because you’re regressing on different data sets.

And what multiple imputation does in its last stage, is it lets you combine those regression estimates and standard errors in such a way that it enables you to get what you really wanted in the first place, which was a single regression solution. We just want to get our regression equation and missing data is kind of something we have to solve along the way. But multiple imputation also lets us talk about how variable the original data was and how variable your imputed data is. For example, in those five data sets or five data points that you used to fill in the holes in your data, there’s variability there. We’ve got variability in the original data. We’ve got variability in the imputations. And thus we can talk about how variable your ultimate answer is. A lot of the time if you use other methods to fix missing data, you can’t talk about how variable your imputed values were but with multiple imputation it says, you know, yes, we give you this nice answer, and yes, the data has some variability. And yes, we’re acknowledging that our solution to it also has some variability. So it’s this kind of really elegant solution to address the missing data problem, while still accounting for the variability in the way you solve the problem.
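The three stages she outlines — impute several times, analyze each completed dataset, then pool — can be sketched with plain NumPy. The imputation model below (a regression fit to the complete cases plus normal noise) and the toy data are simplified stand-ins; real analyses would use a dedicated package such as R’s mice:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data: daily hours falls with age; three hours values missing.
age = np.array([18, 22, 25, 31, 40, 47, 52, 60, 65, 70], dtype=float)
hours = np.array([4.0, 3.6, 3.1, np.nan, 2.2, 1.8, np.nan, 1.1, 0.6, np.nan])
miss = np.isnan(hours)

def ols(x, y):
    """Return the slope and its standard error for simple OLS."""
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    s2 = resid @ resid / (len(x) - 2)
    return b1, np.sqrt(s2 / ((x - x.mean()) ** 2).sum())

# Stage 1: fit an imputation model on the complete cases.
b1c, b0c = np.polyfit(age[~miss], hours[~miss], 1)
resid_sd = np.std(hours[~miss] - (b0c + b1c * age[~miss]))

# Stage 2: build m completed datasets and analyze each one.
m = 5
slopes, ses = [], []
for _ in range(m):
    filled = hours.copy()
    filled[miss] = b0c + b1c * age[miss] + rng.normal(0, resid_sd, miss.sum())
    b1, se = ols(age, filled)
    slopes.append(b1)
    ses.append(se)

# Stage 3: pool with Rubin's rules -- one estimate whose variance
# reflects both the original data and the imputations themselves.
q_bar = np.mean(slopes)              # pooled slope estimate
w = np.mean(np.square(ses))          # within-imputation variance
b = np.var(slopes, ddof=1)           # between-imputation variance
total_var = w + (1 + 1 / m) * b
print(round(q_bar, 3), round(np.sqrt(total_var), 3))
```

The extra (1 + 1/m)·B term is exactly the acknowledgment she mentions: the pooled standard error grows when the five imputed datasets disagree with each other.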

How do you decide on the approach for multiple imputation? And what does it involve?

There are so many different ways that you can impute your data. And yeah, there is mean imputation, median imputation, there’s single imputation, where you know, you can fit, say, a normal curve, if your data is normally distributed and generate a value from there, but only once. I work primarily in R, the statistical programming language, and there’s a bunch of packages that offer different approaches to multiple imputation.

The thing about multiple imputation is that you have to make sure that your imputations match the distribution of your original variables. For example, with mean imputation, if you’re just taking the mean of a variable and using it to fill in all the gaps, and you have what used to be a normal curve, what you end up with is a normal curve with a spike in the middle. So there’s less variability in the imputations that you’re making, and you’re reducing the variability in the variable. One of the things that multiple imputation, and single imputation from a fitted distribution, let you do is preserve the variability that you had originally.
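That spike can be demonstrated directly. This sketch compares mean imputation against drawing fills from a fitted normal curve, on simulated data (all values are made up):

```python
import numpy as np

rng = np.random.default_rng(2)

# A normally distributed variable with about 30% missing at random.
x = rng.normal(50, 10, size=5_000)
observed = x.copy()
observed[rng.random(5_000) < 0.3] = np.nan
miss = np.isnan(observed)

# Mean imputation: every hole gets the same value -> a spike.
mean_filled = observed.copy()
mean_filled[miss] = np.nanmean(observed)

# Single stochastic imputation: draw each fill from the fitted curve.
mu, sd = np.nanmean(observed), np.nanstd(observed)
draw_filled = observed.copy()
draw_filled[miss] = rng.normal(mu, sd, miss.sum())

# Mean imputation shrinks the spread; drawn fills preserve it.
print(round(np.std(mean_filled), 2),
      round(np.std(draw_filled), 2),
      round(np.nanstd(observed), 2))
```

The standard deviation after mean imputation is visibly smaller than the observed data’s, while the draws from the fitted curve keep it roughly intact.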

And there’s no one black-box way to do the best imputation. The way that you impute the data is going to depend on a bunch of different things, including the pattern of your data. So there are packages that are specifically designed to impute normally distributed variables, or say a bivariate normal variable; you can impute values from that. But if you have different distributions, you need to take extra care, and it’s not just a plug-in-your-data, get-imputations-out process. If you’re building a model, you want to do enough exploratory data analysis that you can anticipate some expected results, so you’re not surprised by your model. You want to make sure that what comes out of the multiple imputation package, your imputations, matches what you expected it to be. You definitely need that awareness that what you’re doing is going to impact the data and is going to impact the model down the road.

Quote by Dr. Larose
“There’s no one black box way to do the best imputation. The way that you impute the data is going to depend on a bunch of different things, including the pattern of your data.”

Dr. Larose’s Perspective in her Field

When you transform data, there’s a certain voice or perspective that you have; how do you find your unique voice as an Analyst and as a Professor?

Well, in my student evaluations for my first semester as a teaching assistant and brand-new graduate student, I was called quirky by two different students. So finding my voice has apparently never really been the major problem. The thing was, I was a little reluctant to use it at first. I know that I like to be really enthusiastic about things like graphs and normalization (I love normalized histograms) and about putting in video game references and things like that. But as I’ve gotten more comfortable using that voice, and really just more comfortable in my skin as a data analyst and a data scientist, I find a lot more fun in what I’m doing, because I can be a little bit more relaxed about it, more genuine about getting excited over things like the standard error of regression statistic and all of this.

And it applies to teaching too, in a big way. The more I get into it in my courses, the more there seems to be a positive effect. I teach the core Data Science courses here at Eastern Connecticut State University, courses such as Introduction to Data Science, going up through regression, one-sample inference, classification and model evaluation, and then estimation, misclassification cost, and clustering. So through those courses, I started introducing video game examples and some more fun datasets, using, say, pictures of favorite movies or video games, or what have you, in my course notes. I found that the students responded much more positively and seemed to get comfortable with the material more quickly than when I was doing my non-video-game, non-movie-related approach to the lecture. I’m teaching data science this semester, and I’ve noticed that some students have even started replicating that style when completing their assignments.

Quote by Dr. Larose
“As I’ve gotten more comfortable using that voice, and really just more comfortable in my skin as a data analyst and a data scientist, I find a lot more fun in what I’m doing.”

This was a surprise to me, but I just created the second report for my Intro to Data Science course, on CART models and C5.0 models to be specific. We were looking at data on classifying hazardous asteroids based on a bunch of different factors. And multiple students used pictures or video game or movie references to give some flavor to the subject matter of their report. So I saw a lot of pictures of asteroids on their title pages; I saw one report with some pictures from the Toy Story franchise because of the Buzz Lightyear connection to space; and I even saw in a different report a reference to the video game Among Us, because of the connection with trying to suss out hazardous events from a data set. And since I find that statistics and data science can be an intimidating field to get into for some people, I’m just delighted that I can make my students more comfortable with learning. And if that means using pictures of Animal Crossing when I talk about cross-validation, then that’s what it’s going to be. It’s really nice to see what they’re interested in and have them connect it to the course material. I feel that it motivates them more, especially since I’m teaching online this semester because of COVID-19. Motivation has been a major topic this semester and last semester, and to get them really interested to just dig into a data set and have some fun with it, I love to see that.

Being Successful in your Field

Okay, so now for our last question, what are your words of wisdom? For analysts or professors? And do you have any tips for success for women in this field?

For data analysts and professors, I find it’s easy when you’re talking about your work to be obscure, to throw formulas around. But what I enjoy when I’m presenting is to see people get it as I’m talking, as I’m presenting at a conference, for example. So I like to go the extra step to make sure that I bring the ideas in my presentations back to something accessible. So my advice to others would be to present your work in an accessible way. It’s not a bad thing, and it doesn’t mean that you’re not smart, just because people who have never heard about your work before can follow the main points of what you’re saying. In fact, it can help get people excited about what you’re excited about, and my whole thing is to get people motivated so we can share in the excitement that is data science.

Quote by Dr. Larose
“Present your work in an accessible way. It’s not a bad thing… In fact, it can help get people excited about what you’re excited about, and my whole thing is to get people motivated.”

And tips for women? Well, everyone’s experience is going to be different especially in tech, and especially in STEM, everyone is coming from different experiences. But I would say, don’t feel that you have to do everything. If you want to get involved in a million different things, and that’s what inspires you, and it motivates you and excites you to get to work, that’s amazing. And you know, follow that energy. But if, on the other hand, you’re happiest on a smaller set of things that you can really focus and pour yourself into, and that’s what gives you the energy to work, do that instead. It’s okay to set boundaries and say, I’m working on these things.

Quote by Dr. Larose
“Don’t feel that you have to do everything…It’s okay to set boundaries and say, I’m working on these things.”

And make sure you cultivate, or try to cultivate, interests outside of your work. Because you are a complete person; you are defined by more than what your job is, or what you got your degree in, or what your professional interests are. So, you know, try to keep that balance of what really motivates you, and work on what speaks to you outside of work. I find that it’s more fun when you find what you’re passionate about.

Dr. Chantal Larose recently co-authored her third textbook on data science, Data Science Using Python and R. The book plugs you into the world’s two most widespread open-source platforms for data science, Python and R, and was written for the general reader with no previous analytics or programming experience. You can find the printed or e-book version on the John Wiley & Sons, Inc. website.

For any additional questions, you can reach Dr. Chantal Larose at larosec@easternct.edu or connect with her through LinkedIn.

Want to hear more about our #ShoutoutSaturday series?
Follow our Official Blog for the WomenWhoCode Silicon Valley chapter!

Join us on Saturday, December 5, 2020 where we speak with Swapna Savant, Engineering Manager at Headspace, on “How to Be Intentional About Career Growth as an Engineer” for our next #ShoutoutSaturday content.

To get more updates about this event and our series, follow our social media platforms at linktr.ee/wwcodesv


Dianne Jardinez
WomenWhoCode Silicon Valley

Leading the effort on the #ShoutoutSaturday blog series for the WomenWhoCode Silicon Valley chapter. Join our community at linktr.ee/wwcodesv