Reflections on data science collaboration from an early-in-career data scientist
From January 6 to February 26 of this year, I participated in the Women in Data Science (WiDS) Datathon, held virtually on Kaggle and organized by the WiDS Worldwide team, Harvard University IACS, and the WiDS Datathon Committee in partnership with experts from Climate Change AI (CCAI), Lawrence Berkeley National Laboratory (Berkeley Lab), the US Environmental Protection Agency (EPA), and MIT Critical Data. The datathon drew more than 4,000 registrants from more than 90 countries, and Microsoft is one of its sponsors.
Data science competitions can be a great way for early-in-career data scientists to build their skills. These datathons facilitate networking and collaboration among data scientists, and they are not just for novices: experienced professionals also participate to refine their craft.
My colleague Aylin Gunal and I recruited two additional female data scientists by posting a callout in the internal Microsoft Sustainability Connected Community using Microsoft Teams. We all had backgrounds in computer science or engineering, but this was our first time collaborating on a virtual datathon.
The datathon experience
The datathon occurred in phases. In the first phase, we used a dataset describing the energy efficiency of buildings to create models that predicted a building's energy consumption as measured by the site's energy use intensity (EUI). The dataset's features included characteristics of the building and weather data for its location. The winner was the team that achieved the lowest root-mean-squared error (RMSE) on the test set. The dataset was created in collaboration with CCAI and Berkeley Lab and can be found on Kaggle.
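For context, RMSE is the square root of the average squared difference between predicted and actual EUI values, so large misses are penalized heavily. A minimal sketch of the metric (the sample numbers are purely illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-squared error: penalizes large prediction errors quadratically."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Illustrative actual vs. predicted site EUI values
print(rmse([150.0, 80.0, 210.0], [140.0, 95.0, 200.0]))  # ~11.9
```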
We kicked off the competition with frequent syncs so that we could learn each other's personalities and communication styles. After an icebreaker, we dove into exploratory data analysis to familiarize ourselves with the features of our dataset. Our team then began meeting weekly to examine the training set, which had 75,757 rows with four categorical features and 60 numerical features. Early on, we spent meetings discussing visualizations of the categorical and numerical data, identifying abnormalities in the dataset, and weighing potential algorithms to try. The virtual environment made these discussions seamless, as it was easy to share our screens and findings with each other.
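As an illustration, a first pass over a dataset of this shape might look like the sketch below (the file name is an assumption, not the competition's actual layout):

```python
import pandas as pd

# Load the training data (file name assumed for illustration)
train = pd.read_csv("train.csv")
print(train.shape)  # e.g., (75757, 64) plus the target column

# Split features by type to plan visualizations and encoding
categorical_cols = train.select_dtypes(include=["object", "category"]).columns
numerical_cols = train.select_dtypes(include=["number"]).columns
print(len(categorical_cols), "categorical |", len(numerical_cols), "numerical")

# Quick checks for abnormalities: summary statistics and per-column missingness
print(train[numerical_cols].describe().T)
print(train.isna().mean().sort_values(ascending=False).head(10))
```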
One of our major challenges was determining how to divide tasks, since some of them, such as data cleaning and data modeling, typically depend on one another. Our solution was to operate in handoffs: each person owned a part of the process, and we reviewed each step as a team. For example, Aylin drove the data cleaning and presented her work so that we could ask questions, and another teammate, Cooper Cole, researched algorithms for us to try on the data, such as CatBoost and XGBoost. After collecting the results of the data cleaning process and defining the algorithms to test, I tuned the models, using a grid search to find optimal hyperparameter values and cross-validation to avoid overfitting. We ran three iterations of this process over the course of the competition, improving our model's performance by 60 percent.
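A minimal sketch of that tuning step, assuming an XGBoost regressor, a synthetic stand-in dataset, and an illustrative hyperparameter grid (our actual search space differed):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from xgboost import XGBRegressor

# Synthetic stand-in data; in the competition this came from the cleaned training set
rng = np.random.default_rng(42)
X_train = rng.normal(size=(500, 10))
y_train = 3 * X_train[:, 0] + rng.normal(size=500)

# Illustrative grid; the real search space depends on the data and time budget
param_grid = {
    "n_estimators": [200, 500],
    "max_depth": [4, 6, 8],
    "learning_rate": [0.05, 0.1],
}

search = GridSearchCV(
    XGBRegressor(objective="reg:squarederror", random_state=42),
    param_grid,
    scoring="neg_root_mean_squared_error",  # the competition metric, negated
    cv=5,  # k-fold cross-validation guards against overfitting to one split
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)
```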
Assigning tasks minimized redundancy, and reviewing each other's work brought fresh perspectives to problems. For example, when Aylin presented her data cleaning work, I was able to ask about missing values she had noticed in the dataset and use her findings to propose which columns we should impute and which we should drop from the model.
Taking an over-the-wall approach did result in rework of the data cleaning process. For example, as part of data cleaning, we one-hot encoded the categorical features in the training set and used that data to train the model. When we fed the model the test set, it failed with errors because the test set contained category values that the training set had never seen. I reworked the solution by including the test set's category values when encoding the training set, which let us keep the categorical features in the model.
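One way to express that fix is to encode the two sets separately and then align them on the union of their columns. The sketch below uses toy data with hypothetical column names; scikit-learn's OneHotEncoder with handle_unknown="ignore" is another common remedy:

```python
import pandas as pd

# Toy frames illustrating the problem: "mixed_use" appears only in the test set
X_train = pd.DataFrame({"facility_type": ["office", "retail"], "floor_area": [1200, 800]})
X_test = pd.DataFrame({"facility_type": ["office", "mixed_use"], "floor_area": [950, 400]})

X_train_enc = pd.get_dummies(X_train, columns=["facility_type"])
X_test_enc = pd.get_dummies(X_test, columns=["facility_type"])

# Align on the union of columns so categories unseen in one set become
# all-zero indicator columns there instead of causing prediction errors
X_train_enc, X_test_enc = X_train_enc.align(X_test_enc, join="outer", axis=1, fill_value=0)
print(sorted(X_train_enc.columns))
```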
We agreed that the exploratory nature of data science made it challenging to work in a virtual environment. The scope of the problem often changed quickly, making it difficult to ask for support without teammates fully understanding the depth of a problem. In one instance, we wanted to impute missing values for one of our columns by building a separate predictive model. However, the person building that model needed to understand the revised data cleaning process. These interdependencies made the work difficult to hand off, even though the task could have been done in parallel with tuning the model.
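In sketch form, the idea was to train a helper model on rows where the column is present and predict it where it's missing; the column name and the random-forest choice below are hypothetical stand-ins:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Toy data: "energy_star_rating" (hypothetical column) has missing entries
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(200, 4)), columns=["f1", "f2", "f3", "energy_star_rating"])
df.loc[rng.choice(200, 40, replace=False), "energy_star_rating"] = np.nan

features = ["f1", "f2", "f3"]
known = df[df["energy_star_rating"].notna()]
missing_mask = df["energy_star_rating"].isna()

# Train a helper model on rows where the value is known, then predict the gaps
helper = RandomForestRegressor(n_estimators=100, random_state=0)
helper.fit(known[features], known["energy_star_rating"])
df.loc[missing_mask, "energy_star_rating"] = helper.predict(df.loc[missing_mask, features])
```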
Making collaboration work
I discovered we were not alone in encountering these challenges. I spoke with Hao Ai, a Senior Data Scientist at Microsoft, who validated my experience that collaboration in data science can be challenging, especially when working virtually. She provided some key guidance that I believe can make collaboration in data science easier, both virtually and in person:
- Obtain goal alignment: Refining and revisiting the goal of an engagement can provide needed clarity. When a team works on a project, the same goal can sometimes be reached in different ways and by asking different questions. As Hao says, “If you have different goals, perspectives, and approaches, it causes collaboration to be less efficient.” Being clear on the goal of the engagement, whether in person or virtual, is critical for successful collaboration among data scientists.
- Be mindful when scaling projects: Adding more people to a project doesn’t always yield proportional gains: it can pull some people away from the goal or create redundant work. Hao’s solution is to estimate the effort an engagement requires before staffing it with data scientists. As much as we enjoyed working together on the competition, we felt that most tasks could have been completed by two people at most, and the size of our team made it difficult to keep everyone up to date.
- Use agendas for meetings: Agendas help keep free-form discussions from taking over and keep both in-person and virtual meetings on track. They also help people prepare deliverables and organize their thoughts ahead of time. Our team did not do this, but it would have been helpful when someone missed a meeting and wanted to know how the work was changing.
- Set goals for code reviews: It’s important to complete work so that someone else can understand it. Setting goals for code reviews can make code more readable and variable names more intuitive. Aylin’s intuitive variable names made the handoff into model tuning and deployment considerably faster.
- Schedule code deployments: Aiming for regular deployments keeps the team diligent throughout the coding process and surfaces larger structural challenges. For our team, this took the form of front-loading submissions so that we could detect issues with our code earlier in the competition.
- Prepare a presentation alongside the project: The narrative that accompanies a data science project can get lost in the technical details. Preparing a presentation throughout the project to share with stakeholders and teammates helps keep tasks aligned with a common goal. Although the datathon did not require a narrative, preparing one anyway might have helped us think about the different columns in the context of the problem and what it would mean to drop or include them.
Hao affirmed that there are challenges to working in a virtual environment, but we agreed that there are advantages as well. For example, virtual meetings encourage people to be more mindful when speaking (for example, by using the hand-raising feature in Teams), which is essential for collaborating on a data science team. In my experience, virtual work also makes code reviews much easier because you are no longer hovering over someone’s shoulder to read their code.
Takeaways
From our experience, I have two key recommendations. First, keep practicing by putting your skills to the test in datathons. Second, find an experienced mentor who can provide feedback. The datathon identified areas of improvement for each of us as individuals. I learned that I am strong at spotting errors in our code that were leading our model astray, but I still need a deeper theoretical understanding of ensemble models to improve my model tuning. As a team, we learned the importance of planning work to identify the capacity we needed and of communicating well to improve our project handoffs.
As you continue to practice, I recommend engaging an experienced data scientist who can help validate your process and spur self-reflection after a datathon. Meeting with Hao helped me distinguish the mistakes I made through inexperience (such as learning to fine-tune a particular model) from those inherent to the nature of data science (such as a project changing scope). Differentiating between these two types of challenges has encouraged me to focus not only on sharpening my technical acumen but also on applying Hao’s principles to my day-to-day data engineering work by regularly reviewing code with my peers and pushing for frequent code deployments.
Conclusion
Our team successfully completed our first datathon, an intimidating yet significant milestone in practicing our skills in a competitive environment. I’ve come to see that one of the key elements of growth in data science is practice, which makes participating in datathons such a rich opportunity. Learning not only to build technical depth but also to collaborate with others is a key element of success for both aspiring and seasoned data scientists. Aylin observed that data science collaboration differs from collaborating on a typical computer science project because the tasks are so interdependent and there is always an opportunity to ask, “Have you considered this?” Collaboration in data science improves with experience, so my teammates and I look forward to future datathons, where we can carry these lessons with us, refine our plan, and crush the next competition.
Sofia Noejovich is on LinkedIn.