99P Labs Internship Reflection — Summer 2021

By Isabel Zavian · 99P Labs · Aug 13, 2021

This summer, I had the incredible opportunity to work with the 99P Labs Connected Technologies Research Group as a Data Analyst Intern. While I had prior experience working on Data Science and Machine Learning projects, I gained a tremendous amount of new knowledge, whether it was merely learning more about the company’s structure or analyzing data and building models.


After officially completing the onboarding process, I started each day by attending meetings with the entire team. On Mondays, we had an hour-long Sprint Planning Meeting to set all the tasks for the upcoming week. In the middle of the week, we had 15-minute Daily Standups to briefly explain what we had worked on the day before, update task statuses, flag any blockers, and share our goals for the current day. On Fridays, we had hour-long Sprint Reviews to reflect on the week, accept any completed tasks, and present interesting work we'd completed if we felt comfortable doing so.

These frequent check-ins were extremely valuable because I saw how much they contributed to the team's overall communication, progress, and success.

Now let’s dive into some of the technical tasks I worked on and what I learned from each one.

From Pandas to PySpark

One of the initial challenges I faced was adjusting to PySpark, which is an interface for Apache Spark in Python. Most of the projects and coursework I had completed prior to this internship used the Pandas Python library, since it is efficient and well suited to smaller datasets. It took about 2–3 weeks to become fully familiar with this new interface, but I'm extremely grateful to have learned it because it's widely used across various companies and is useful when dealing with large amounts of data.
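To give a sense of the shift, here's a minimal sketch of the same aggregation written in Pandas and then in PySpark. The file name and column names are just placeholders, not the actual project data.

```python
# Minimal sketch of the same aggregation in Pandas and in PySpark.
# "trips.csv", "trip_id", and "duration" are placeholder names.
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Pandas: convenient when the data fits comfortably in memory
pdf = pd.read_csv("trips.csv")
pandas_summary = pdf.groupby("trip_id")["duration"].mean()

# PySpark: the same logic, but evaluated lazily and distributed across a cluster
spark = SparkSession.builder.appName("pandas-to-pyspark").getOrCreate()
sdf = spark.read.csv("trips.csv", header=True, inferSchema=True)
spark_summary = (
    sdf.groupBy("trip_id")
       .agg(F.mean("duration").alias("avg_duration"))
)
spark_summary.show()
```

The syntax looks similar on the surface, but the mental model is different: Spark builds up a plan of transformations and only executes it when an action like show() is called.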

PySpark was the main interface I used throughout my entire internship as I worked on a machine learning project that had already been in progress for about 1–2 years. Although there were a variety of tasks to complete, I was mainly responsible for these four separate goals:

  • Goal 1: Exploratory Data Analysis
  • Goal 2: Enrichment Process & Visualizations
  • Goal 3: Feature Selection & Feature Engineering
  • Goal 4: Balancing Methods & Model MVP

Exploratory Data Analysis

One of my early tasks for the project was to conduct exploratory data analysis on three main datasets containing hundreds of columns in total, check for consistency across them, and identify which features might be worth exploring for our end model.

A major lesson throughout the EDA process was to never make assumptions and to always check the units for the features in the dataset. There were times when I questioned whether certain terms in the datasets were misspelled or abbreviated inconsistently, when in fact they weren't. Some features also contained values with different units in the same column, so it was essential to convert them to a common unit before moving forward to avoid inconsistencies in later steps.
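To illustrate the kind of unit normalization this involved, here's a purely hypothetical sketch in PySpark. The column names and conversion are placeholders, and df is assumed to be a Spark DataFrame loaded earlier.

```python
# Hypothetical sketch: normalize a "distance" column where some rows are recorded
# in miles and others in kilometers, flagged by a "distance_unit" column.
# Assumes `df` is a Spark DataFrame loaded earlier; all names are placeholders.
from pyspark.sql import functions as F

KM_PER_MILE = 1.609344

# Quick sanity check: which unit labels actually appear in the data?
df.groupBy("distance_unit").count().show()

# Convert everything to a single unit before any downstream analysis
df_normalized = df.withColumn(
    "distance_km",
    F.when(F.col("distance_unit") == "mi", F.col("distance") * KM_PER_MILE)
     .otherwise(F.col("distance")),
)
```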

Enrichment Process & Visualizations

After all the EDA was completed, the next step was to create an enrichment module containing the prototypes used for cleaning and improving the datasets, such as removing unnecessary values or creating unique IDs for a given characteristic. These enrichments are essential when running pipelines across multiple months' or years' worth of data, so that all the information stored in the S3 buckets can be automatically filtered as it's pulled in. Doing all of this manually would be tedious and take an enormous amount of time, so it was important to create a module that could take care of it all at once.

Although I had prior experience with object-oriented programming, creating these modules for enriching data was something new for me. It took some time to fully understand how the GTC Parsing Framework was structured since it was fairly complex, but drawing out diagrams helped me understand it better and successfully implement the necessary modules.
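Since I can't share the actual GTC Parsing Framework, here's a simplified, purely illustrative sketch of the general pattern: each enrichment is a small class with a single transform step, and a pipeline applies them in order as data is pulled in.

```python
# Illustrative sketch only -- not the actual GTC Parsing Framework API.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F


class DropInvalidRows:
    """Remove rows with missing values in required columns."""

    def __init__(self, required_columns):
        self.required_columns = required_columns

    def transform(self, df: DataFrame) -> DataFrame:
        return df.dropna(subset=self.required_columns)


class AddUniqueId:
    """Create a unique ID from a given combination of characteristics."""

    def __init__(self, key_columns, id_column="record_id"):
        self.key_columns = key_columns
        self.id_column = id_column

    def transform(self, df: DataFrame) -> DataFrame:
        return df.withColumn(
            self.id_column, F.sha2(F.concat_ws("|", *self.key_columns), 256)
        )


def run_enrichments(df: DataFrame, enrichments) -> DataFrame:
    """Apply each enrichment in order, like a small pipeline stage."""
    for enrichment in enrichments:
        df = enrichment.transform(df)
    return df
```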

Once the pipelines were run across the necessary time frame, I created visualizations to interpret the information in the summary dataframes more quickly and effectively. Although I worked on many different visualizations from bar plots to scatter plots with a hover feature, the most useful one was the heatmap, which used color coding to represent the proportional values in the datasets. These were extremely useful from an EDA standpoint because they indicated the sparsity or overall data quality of the features we were working with.
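As a rough idea of how such a sparsity heatmap can be produced, here's a hedged sketch that computes the proportion of missing values per feature per month and plots it with seaborn. It assumes a Spark DataFrame sdf with a "month" column, which is a placeholder rather than the project's actual schema.

```python
# Hedged sketch: heatmap of missing-value proportions per feature and month.
# Assumes `sdf` is a Spark DataFrame with a "month" column (placeholder schema).
import matplotlib.pyplot as plt
import seaborn as sns
from pyspark.sql import functions as F

feature_cols = [c for c in sdf.columns if c != "month"]

# Fraction of nulls per column, computed per month on the cluster
null_props = sdf.groupBy("month").agg(
    *[F.mean(F.col(c).isNull().cast("double")).alias(c) for c in feature_cols]
)

# The summary table is small, so collect it to Pandas for plotting
summary_pdf = null_props.toPandas().set_index("month")

sns.heatmap(summary_pdf, cmap="viridis", vmin=0, vmax=1)
plt.title("Proportion of missing values by feature and month")
plt.tight_layout()
plt.show()
```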

Feature Selection & Feature Engineering

With the visualizations created, the next step was to begin the feature selection and engineering process to prepare for the final MVP. One helpful resource for this was Mural, a digital workspace for visual collaboration. My team and I held multiple brainstorming sessions where each person added sticky notes describing feature relationships to analyze further and consider for the model. These sticky notes were then clustered by topic and placed on a graph with measures of importance and feasibility. The highest-priority candidates were narrowed down even further until there were about 7–14 features to build the first phase of the machine learning model with.

Since I had only worked with a couple of features in previous projects, the feature selection and engineering phase with over 50 potential options originally seemed daunting. However, learning about how to narrow down and prioritize each of them with visualization tools made the process more straightforward and helped me become more creative with problem solving.

After working on cleaning the data and prototyping feature summary statistics, such as mean, variance, and standard deviation, it was time to balance the dataset.
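A minimal sketch of how such statistics can be prototyped in PySpark, using a placeholder feature name:

```python
# Minimal sketch of per-feature summary statistics in PySpark.
# "speed" is a placeholder feature name; `df` is assumed to be loaded earlier.
from pyspark.sql import functions as F

stats = df.agg(
    F.mean("speed").alias("speed_mean"),
    F.variance("speed").alias("speed_variance"),
    F.stddev("speed").alias("speed_stddev"),
)
stats.show()
```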

Balancing Methods & Model MVP

Balancing is crucial when the dataset being used for the model is imbalanced, or in other words, when the classes aren't represented equally. If the dataset isn't properly balanced, there is a significant risk of underfitting or overfitting and ending up with misleading scores. For example, say the training set contains 95 values classified as green and 5 classified as blue. Since the model is trained only on these observations, it'll classify almost all unseen data as green and produce a deceptively high accuracy value.
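A quick toy calculation makes that 95/5 example concrete: a "model" that always predicts the majority class scores 95% accuracy while never identifying a single blue case.

```python
# Toy illustration of the 95/5 example above.
from sklearn.metrics import accuracy_score, recall_score

y_true = ["green"] * 95 + ["blue"] * 5
y_pred = ["green"] * 100  # always predict the majority class

print(accuracy_score(y_true, y_pred))                  # 0.95 -- looks great
print(recall_score(y_true, y_pred, pos_label="blue"))  # 0.0  -- finds no blue cases
```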

There were three balancing methods that I implemented: SMOTE, Random Oversampling, and Random Undersampling. Each one was tested on a Logistic Regression and a Random Forest model with some additional metrics, and success was measured by evaluating the ROC-AUC score. The Random Oversampling technique seemed to perform best for the Logistic Regression model, with a score of over 73%.
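Here's a hedged sketch of that kind of comparison using the imbalanced-learn library. The splits and parameters below are placeholders rather than the project's actual setup, and X and y are assumed to be a prepared feature matrix and binary 0/1 labels.

```python
# Hedged sketch: compare SMOTE, random oversampling, and random undersampling
# on logistic regression and random forest, scored by ROC-AUC.
# `X` and `y` are assumed to be a prepared feature matrix and binary 0/1 labels.
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

samplers = {
    "SMOTE": SMOTE(random_state=42),
    "Random Oversampling": RandomOverSampler(random_state=42),
    "Random Undersampling": RandomUnderSampler(random_state=42),
}
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(random_state=42),
}

# Resample only the training data, then evaluate on the untouched test set
for sampler_name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X_train, y_train)
    for model_name, model in models.items():
        model.fit(X_res, y_res)
        scores = model.predict_proba(X_test)[:, 1]
        auc = roc_auc_score(y_test, scores)
        print(f"{sampler_name} + {model_name}: ROC-AUC = {auc:.3f}")
```

One detail worth calling out: the resampling is applied only to the training split, so the test set still reflects the original class distribution.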

While there were many additional models, such as XGBoost, and techniques left to test, the main takeaway was that balancing and minimizing overfitting are necessary before tuning any hyperparameters to further improve the overall model performance and final ROC-AUC score.
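Once a balanced baseline is in place, a standard next step is cross-validated hyperparameter tuning. The grid below is a generic illustration rather than the project's actual configuration, and it reuses the resampled training data (X_res, y_res) from the previous sketch.

```python
# Generic sketch of cross-validated hyperparameter tuning, scored by ROC-AUC.
# `X_res` and `y_res` are the resampled training data from the previous sketch.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

param_grid = {"n_estimators": [100, 300], "max_depth": [5, 10, None]}
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    scoring="roc_auc",
    cv=5,
)
search.fit(X_res, y_res)
print(search.best_params_, search.best_score_)
```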

Organization

One of the biggest lessons I learned throughout my experience was the importance of organization, especially as a Data Scientist or Data Analyst. It's easy to get lost in JupyterHub notebooks and forget your initial approach to an assignment when you return to the material another day. It's even harder for someone else to follow your work and understand everything without sufficient context. Therefore, I learned that it's crucial to split notebooks into sections by topic, add comments to almost every cell, and name functions and variables descriptively.

Besides being organized from a technical standpoint, it's also important to store information efficiently across different tools. Whether I had completed a large task or temporarily moved on to a different one halfway through, I made sure to document all the necessary updates and progress in PowerPoint slides and Confluence, a web-based corporate wiki. This was a great way to explain my entire thought process for a task, share useful code or visualizations, provide screenshots of observations, and attach files and notebooks for others to use as well.

As mentioned earlier, Mural was another helpful resource when working on the feature selection process. After learning about its functionality and seeing how it can condense and categorize complex information, I gained a better understanding of the steps it takes to build a machine learning model.

Taking advantage of these resources taught me to be more proactive about staying organized, because there are often powerful tools out there that you don't know about yet. Having access to an entire set of unfamiliar resources completely transformed the way I approached my tasks and introduced a sense of creativity that I hope to apply in the future.

Final Remarks

My summer experience with the 99P Labs Connected Technologies Research Group has not only strengthened my passion for Data Science and teamwork, but also significantly transformed and improved my technical and behavioral skills. I’ve gained knowledge in a variety of different areas ranging from machine learning concepts to effective communication and organization within a team. I’d like to give a special thanks to my supervisor, Rajeev Chhajer, and my mentor, Eric Bauer, for giving me the opportunity to work with such an innovative company, for supporting me throughout the entire summer, and for introducing me to new challenges that significantly shaped my personal growth. I’d also like to thank the rest of my team for being flexible with their time and helping out with any issues that I encountered. I am eager to apply everything I’ve taken away from this experience to all of my future endeavors and couldn’t have hoped for a better way to start my professional journey.
