The “Super” Data Person Toolkit

Andrew Escay
Data @ First Circle
7 min read · Apr 16, 2020

Advice for aspiring data people tends to focus on which online classes to take, what software to learn, or what personal project to tinker with, and interviews tend to prioritize these over anything else. However, there’s rarely any talk about the other side once you’re hired: how are organizations set up to empower their data people to achieve their full potential?

The support system I discuss in this post is First Circle’s Data Stack and how it serves as the backbone that enables our data team to keep up with the ever-increasing expectations of business stakeholders.

About me

I’m one of the first three members of First Circle’s data team. My love for the data practice began with an amazing internship which jump-started my learning. I went down the path of taking online classes, attending workshops and bootcamps, and doing personal projects. I’ve since become obsessed with data products that help empower data analysts.

Development Environment: what you’ll use for 90% of your work

To clear up some terminology: the Development Environment (‘Dev Environment’ for short) is the collection of software that helps an analyst deliver output. Your local laptop and software such as Microsoft Excel could be considered your dev environment.

It’s easy to imagine, however, the limitations this imposes in terms of scale (handling large datasets), collaboration (passing around a USB drive or downloading from GDrive isn’t always a nice experience), and permissioning (what if your laptop gets stolen?).

On the data team at First Circle, we’ve created a cloud-based Dev Environment with everything you need to do your job. It can be spun up in minutes, accessed from any browser (including on iPads and Chromebooks), and includes all the tools necessary to start working.

The choice of the “best” tool varies from business to business depending on their specific needs. Our stack in particular also evolves continuously as we adjust to the needs of the business. What remains constant, and what I’d like to highlight in this post, are the principles behind how we designed our Dev Environment.

Principle 1: Flexibility

“What programming language should I use?”

As a fresh grad, I was always caught up in the R vs Python debate for data analytics/science. However, this debate always ends up, one way or another, at a core principle: “Whatever gets the job done to the best of your ability, in the most efficient way possible.” This led us to design the environment to support both, depending on each analyst’s preference.

First Circle’s Tools of Choice: RStudio (R), JupyterLab or Notebooks (Python), and VSCode

Source: Coding Language Alignment Chart from Reddit. This is a meme.

Principle 2: Consistency

“How do we make our analysis reliable and reproducible?”

When you start working in a data team (or any team that deals with tech, really), it is almost inevitable to hear the following phrase: “This runs on my computer, I don’t know why it doesn’t run on yours.”

It’s always because a package is missing, a password wasn’t set up in a configuration file, or the code was built on a different operating system from the user’s. To solve this, we opted for Docker. Our development environments all build from Docker images that are pre-configured with all the packages, configurations, and variables any data team member needs. This standardizes most, if not all, of our environments, and the team almost never hears that dreaded phrase anymore.

First Circle’s Tool of Choice: Docker

Source: Julia Evans
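To make the idea concrete, here is a minimal sketch of what such a pre-configured image could look like; the base image, package file, and settings are illustrative, not our actual configuration.

```dockerfile
# Illustrative analyst image: every dev environment builds from one
# shared, pinned image, so "it runs on my computer" holds for everyone.
FROM python:3.8-slim

# Pin the team's shared package set so all environments resolve identically.
COPY requirements.txt /tmp/requirements.txt
RUN pip install --no-cache-dir -r /tmp/requirements.txt

# Bake in shared, non-secret defaults; secrets are injected at runtime.
ENV TZ=Asia/Manila

WORKDIR /home/analyst
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--no-browser"]
```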

Principle 3: Scalability

“What if my laptop doesn’t have enough power for the job?”

The resource requirements of data workflows vary significantly. If you’re writing a simple script to do pivots and calculations, you likely don’t need more than 2GB of RAM. The moment you start running machine learning training jobs, however, it’s not uncommon to need more than the 16GB of RAM that (hopefully) came with your company-issued laptop. What better way to solve the problem of scale than the cloud!

When you log in to our development environment page, you’re asked to select the size of cloud computer you need (across various CPU and RAM configurations), and it is spun up for you. This lets us conserve resources with smaller machines for simpler jobs and scale up at the click of a button the second we need more.

First Circle’s Tools of Choice: Kubernetes & cloud service providers

Source: Smooth Sailing with Kubernetes by Google Cloud
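For a sense of how the size picker can be wired up: since our environments are orchestrated with JupyterHub on Kubernetes (more on that in Principle 5), something like KubeSpawner’s profile_list can present machine sizes on login. The sizes below are illustrative, not our actual offerings.

```python
# jupyterhub_config.py (sketch): each profile shows up as an option on
# the spawn page, and Kubernetes schedules a pod with those resources.
c.JupyterHub.spawner_class = "kubespawner.KubeSpawner"

c.KubeSpawner.profile_list = [
    {
        "display_name": "Small (2 CPU / 4 GB) for everyday analysis",
        "kubespawner_override": {"cpu_limit": 2, "mem_limit": "4G"},
    },
    {
        "display_name": "Large (8 CPU / 32 GB) for ML training jobs",
        "kubespawner_override": {"cpu_limit": 8, "mem_limit": "32G"},
    },
]
```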

Principle 4: Mobility

“My laptop is broken (or lost)… what now?”

Since we work on a cloud-based system, everything in your chosen IDE (JupyterLab, RStudio, VSCode) runs on a remote computer. Any team member can start work on one device and swap to another instantly just by logging in on the new device and picking up where they left off. You can also kick off a workflow and simply shut your laptop down, knowing it’s still running up in the cloud until you reconnect. No more long hours staring at your screen until your code finishes running, or being stuck at the office because you need to stay connected to the internet.

Principle 5: Security and Administrator Oversight

“Okay… but how do we control against breaches?”

Hopefully, I’ve been able to convey the power of the cloud-based Dev Environment. In the wrong hands, however, it can be destructive, which makes it crucial to monitor usage and manage access to the development environment. This is where we turn to JupyterHub, which helps us orchestrate, manage, and monitor the use of the development environment by multiple users.

First Circle’s Tool of Choice: JupyterHub

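As a rough illustration of the oversight JupyterHub provides, a few lines of its Python configuration control who can log in, who administers the hub, and how idle (and costly) cloud sessions get culled. The usernames and timeout below are placeholders.

```python
# jupyterhub_config.py (sketch): access control and admin oversight.
c.Authenticator.allowed_users = {"ana", "ben", "carla"}  # who may log in
c.Authenticator.admin_users = {"ana"}  # who can manage users and servers

# Shut down notebook servers idle for an hour to reclaim cloud resources.
c.JupyterHub.services = [
    {
        "name": "idle-culler",
        "command": ["jupyterhub-idle-culler", "--timeout=3600"],
    }
]
```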

The Toolkit Accessories

So far, I’ve shared the tools we used to build the main development environment at our organization, but it wouldn’t add nearly as much value without the other tools that complete our data stack. This section briefly covers the rest of our end-to-end data stack, which is managed by six data team members serving an organization of over 150 people.

Transformation: “How do we make data digestible?”

Tool of choice: dbt

We rely on dbt to transform data into the appropriate format so that analysts and other business stakeholders can have a seamless experience using data. Without dbt, the whole process of data cleaning, processing, and formatting becomes a cumbersome task that takes hours (or even days) to organize, even with the best development environment.

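For readers new to dbt, a model is just a SQL SELECT plus a bit of configuration, and dbt takes care of materializing it in the warehouse in the right order. The model and column names below are hypothetical, not from our actual project.

```sql
-- models/marts/daily_applications.sql (hypothetical dbt model)
-- dbt materializes this query as a table analysts can query directly.
{{ config(materialized='table') }}

select
    cast(submitted_at as date) as submitted_date,
    count(*)                   as applications,
    avg(amount_requested)      as avg_amount_requested
from {{ ref('stg_applications') }}  -- cleaned staging model upstream
where submitted_at is not null
group by 1
```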

Reporting: “How do we communicate our output?”

Tools of choice: automated reports on Slack, Google Sheets reports, Knowledge Repo

Without these tools, all the insights we generate in our Dev Environment would be much harder to disseminate across the organization. Often, stellar work goes to waste because the right people in the business don’t know where to find it. We’ve found that these three tools help us put a spotlight on the work we do and make our reports meaningful and easy to digest for our end users.

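As an example of how lightweight an automated Slack report can be, here is a hedged sketch that posts a daily summary to a channel via an incoming webhook; the webhook URL and metrics are placeholders.

```python
# A minimal automated Slack report via an incoming webhook (sketch).
import requests

WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder

def post_daily_report(new_applications: int, approval_rate: float) -> None:
    """Push a short, digestible summary to the team's Slack channel."""
    text = (
        ":bar_chart: *Daily funnel report*\n"
        f"- New applications: {new_applications}\n"
        f"- Approval rate: {approval_rate:.1%}"
    )
    requests.post(WEBHOOK_URL, json={"text": text}, timeout=10).raise_for_status()

post_daily_report(new_applications=42, approval_rate=0.315)
```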

Automation: “Let’s automate our jobs!”

Tool of choice: Apache Airflow

Repeating analyses, presenting metrics, and delivering reports will always be part of the role of a data person. Having a powerful tool to automate these workflows saves a lot of time and effort. The tool we’ve come to love is Airflow, for its flexibility, stability, and monitoring capabilities, which make it easy to sleep at night knowing that your reports and pipelines are all running as intended.

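To show the shape of such a workflow, here is a hedged sketch of an Airflow DAG that refreshes models and then sends the report each morning; the task bodies and schedule are placeholders (the import path is Airflow 2.x).

```python
# A daily reporting pipeline as an Airflow DAG (sketch).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x path

def refresh_models():
    ...  # e.g. run the dbt transformations

def send_report():
    ...  # e.g. post the Slack summary from the previous snippet

with DAG(
    dag_id="daily_reporting",
    start_date=datetime(2020, 4, 1),
    schedule_interval="0 7 * * *",  # every day at 07:00
    catchup=False,
) as dag:
    refresh = PythonOperator(task_id="refresh_models", python_callable=refresh_models)
    report = PythonOperator(task_id="send_report", python_callable=send_report)
    refresh >> report  # the report runs only after models are refreshed
```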

Democratization: “How do we enable other people to maximize the value from data?”

Tool of choice: Metabase

Data is relevant to nearly every member of an organization, which is why we believe it’s important to give everyone the power to explore data on their own and improve their workflows with it. Metabase has allowed us to open up data exploration to many more people outside the data team and has continued to bolster the data-driven culture we want in our organization.


In Conclusion

By no means do we claim to have the perfect data toolkit, but we’ve spent a lot of time designing, iterating on, and developing our data stack to move us along that path. We believe strongly that a well-built data stack is a major part of what it takes to hone the skills of good talent and create a strong environment for success. If this post got you thinking about designing data stacks and building successful data teams, feel free to reach out to our team on LinkedIn and get in touch! If you feel as strongly as we do and want to discuss the path to becoming a super data person, do reach out to us at https://www.firstcircle.ph/careers. We’d love to chat!

Shout out to Nigel Rimando for editing, and to the First Circle data team!
