At Windfall, our goal is to determine the net worth of every person in the world. We’re approaching this problem by first targeting affluent US households and drawing on a diverse set of data sources. It’s an interesting challenge for the data science team because it involves understanding our customer and vendor data sources, building robust models for estimating the net worth of all US households, and providing value to our customers through targeted campaigns. We use a variety of tools to accomplish these goals:
- Data Access: SQL (BigQuery)
- Exploratory Data Analysis (EDA): Jupyter + R kernel
- Model Production: Dataflow (Java + Apache Beam)
- Dashboarding: Google Data Studio
- Reports: Google Docs
- Version Control: GitHub
- Knowledge Repo: Confluence
- Data Wrangling: Google Sheets + bq (BigQuery command line)
Given the small size of our team, we needed to select tools that empower our data scientists to build predictive models and put models into production. We’ve focused on tools that will enable collaboration as we scale our team, and support reproducible research.
Data Access
A core aspect of data science at Windfall is working with customer and vendor data to determine how to model net worth. SQL is our primary tool for this work, and most of our data lives in a BigQuery data warehouse. Data scientists sometimes need to work with data that is not yet in the warehouse, such as survey data stored in Excel files; in these cases, we use scripting languages such as R for data access.
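As a sketch, a typical data-access query joins household and property records before modeling; the dataset, table, and column names below are hypothetical, not Windfall's actual schema:

```sql
-- Hypothetical example: pull household attributes for modeling.
-- Dataset and column names are illustrative only.
SELECT
  h.household_id,
  h.state,
  p.assessed_value AS home_value
FROM `analytics.households` AS h
LEFT JOIN `analytics.properties` AS p
  ON h.household_id = p.household_id
WHERE h.state = 'CA';
```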
Exploratory Data Analysis (EDA)
One of the key functions of data science at Windfall is understanding how different data points, such as real estate ownership, should be used as inputs to our net worth models. For this type of work, we use Jupyter notebooks with R kernel support for exploratory analysis, and R packages such as ggplot2 and plotly for visualization. Some of the benefits of Jupyter notebooks are that they support multiple languages in addition to Python, provide a convenient web interface, and can be saved and rendered in GitHub. As the team expands, we’ll set up JupyterHub for collaborating on notebooks.
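Before pulling data into a notebook, a first-pass EDA query might check how a candidate data point varies across households; the table and column names here are hypothetical:

```sql
-- Hypothetical EDA query: summarize households by a candidate feature
-- (real estate ownership) before deeper analysis in a notebook.
SELECT
  owns_real_estate,
  COUNT(*) AS households,
  AVG(home_value) AS avg_home_value
FROM `analytics.households`
GROUP BY owns_real_estate;
```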
Model Production
Since Windfall is a small team, data scientists are expected to put models into production. We use Google Cloud’s managed Dataflow service for production jobs, and author those jobs in Java with Apache Beam. To translate between models trained in R and models deployed in Java, we use the Predictive Model Markup Language (PMML), which lets model specifications be passed between languages.
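To illustrate the hand-off, a model trained in R can be exported (for example, with the pmml R package) as an XML document that a JVM evaluator such as JPMML can load. The fragment below is a toy single-feature regression for illustration only, not our actual net worth model:

```xml
<!-- Hypothetical PMML fragment: a toy regression, illustrative only. -->
<PMML xmlns="http://www.dmg.org/PMML-4_3" version="4.3">
  <Header description="Toy net worth regression for illustration"/>
  <DataDictionary numberOfFields="2">
    <DataField name="home_value" optype="continuous" dataType="double"/>
    <DataField name="net_worth" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="toy_net_worth_model" functionName="regression">
    <MiningSchema>
      <MiningField name="home_value"/>
      <MiningField name="net_worth" usageType="target"/>
    </MiningSchema>
    <!-- Predicted net_worth = intercept + coefficient * home_value -->
    <RegressionTable intercept="50000.0">
      <NumericPredictor name="home_value" coefficient="1.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because the specification is language-neutral, the same file can be produced by R training code and consumed inside a Java Beam pipeline without reimplementing the model.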
Dashboarding
While most visualization work by our data scientists is done in scripting languages, the team often needs to create automated reports that can be shared with the whole company. For example, we use dashboards to track model performance over time, and make these reports available to everyone at Windfall Data. Since most of our data is in BigQuery, we use Google Data Studio for automated reporting.
Reports
Being able to communicate research progress and results is an important aspect of data science at any organization. At Windfall we use Google Docs for report write-ups, because it provides a collaborative environment for writing long-form reports, and its commenting tool is great for giving feedback to the team. We considered using Jupyter notebooks for reports, but found that reports were more effective when they focused on results and summarized the technical details.
Version Control
We use GitHub for version control across our data science and engineering teams. For research work, data scientists save and share Jupyter notebooks, which enables collaboration on modeling projects. For production work, data scientists submit pull requests that are reviewed by members of the engineering team. This is the first team I’ve worked on with a strong culture of using version control for both research and production.
Knowledge Repo
One of the challenges we’ll face as the team and the rest of the company grow is finding past analyses and experiment results. We already use Confluence for documentation, such as schema definitions, and have created a space for a knowledge repository. We use this space to archive past reports and make them searchable by adding summaries, tags, and thumbnails. As we scale the team, we’ll look at tools such as Airbnb’s Knowledge Repo for this task.
Data Wrangling
Any data science team needs to do a bit of data wrangling: moving data between stores, cleaning up data sets, and creating aggregate tables. We use Google Sheets for some data munging tasks, since sheets can be mounted as external tables in BigQuery. The primary tool we use for automating ETL work is the BigQuery command line tool (bq). As the team grows, we’ll look at setting up more tooling for ETL, such as Apache Airflow.
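For example, an aggregate table (which could then be scheduled via the bq command line tool) might be built with a statement like the following; the table and column names are hypothetical:

```sql
-- Hypothetical aggregate table: household counts and average
-- home value by state, rebuilt on each ETL run.
CREATE OR REPLACE TABLE `analytics.state_summary` AS
SELECT
  state,
  COUNT(*) AS households,
  AVG(home_value) AS avg_home_value
FROM `analytics.households`
GROUP BY state;
```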
Expanding our Toolset
As we scale up our analyses, we’ll need additional tools to accomplish data science tasks at Windfall. One area where we haven’t yet set up much infrastructure is Spark, which would be useful for EDA tasks on huge data sets. Other areas for improvement, mentioned above, include using JupyterHub for direct collaboration on notebooks and using Apache Airflow to automate and maintain ETL tasks. We’re also looking to set up tools for data governance, and are hiring a data scientist who will help build our data science stack.