API Data Ingestion at Glassdoor

Pras Srinivasan
Glassdoor Engineering Blog
Oct 9, 2021 · 5 min read

Data is crucial to the ongoing operations of Glassdoor. We use insights from hundreds of pipelines, refreshing thousands of datasets from all kinds of sources, all orchestrated through Apache Airflow. As a result, there are some natural obstacles in the management of so many disparate APIs. Fortunately, Singer.io helps us consolidate and manage many moving pieces with a single uniform solution that smooths out the data ingestion process and ensures a manageable workflow for our data teams.

As data engineering teams know, data extraction from APIs poses a few challenges: a lack of uniformity in notifications about schema changes, broken data, availability problems, and version control issues. Differences between integrations built by multiple developers compound this further. These challenges can make debugging a failing API integration a delicate art. Singer alleviates the complexity by abstracting the data ingestion process: it normalizes and formats API responses so that Glassdoor engineers can operate confidently in the face of unexpected changes in incoming APIs, even across unfamiliar data pipelines.

Singer is an excellent fit for our ingestion process and the evolution of our coding standards because it:

  • Abstracts layers of the ETL process (through Taps and Targets)
  • Is cloud agnostic
  • Integrates into the Airflow ecosystem
  • Standardizes data output (as JSON)
  • Highlights errors through extensive logging

Taps and Targets

Singer abstracts data ingestion using Taps (data extraction scripts) and Targets (data loading scripts). A Tap represents a single configurable unit within the overall ingestion environment. Taps typically wrap each API response in a JSON record with dedicated “type”, “stream”, and “record” fields, the last of which holds the data associated with the Tap.

Note that the payload is wrapped within the “record” key, which leaves the value in that key:value pair free to change and shift over time. With this structure, schema changes do not break the ETL process during data extraction; the responsibility for understanding schema changes shifts from extraction to transformation, allowing a more reliable inflow of data for downstream operations.
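For example, a Tap extracting user data might emit a record message like the one below. The three-key envelope is defined by the Singer spec; the stream name and fields are illustrative:

    {
      "type": "RECORD",
      "stream": "users",
      "record": {"id": 42, "email": "jane@example.com", "updated_at": "2021-10-08T12:00:00Z"}
    }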

Taps typically require:

  1. A discover command that shows all the object schemas that the tap can extract
  2. A config file containing everything the tap needs to connect to the API successfully (see the sketch after this list)
  3. A properties file that dictates which object to pull and the schema of that object
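To make this concrete, here is a minimal sketch of a config file for a hypothetical tap-exampletap. Field names vary from tap to tap, though start_date and user_agent are common Singer conventions:

    $ cat config.json   # hypothetical config for tap-exampletap
    {
      "api_key": "<redacted>",
      "start_date": "2021-10-01T00:00:00Z",
      "user_agent": "data-eng@example.com"
    }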

Taps typically output:

  1. Ingestion (pagination) state
  2. A bookmark that lets ingestion resume cleanly after API issues or restarts (see the example after this list)
  3. Logging, including any failure notifications the developer built into the tap
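State and bookmarks are emitted as STATE messages. The envelope below follows the Singer spec; the nested bookmark structure is a common convention rather than a requirement:

    {
      "type": "STATE",
      "value": {"bookmarks": {"users": {"updated_at": "2021-10-08T12:00:00Z"}}}
    }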

Targets are prebuilt frameworks for persisting data that Taps extract. For example, target-csv allows data to be persisted as a CSV file in the local file system.
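In the simplest case, a Tap's stdout is piped straight into a Target (tap-exampletap is a placeholder name; target-csv is the real Singer target mentioned above):

    # Extract from the API and persist the records as a local CSV file
    tap-exampletap --config config.json | target-csv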

Understanding a typical Singer Workflow

Singer is integrated into Glassdoor’s Airflow ecosystem through a single operator. The operator allows developers to specify things like Python version information, file locations, and bash commands. A typical Singer task runs a Tap’s bash command for an object, creates a JSON file with the data, and moves the file into S3. Once in S3, downstream tasks perform the actual data processing.

Here is a sketch of such a tap bash command (the tap name, file names, and S3 path are placeholders):
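    # Run the tap for the selected object, capturing its JSON output
    tap-exampletap --config config.json --properties properties.json --state state.json > users.json

    # Move the output into S3 for downstream processing
    aws s3 cp users.json s3://example-bucket/ingest/users/users.json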

Schema discovery

Singer’s Discover feature detects schema changes by building the catalog either from the API itself or from JSON schemas built and shipped within the Tap. Running a discover command generates the schema/catalog for the API. Downstream tasks, within or outside of the operator, can then use the catalog to filter for specific objects and pull their data from the API.

At Glassdoor, we run schema discovery commands regularly. This allows us to generate a robust log of schema changes over time that can be referenced for future troubleshooting.

Here is an example of how to run the discover command (tap-exampletap is a placeholder name):
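    # Generate the catalog of streams and schemas the tap can extract
    tap-exampletap --config config.json --discover > catalog.json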

Data filtering

Our Singer operator uses template files to filter fields from schemas. Template files can specify filters by exact field name or by regex, and support both inclusion and exclusion.

Here is a sketch of what a template file might look like; since the operator is internal to Glassdoor, the exact format shown is hypothetical:
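    $ cat users_template.json   # hypothetical template format
    {
      "streams": {
        "users": {
          "include": ["id", "email", "updated_.*"],
          "exclude": ["ssn", "internal_.*"]
        }
      }
    }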

And here is a sketch of leveraging the template, expressed as a plain shell pipeline (singer-filter stands in for the operator’s internal filtering step and is not a real tool):
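    # Hypothetical: apply the template's include/exclude rules to the tap's output
    tap-exampletap --config config.json --properties properties.json \
      | singer-filter --template users_template.json \
      | target-csv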

Filtered data is piped directly as JSON to storage. The operator can use any of Singer’s targets for this purpose.

Managing multiple Taps

Different taps sometimes require conflicting versions of the same library. Adopting virtual Python environments lets multiple taps operate simultaneously within Airflow with no conflicts, and Airflow’s Connections feature can be used to store each environment’s information, as sketched below.
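Singer’s own documentation recommends exactly this pattern: one virtual environment per tap or target, each invoked by absolute path. A minimal sketch with placeholder names:

    # Each tap/target gets its own virtualenv, so their dependencies never collide
    python3 -m venv ~/.virtualenvs/tap-exampletap
    ~/.virtualenvs/tap-exampletap/bin/pip install tap-exampletap

    python3 -m venv ~/.virtualenvs/target-csv
    ~/.virtualenvs/target-csv/bin/pip install target-csv

    # Invoke each by absolute path; no activation step is needed
    ~/.virtualenvs/tap-exampletap/bin/tap-exampletap --config config.json \
      | ~/.virtualenvs/target-csv/bin/target-csv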

Singer and Airflow’s great chemistry extends to (but is not limited to):

  • Using Airflow’s Jinja templating in Singer’s config files (e.g. to specify event start and end times for data processing; see the sketch after this list)
  • Storing the credentials Singer needs for API authentication, encrypted, within Airflow’s Connections
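For instance, a config file can be rendered by Airflow before the tap runs. In the sketch below, the field names and the example_api connection are hypothetical; {{ ds }} and {{ next_ds }} are standard Airflow macros, and the conn template variable pulls credentials from Airflow Connections in newer Airflow releases:

    $ cat config.json   # hypothetical Jinja-templated Singer config
    {
      "api_key": "{{ conn.example_api.password }}",
      "start_date": "{{ ds }}",
      "end_date": "{{ next_ds }}"
    }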

Conclusion

The adoption of Singer has accelerated API data ingestion, improved ingestion standards, and made our on-call support easier through consistent logging. Glassdoor adoption started within our marketing organization, where over ten new sources were integrated within a few months. Since then, Singer has quickly gained traction across our org.

We are excited to see Singer evolve and help other open source practitioners. If you’re interested in learning more about Singer, check out its excellent documentation, and if you’d like to learn more about Airflow, see the Apache Airflow documentation. Leave a comment if you use Singer and share with us how you use it!

Authors: Chris Fiegel, Pras Srinivasan, Fro Umel
