Extracting Salesforce Data at Octane

--

The Data Engineering team at Octane is responsible for providing the infrastructure and tools to take the data generated by our business operations, systems, and services and make that data discoverable, accessible, and insightful. As of summer 2021, the team comprises five engineers and one data analyst supported by a product manager and an engineering manager.

At Octane, the fast growth of our lending business meant that our lending and servicing operations teams needed additional workflows to do their jobs effectively, so we decided to use Salesforce creatively to support some of those internal workflows. With this new tool in play, however, it soon became necessary to integrate the data generated by our sales and operations teams inside Salesforce with our data lake, so that we could produce complete and accurate insights and reporting for stakeholders within and outside Octane.

This past spring, the Data Engineering team began brainstorming the best way to accomplish this goal. In this blog post, we want to share our findings on the various options available for Salesforce integration in 2021, our approach to evaluating those options, and our collaborative decision process. This is part one of a two-part series. The second part will dive into what we learned and the challenges we encountered while integrating Salesforce data with the approach we chose.

Data Pipelines Overview

Like all other engineering teams at Octane, the Data Engineering team operates mostly in the Amazon Web Services (AWS) environment. We own a number of nightly batch data pipelines, including ones for MySQL, PostgreSQL, and Salesforce, which extract operational data from stores on and off AWS.

Our current pipeline architecture roughly follows the Extract, Load, Transform (ELT) pattern. We run batch processes to “extract” data from our various sources, “load” it into a staging area, and then “transform” the data for usability in our data lakehouse. After that, various data products are built on top. These data products are persisted in Parquet format on S3 and made available to our community of internal users via AWS Athena, Redshift Spectrum, or directly on S3, all using AWS Glue as the underlying metastore.
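
To make that pattern concrete, here is a minimal sketch of the “load” and query steps using the open source AWS Data Wrangler (awswrangler) library. The bucket, database, and table names are placeholders, and the library choice is purely illustrative rather than a description of our actual pipeline code.

```python
import awswrangler as wr
import pandas as pd

# A hypothetical transformed data product (all names and paths are placeholders).
df = pd.DataFrame({
    "application_id": [101, 102],
    "status": ["approved", "pending"],
})

# Persist the data product as Parquet on S3 and register it in the Glue metastore,
# which makes it queryable from Athena and Redshift Spectrum.
wr.s3.to_parquet(
    df=df,
    path="s3://example-data-lake/products/applications/",
    dataset=True,
    database="analytics",   # Glue database (placeholder)
    table="applications",   # Glue table (placeholder)
    mode="overwrite",
)

# Internal users can then query the table via Athena.
print(wr.athena.read_sql_query(
    "SELECT status, COUNT(*) AS n FROM applications GROUP BY status",
    database="analytics",
))
```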

With Salesforce, we didn’t have direct access to the underlying Salesforce Oracle databases, so we had to rely on tools made available by Salesforce itself or by third parties. Salesforce as a new data source didn’t fit our established pattern of extraction into our staging area. Our challenge was to select the approach that best met the following need: make raw and processed Salesforce data available in our data lake in a timely, scalable, and maintainable fashion, preferably without significantly affecting our current ELT architecture.

Decisions, Decisions

To solve this challenge, the team evaluated multiple approaches to ingesting Salesforce data, from consuming the Salesforce Bulk REST API directly to fully managed paid solutions like Stitch. After some research, we narrowed the options down to a short list:

  • Direct Salesforce Bulk API consumption
  • AWS AppFlow
  • Meltano
  • Stitch

Considerations that eliminated other options included the unavailability of S3 as a “target,” overly complex workflows, and a lack of publicly available documentation. Notably, Airbyte was a strong candidate but unfortunately didn’t have S3 as a target at the time (Airbyte has since added an S3 connector).

We further evaluated these options and discussed potential architectures for each, considering the effort that would be involved to implement the solution and, more importantly, to maintain it long-term within our existing data infrastructure and environment. Below are key summaries of each option we considered and the final selection we chose and implemented.

Direct Salesforce API Consumption

The default “bare” solution was to query the Salesforce Bulk API directly. Here, the two suboptions were 1) sticking as closely as possible to our existing extraction architecture and simply performing data pulls from ephemeral EC2 instances, or 2) adopting a more serverless approach with Step Functions. The team was already familiar with the first option, so we focused our evaluation on the serverless approach. Figure 1 shows an approximate proposed solution with this approach.

Figure 1. Direct Salesforce Bulk API consumption approach, a serverless architecture.

Regardless of the suboption, with direct Salesforce API consumption the team also had to account for the additional upfront engineering work needed to reliably consume gigabytes of data per night, keeping parallelization, Salesforce throttling, and Salesforce-specific bulk chunking rules in mind, all within our project delivery timeline.
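
For context, the sketch below shows roughly what direct consumption of the Bulk API 2.0 looks like: create a query job, poll until it completes, then page through the CSV results. It assumes an OAuth access token has already been obtained, and the instance URL, API version, SOQL query, and the process_chunk helper are all placeholders.

```python
import time
import requests

# Assumptions: an OAuth access token and instance URL were obtained elsewhere;
# the API version and SOQL query below are illustrative placeholders.
INSTANCE_URL = "https://example.my.salesforce.com"
ACCESS_TOKEN = "..."  # obtained via your OAuth flow
API = f"{INSTANCE_URL}/services/data/v52.0"
HEADERS = {"Authorization": f"Bearer {ACCESS_TOKEN}", "Content-Type": "application/json"}

# 1. Create a Bulk API 2.0 query job.
job = requests.post(
    f"{API}/jobs/query",
    json={"operation": "query", "query": "SELECT Id, Name FROM Account"},
    headers=HEADERS,
).json()

# 2. Poll until Salesforce finishes processing the job.
while True:
    state = requests.get(f"{API}/jobs/query/{job['id']}", headers=HEADERS).json()["state"]
    if state == "JobComplete":
        break
    if state in ("Aborted", "Failed"):
        raise RuntimeError(f"Bulk query job ended in state {state}")
    time.sleep(10)

# 3. Page through the CSV results using the Sforce-Locator header.
locator = None
while locator != "null":
    params = {"maxRecords": 50000}
    if locator:
        params["locator"] = locator
    resp = requests.get(f"{API}/jobs/query/{job['id']}/results", headers=HEADERS, params=params)
    locator = resp.headers.get("Sforce-Locator")
    process_chunk(resp.text)  # hypothetical helper, e.g., write the CSV chunk to the S3 staging area
```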

Stitch

Stitch is a fully “hands-off” SaaS solution that lets you set up scheduled pipelines with a few button clicks in its UI. Stitch extracts from more than 100 different sources, including Salesforce and Google Analytics, and loads to supported destinations, including S3 and all major cloud data warehouses.

We were able to set up a full, scheduled, repeatable Salesforce data extraction pipeline to an arbitrary Octane S3 bucket in the production account in under 30 minutes.

The price of that convenience starts at $1,000 per year for up to 5M extracted rows per month and increases with row volume in absolute terms (though the cost per row decreases), for example, 50M rows per month for $5,000 per year. Additional perks are unlocked at the “Enterprise” level, or above 5 registered users, at a presumably higher price.

Stitch builds its connectors on open source Singer and touts that extensibility and openness. Stitch is also the official sponsor of Singer, and it is mostly Stitch engineers who contribute to it. Singer is the primary open source library around which other SaaS ETL startups and ETL tools wrap their own connectors and targets.
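
For readers unfamiliar with Singer, a tap is simply a program that writes newline-delimited JSON messages to stdout, which a target then reads from stdin. The sketch below shows the three core message types (SCHEMA, RECORD, STATE); the stream, fields, and bookmark values are illustrative only.

```python
import json
import sys

# Illustrative Singer messages as a tap would emit them (field values are placeholders).
messages = [
    # Describes the shape of the "Account" stream and its primary key.
    {"type": "SCHEMA", "stream": "Account",
     "schema": {"properties": {"Id": {"type": "string"},
                               "Name": {"type": ["null", "string"]}}},
     "key_properties": ["Id"]},
    # One extracted row.
    {"type": "RECORD", "stream": "Account",
     "record": {"Id": "0015e00000AAAAA", "Name": "Example Customer"}},
    # Incremental state (a bookmark) that the next run can resume from.
    {"type": "STATE",
     "value": {"bookmarks": {"Account": {"SystemModstamp": "2021-06-01T00:00:00Z"}}}},
]

for message in messages:
    sys.stdout.write(json.dumps(message) + "\n")
```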

The downsides we considered included security and privacy; the challenge of integrating Stitch within our current overall pipeline architecture, especially notifications of completion; and the closed-source, opaque nature of the process, among others. Given the nature of our business, we must be SOC 2 compliant. While Stitch does claim SOC 2 compliance with regard to customers’ data, that still meant additional annual auditing by our Security team, adding to the annual cost basis of this solution.

AWS AppFlow

AWS provides Salesforce integration through AppFlow. This integration is similar to Stitch but with fewer features. The upside is obviously that it’s managed by AWS in the same account, so the concerns raised about Stitch regarding extra auditing load on the Security team don’t apply. A walk-through blog post for this service is available here. Some notable characteristics of AppFlow include:

  • UI-driven configuration, with a limited API also available (see the sketch after this list).
  • No Terraform support yet, but there is CloudFormation support.
  • Manual creation of a “flow” (integration) for each object.
  • No support for incremental extracts, as far as we could tell.
  • Very limited configurability (e.g., destination S3 keyspace naming; see the next bullet), and it doesn’t seem to play well with the Glue crawler or Athena.
  • Non-configurable S3 keyspaces. AppFlow injects its own keyspace (the flow execution ID) inside a daily keyspace. Combined with the lack of incremental extracts, there appears to be no compelling reason for the daily S3 keyspaces.
  • All data type metadata is discarded; everything is encoded as a STRING.
  • It is an easy, fully managed service; a full raw Salesforce extract can be set up through the UI within hours.
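
As a rough illustration of the limited API mentioned above, the sketch below triggers an existing flow and checks its recent executions using boto3. It assumes a flow has already been created (via the console or CloudFormation); the flow name is a placeholder.

```python
import boto3

appflow = boto3.client("appflow")

FLOW_NAME = "salesforce-account-to-s3"  # placeholder; the flow must already exist

# Trigger an on-demand run of the flow.
run = appflow.start_flow(flowName=FLOW_NAME)
print("Started execution:", run["executionId"])

# Inspect the most recent executions of the flow.
records = appflow.describe_flow_execution_records(flowName=FLOW_NAME, maxResults=5)
for execution in records["flowExecutions"]:
    print(execution["executionId"], execution["executionStatus"])
```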

A viable serverless, event-based extraction architecture with AppFlow is shown in Figure 2.

Figure 2. Salesforce staging architecture with AWS AppFlow.

The downsides of using AppFlow were similar to Stitch’s, albeit much less pronounced. Since all of our infrastructure is already on AWS, integration within our larger data pipeline architecture was trivial, and no additional auditing burden fell on our Security team. However, AppFlow is still an opaque, closed-source solution not under our control, which is always a consideration with managed solutions. In addition, there were other minor annoyances, such as the lack of control over the exact S3 target keyspace (AppFlow always appends an additional autogenerated keyspace “directory” to the end of the specified target keyspace).

Meltano

Meltano is an open source CLI tool that, like Stitch, wraps around Singer connectors to expose its own “extractor” and “loader” interfaces, but in a much more pleasant way than using Singer configurations directly. Unlike Stitch, Meltano is not a SaaS offering but a CLI tool to be run on your own infrastructure. Some key considerations are noted below:

  • Meltano provides intuitive, easy CLI-based management of Singer taps and targets and, more importantly, its own simple YAML-based configuration that wraps the uglier Singer configuration requirements.
  • Meltano supports containerization for production deployments.
  • Meltano can be run as part of a Docker container on EC2 instances that we spin up during extraction pipelines.
  • It’s open source and free, with development bankrolled by GitLab, which spun it off as an open source project.
  • Exposing Meltano as a library dependency in addition to just being a CLI tool is on the development team’s radar.
  • Unlike Stitch, using Meltano allows for full control over pipeline architecture and infrastructure.
  • Since Meltano just uses locally stored Salesforce authentication configuration, no Salesforce access is granted to any third party.

However:

  • Full control means full responsibility for a correct implementation.
  • It’s a CLI, so calls to it need to be shelled out from Python as part of extraction pipeline execution, with stdout and stderr captured and fed back into our logging (a minimal sketch follows this list).
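
Below is a minimal sketch of that shelling-out step, assuming a Meltano project with the tap-salesforce extractor and an S3 loader already configured; the loader name, job id, and project path are placeholders.

```python
import subprocess

# Run a Meltano ELT pipeline from a Python-driven pipeline step.
# Extractor/loader names, job id, and project path are placeholders.
result = subprocess.run(
    ["meltano", "elt", "tap-salesforce", "target-s3-csv", "--job_id=salesforce-nightly"],
    cwd="/opt/meltano/project",
    capture_output=True,
    text=True,
)

# Feed stdout/stderr back into the pipeline's logging and fail loudly on errors.
print(result.stdout)
if result.returncode != 0:
    raise RuntimeError(f"Meltano ELT failed:\n{result.stderr}")
```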

Figure 3 shows the architecture we considered with Meltano, which was very similar to the architecture for direct Salesforce API consumption on EC2 instances (in light of our already existing ELT architecture).

Figure 3. Proposed staging architecture with Meltano.

Final Decision

In the end, after careful consideration and a cross-team deliberation process, the team voted to implement the architecture proposed in Figure 3 and to use Meltano as a tool to extract Salesforce data.

The factors that played a key role in the decision were:

  • Meltano is an open-source project, which allows for full control and full visibility into how our data moves.
  • The EC2-based architecture fits closest with our current framework and requires little architecture alteration.
  • The Meltano tap-salesforce connector abstracts low-level interactions with the Salesforce API and makes extract job configuration easier than consuming the API directly.
  • Meltano is built on the Singer spec, supporting the ingestion of other data sources in the future should the need arise.

Since then, we have implemented and released integration of Salesforce data into our data lake using Meltano. Be on the lookout for our follow-up post on our experience implementing Meltano into our architecture!
