How to integrate Salesforce data into your Data Warehouse using Scala

A guided walk through the Salesforce REST API, by a Scala beginner 👣

Wesley Batista
Feb 6, 2018

In this post I want to show you how to export Salesforce data using Scala.

The motivation to write this article was that writing my first Python script to interact with the Salesforce APIs wasn’t that easy. The reasons could be:

  • only the Bulk API and the SOAP API seemed to be the right choices for that use case, and I had never done anything with that SOAP thing (ewww) (or “Bulk” what???);
  • the fact that I wasn’t an experienced developer;
  • my code design skills weren’t that good either.

But leaving the past aside, it happens that recently I was asked to build another integration. This time, instead of upserting data into Salesforce, we needed to retrieve it from there into our Data Warehouse.

As I am also trying to learn more about Scala, and practice is everything, I thought 🤔 “why not?”. It totally made sense to me :)

Going forward, here is a summary of what we’ll cover in this article:

  1. Authentication
  2. Retrieving results/pagination
  3. JSON parsing with Gson (oops!)
  4. Data dumping

In order to proceed we’ll need:

  • sbt installed
  • A “Connected App” created in Salesforce
  • Credentials to access Salesforce

Getting access to the data

Ask for a key 🔑 to request things

We know that one of the most problematic parts of building an integration is the early stage, when you need to see the data and understand it. So I think we can agree that handling access to the data is one of the hardest things to deal with when building an ETL.

That’s why I learned to keep it simple while I am just getting started. So let’s use curl to check if we are able to access the API. Here we are assuming that you have already created the “Connected App” as described here, and already have the CLIENT_ID and CLIENT_SECRET at hand.

An interesting thing about the Salesforce login:

Salesforce provides you with a 32-character password, and you use only the first 8 characters to log in through the Web UI. But when authenticating through the API you need to use the full 32-character password. I don’t remember very well, but I think they inform you about that when the user is created. Either way, it’s a tricky thing to keep in mind and hard to remember if you are not working with the platform on a day-to-day basis.

All that said, run the following code in your command line.
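Here is a sketch of that request, assuming the standard OAuth 2.0 username-password flow (replace the placeholder variables with your own values, and remember to use the full API password discussed above):

curl https://login.salesforce.com/services/oauth2/token \
  -d "grant_type=password" \
  -d "client_id=$CLIENT_ID" \
  -d "client_secret=$CLIENT_SECRET" \
  -d "username=$USERNAME" \
  -d "password=$PASSWORD"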

It should print out a Salesforce access token.

Starting the project

As a programmer coming from Python, one of the first things I quickly got was that with Scala everything starts with sbt. It’s everywhere. So let’s create a fresh new project, using the scala-seed template, as mentioned in the official sbt documentation.

sbt new sbt/scala-seed.g8

This template comes with a “Hello World” example. But here we are one step beyond that, so we’ll ignore the content from the template and just preserve the folder structure. In the next section we’ll dig deeper into the code and logic.

As this is a very technical article, get prepared to read code. Below I’ve embedded the main files, but you can also find them all in the GitHub project.

Understanding the project files

At first, take a look at the folder/file structure, so you can have a better idea of how we’ll proceed.

I recommend starting from here (the salesforce.conf file) so we can already put a check on the Authorization step.

salesforce.conf

Copy /src/main/resources/salesforce.sample.conf as /src/main/resources/salesforce.conf and replace the sample values with real ones.
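For reference, a minimal salesforce.conf could look like the sketch below (the key names here are illustrative assumptions; match whatever the sample file actually uses):

salesforce {
  clientId = "YOUR_CLIENT_ID"
  clientSecret = "YOUR_CLIENT_SECRET"
  username = "your.user@example.com"
  password = "YOUR_FULL_32_CHAR_PASSWORD"
}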

ObjectExporter.scala

It serves only as the entry point to our code.
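To give an idea of its shape, a minimal entry point could look like this (the SObject constructor and the dumpToFile method are assumptions based on how we run it later, not necessarily the repo’s exact API):

package com.example.salesforce

// Hypothetical sketch: read the object name and the output path
// from the command-line arguments, then delegate to SObject.
object ObjectExporter extends App {
  val objectName = args(0) // e.g. "Account"
  val outputFile = args(1) // e.g. "Account.json"
  new SObject(objectName).dumpToFile(outputFile)
}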

SObject.scala

This is the main file here. It has all the logic for the API requests, response parsing and file output. I added comments at the most important parts describing what we are doing :)
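The pagination part deserves a note: the Salesforce query endpoint returns records in chunks, with a done flag and a nextRecordsUrl field pointing to the next chunk. A rough sketch of how the retrieval loop can work (the get helper, instanceUrl, and the QueryResponse case class are assumptions, the latter described below):

// Hypothetical sketch of paginated retrieval.
// `get` is assumed to perform an authorized HTTP GET and return the body.
def retrieveRecords(soql: String): Seq[String] = {
  var response = gson.fromJson(
    get(s"$instanceUrl/services/data/v41.0/query?q=$soql"),
    classOf[QueryResponse])
  var records = response.records.map(_.toString).toSeq
  // Keep following nextRecordsUrl until Salesforce says we are done
  while (!response.done) {
    response = gson.fromJson(
      get(s"$instanceUrl${response.nextRecordsUrl}"),
      classOf[QueryResponse])
    records = records ++ response.records.map(_.toString)
  }
  records
}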

Util.scala

Here the code is almost the same as the one provided in the documentation example. It’s a class that has only one method, called getAccessToken, which requests a token so we can make authorized requests to the API.
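As an illustration, such a method can look roughly like the sketch below, here using the scalaj-http library (which may not be the exact HTTP library used in the repo) and Gson to pull the token out of the response:

import com.google.gson.JsonParser
import scalaj.http.Http

// Hypothetical sketch: exchange credentials for an access token
// via the OAuth 2.0 username-password flow.
def getAccessToken(clientId: String, clientSecret: String,
                   username: String, password: String): String = {
  val response = Http("https://login.salesforce.com/services/oauth2/token")
    .postForm(Seq(
      "grant_type"    -> "password",
      "client_id"     -> clientId,
      "client_secret" -> clientSecret,
      "username"      -> username,
      "password"      -> password))
    .asString
  // The response body is JSON; the token lives in the "access_token" field
  new JsonParser().parse(response.body)
    .getAsJsonObject.get("access_token").getAsString
}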

ResponseTypes.scala

This file has the case classes needed to handle the API responses using the Gson library.
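Based on the shape of the Salesforce REST responses, these case classes could look roughly like this (trimmed to the fields we actually use; the real file may carry more):

import com.google.gson.JsonObject

// Hypothetical case classes mirroring the JSON that Salesforce returns.
// Gson fills them in by matching the JSON field names.
case class QueryResponse(totalSize: Int, done: Boolean,
                         nextRecordsUrl: String, records: Array[JsonObject])
case class Field(name: String, `type`: String)
case class DescribeResponse(name: String, fields: Array[Field])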

Build and Running the script

To build, I usually use the sbt-assembly plugin; that way we get all the dependencies packaged into one single .jar file.

Follow the plugin setup instructions, reload/restart sbt shell as needed, then, inside the sbt shell run assembly.
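The setup boils down to one line in project/plugins.sbt (check the plugin’s README for the current version; the one below is just an example from around the time of writing):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")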

After the build, inside the SBT shell you can run:

runMain com.example.salesforce.ObjectExporter Account Account.json

Or in the command line:

java -cp /absolute/path/to/assembly.jar com.example.salesforce.ObjectExporter Account Account.json

While in the sbt shell it will print some debug messages, and it will also print some output because we have a few println calls in the retrieveRecords method, in the SObject.scala file.

The script may take a while, depending on the internet connection and the size of the data on Salesforce. At the end, it will print a success message and we’ll have the output file written.

Finishing

Our approach here is not the standard ETL (Extract, Transform and Load) way, but rather ELT. We do not have a “transform” step; it is just not needed for our purpose. We opted to manipulate the data from inside the data warehouse.

Also, we will not cover the “load” part in this article because it can vary depending on your choice of database or data lake implementation.

The output of our script is a standard newline-delimited JSON (NDJSON) file, so you should have no big problems with that.
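For illustration, each line of the output file is one self-contained JSON record, like this made-up Account sample:

{"Id": "0015800000Aaaaa", "Name": "Acme Corp", "AnnualRevenue": 1000000.0}
{"Id": "0015800000Bbbbb", "Name": "Globex", "AnnualRevenue": 250000.0}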

But one thing that can be daunting is mapping the Salesforce types to your database types.

If you’ve reached this point, the following code will help you with the schema creation.

The way we are doing it is neither clever nor sophisticated, but it meets the needs at a certain level ¯\_(ツ)_/¯

def schema: Array[String] = {
  // Ask Salesforce to describe the object, then parse the response with Gson
  val response = describe
  val describeResponse = gson.fromJson(response, classOf[DescribeResponse])
  // Build a "fieldName: salesforceType" line for every field
  describeResponse.fields.map(x => s"${x.name}: ${x.`type`}")
}

def dumpSchema = {
  // Translate the Salesforce field types into database-friendly ones
  val result = schema.mkString("\n")
    .replace(": datetime", ": STRING")
    .replace(": date", ": STRING")
    .replace(": reference", ": STRING")
    .replace(": picklist", ": STRING")
    .replace(": textarea", ": STRING")
    .replace(": string", ": STRING")
    .replace(": phone", ": STRING")
    .replace(": email", ": STRING")
    .replace(": id", ": STRING")
    .replace(": url", ": STRING")
    .replace(": multipicklist", ": STRING")
    .replace(": address", ": RECORD")
    .replace(": int", ": INTEGER")
    .replace(": currency", ": FLOAT")
    .replace(": double", ": FLOAT")
    .replace(": percent", ": FLOAT")
    .replace(": boolean", ": BOOLEAN")
  println(result)
}

Now all we need to do is call the dumpSchema method we just defined to see the field type mapping.
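Assuming the Account object, the output would look something like this (illustrative, not exhaustive):

Id: STRING
Name: STRING
CreatedDate: STRING
NumberOfEmployees: INTEGER
AnnualRevenue: FLOAT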
