Salesforce to Redshift schema translation using Python
As businesses become more data-driven, there is an increasing need to collect and analyze large amounts of data from on-premises and cloud applications. A common requirement is to extract Customer Relationship Management (CRM) data from Salesforce and load it into the Amazon Web Services (AWS) Redshift data warehouse.
There are various methods for loading Salesforce data into Amazon Redshift. One approach uses an extract/load/transform (ELT) process: the Bulk API extracts data from Salesforce to Amazon S3, and a Redshift table is then created from the data stored in S3. During this process, you may run into trouble ingesting Salesforce data types as-is, because Redshift does not support most of them. This article shows how to work with Salesforce data types and convert them to their Redshift equivalents.
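For orientation, the extract-and-stage half of that ELT flow might look like the sketch below. This is a minimal illustration, assuming the simple_salesforce library (introduced in Step 1), the boto3 library, and hypothetical bucket and key names.

import csv
import io

import boto3
from simple_salesforce import Salesforce

sf = Salesforce(
    username='salesforce username',
    password='password',
    security_token='token')

# Extract Account records through the Bulk API.
records = sf.bulk.Account.query("SELECT Id, Name, Phone FROM Account")

# Serialize the records to CSV in memory.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=['Id', 'Name', 'Phone'])
writer.writeheader()
for record in records:
    record.pop('attributes', None)  # drop Salesforce record metadata, if present
    writer.writerow(record)

# Stage the CSV in S3 for a later Redshift COPY (bucket/key are placeholders).
boto3.client('s3').put_object(Bucket='my-etl-bucket',
                              Key='salesforce/account.csv',
                              Body=buffer.getvalue())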
Salesforce Data Types
Each field in Salesforce has a specific data type. When you create these fields in Salesforce, you must select a data type such as String, TextArea, Number, Picklist, Id, Reference, URL, and so on. Due to incompatibility, we cannot insert most of the Salesforce data types as-is into Redshift or any other data warehouse.
To create a Redshift table with the relevant schema, you must manually map each of these data types to its Redshift equivalent. If you have many objects, some with more than 500 columns, this quickly becomes tedious.
Proposed Solution
The script below auto-creates the table DDL or a typecast query, depending on your requirement.
Prerequisites:
- Python simple_salesforce module
- Salesforce credentials with API access
Step 1: Setting up the environment
The first step is to install simple_salesforce. With the help of simple_salesforce, you can gain access to Salesforce and its REST API.
pip install simple_salesforce
Step 2: Establishing a connection to Salesforce
Create a new Python file and import simple_salesforce. You will also need your Salesforce credentials, as shown in the code below.
from simple_salesforce import Salesforce

sf = Salesforce(
    username='salesforce username',
    password='password',
    security_token='token')
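To confirm the connection works, you can run a quick SOQL query. The query below is only an illustration, using the standard Account object.

# Sanity check: fetch a few Account records via SOQL.
result = sf.query("SELECT Id, Name FROM Account LIMIT 5")
print(result['totalSize'], 'record(s) returned')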
Step 3: Evaluating Salesforce schema
To describe an object from Salesforce, use the .describe() method:

sf.{objectname}.describe()

This returns an ordered dictionary of attributes for the specified object, including a list of all fields and their attributes: the field name, field label, field type, and so on.
desc = sf.Account.describe()
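To get a feel for the response, you can print a few attributes of each field. The name, type, length, and precision keys used below are standard parts of the describe response.

# Inspect the first few fields returned by describe().
for field in desc['fields'][:5]:
    print(field['name'], field['type'], field['length'], field['precision'])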
Step 4: Mapping Salesforce data types to Redshift data types
You only need the fields list from the dictionary above. The following code generates the Redshift equivalent data type for each field.
field_names = [field['name'] for field in desc['fields']]
field_types = [field['type'] for field in desc['fields']]
field_length = [field['length'] for field in desc['fields']]
field_precision = [field['precision'] for field in desc['fields']]

newDict = {}
i = -1
# Based on the Salesforce data type, assign the corresponding Redshift/SQL/Athena data type.
for f in field_names:
    i += 1
    Redshift_datatype = ""
    if field_types[i] in ('phone', 'id') and field_length[i] > 0:
        Redshift_datatype = "varchar(" + str(field_length[i]) + ")"
    elif field_types[i] == 'int':
        if field_length[i] > 10:
            Redshift_datatype = "bigint"
        else:
            Redshift_datatype = "int"
    elif field_types[i] in ('string', 'textarea', 'reference', 'picklist',
                            'url', 'email', 'address', 'location'):
        if field_length[i] > 8000:
            Redshift_datatype = "varchar(max)"
        else:
            Redshift_datatype = "varchar(" + str(field_length[i]) + ")"
    elif field_types[i] == 'datetime':
        Redshift_datatype = "timestamp"
    else:
        Redshift_datatype = "varchar(" + str(field_length[i]) + ")"
    # Store the field name with its corresponding Redshift data type.
    newDict[f] = {f: Redshift_datatype}
The script demonstrates a few examples of how Salesforce data types map to their Redshift equivalents. You will need to map all the Salesforce data types your objects use, based on your requirements.
Similarly, to convert Salesforce data types to SQL or Athena equivalent data types, add variables like SQL_datatype or Athena_datatype and assign the corresponding data types to them, as sketched below.
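One way to keep multiple targets manageable is a small lookup table keyed by Salesforce type. The sketch below is an assumption layered on top of the original script: the mappings shown are examples only, and to_target_type is a hypothetical helper.

# Illustrative (not exhaustive) per-target type lookup.
TYPE_MAP = {
    'datetime': {'redshift': 'timestamp', 'athena': 'timestamp'},
    'boolean':  {'redshift': 'boolean',   'athena': 'boolean'},
    'double':   {'redshift': 'float8',    'athena': 'double'},
}

def to_target_type(sf_type, length, target='redshift'):
    # Fall back to varchar(length) for any unmapped Salesforce type.
    mapped = TYPE_MAP.get(sf_type, {}).get(target)
    return mapped if mapped else 'varchar({0})'.format(length)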
Step 5: Generating DDL
Now you have the newDict dictionary with each field name and its Redshift equivalent data type, as shown below.
newDict = {field_name1: {field_name1: Redshift_datatype},
           field_name2: {field_name2: Redshift_datatype},
           ...
           field_namen: {field_namen: Redshift_datatype}}
As demonstrated below, this dictionary can generate the DDL for the table, or build a cast query (sketched after the generated DDL) in which you cast string columns to their Redshift equivalent data types.
TABLE_NAME = 'Account'
print('CREATE TABLE {0} ('.format(TABLE_NAME))
print(',\n'.join('{0} {1}'.format(column, datatype[column])
                 for column, datatype in newDict.items()))
print(');')
As shown below, DDL is generated with Redshift equivalent data types.
CREATE TABLE Account (
Id varchar(10),
Name varchar(40),
Phone varchar(20),
Address varchar(250),
Amount bigint,
Start_date timestamp
);
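The same newDict dictionary can drive the typecast query mentioned above. Here is a minimal sketch, assuming a hypothetical staging table named account_staging in which every column was initially loaded as a string.

# Build a SELECT that casts each staged string column to its target type.
STAGING_TABLE = 'account_staging'
casts = ',\n'.join(
    'CAST({0} AS {1}) AS {0}'.format(column, datatype[column])
    for column, datatype in newDict.items()
)
print('SELECT\n{0}\nFROM {1};'.format(casts, STAGING_TABLE))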
Conclusion
With the above utility, you can ingest data from Salesforce into Redshift without worrying about data type conversion. SQL or Athena table DDLs can also be generated, as we did for Redshift. You'll have your table DDL in less than a second, regardless of how many columns the object has.