5 Easy ways to load data in AWS Redshift using Serenytics

Adrien Auclair
Serenytics
5 min read · Apr 9, 2018


Serenytics uses AWS Redshift as an embedded data-warehouse. Here are 5 easy ways to load your data into it:

  • In one click from a CSV file
  • From CSV files on an SFTP server
  • From any data source with an ETL step (Extract Transform Load)
  • With the Serenytics Python client
  • With a REST API call to push events

1. In one click from a CSV file

If you have a 10GB CSV file, this is the easiest way to start analysing its content.

Create a data source of type Big CSV, upload your file, and you’re done: your CSV data has been loaded into a Redshift table.

The issue with this process is that it is manual. You can’t use it to update your data automatically each night.

2. From a CSV file on an SFTP server

If your data is exported each day from one (or several) applications to an SFTP server, you’ll be able to import it into Serenytics in a few clicks.

First, you need to create an empty table to receive the data. In the Data menu, create a Storage data source.

Then, in the Automation menu, create a new Task and choose Load a CSV (or CSV.gz) file from a SFTP server into a Serenytics storage.

In the following form, configure the SFTP server parameters and the filename to import (or a pattern).

In the Execution tab, click on Run Now to execute the import, or schedule it.

And you’re done: every day at 2am (or whatever schedule you chose), the file will be loaded from the SFTP server into the Serenytics data-warehouse (i.e. AWS Redshift).
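For reference, here is roughly what this task automates, written by hand with the paramiko library and the Serenytics Python client (described in method 4 below). It is only a sketch: the server parameters and the storage name are placeholders, and the get_data_source_by_name call is an assumption based on the client’s API reference.

    import paramiko
    import serenytics  # Serenytics Python client, see method 4 below

    # SFTP parameters, as configured in the task form (placeholder values).
    HOST, PORT = 'sftp.example.com', 22
    USER, PASSWORD = 'export_user', 'secret'
    REMOTE_FILE = '/exports/sales.csv'
    LOCAL_FILE = '/tmp/sales.csv'

    # Download the CSV file from the SFTP server.
    transport = paramiko.Transport((HOST, PORT))
    transport.connect(username=USER, password=PASSWORD)
    sftp = paramiko.SFTPClient.from_transport(transport)
    sftp.get(REMOTE_FILE, LOCAL_FILE)
    sftp.close()
    transport.close()

    # Load the downloaded file into the storage (i.e. the Redshift table).
    client = serenytics.Client()
    source = client.get_data_source_by_name('my_storage')
    source.reload_data_from_file(LOCAL_FILE)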

3. Using an ETL step

This feature lets you extract data from any datasource (or any join of multiple sources), transform it, and load it into Redshift. This is useful to wrangle your data, clean it, compute advanced insights… As your project gets more complicated (more sources to combine, more KPIs to compute, more advanced analysis), you will use it a lot.

As above, you first need to create an empty storage in the data-warehouse. Then, in the Automation menu, create a graphical ETL step.

This step lets you configure:

  • the input source: where the data comes from
  • the output source: where the data will be loaded (it is still named a source, as it is a source for the data-visualisation)
  • the transformation to apply to your data before storing it in the destination

In the right pane, you have a preview of the result of your transformation.

This step is powerful, as you can:

  • Take input data from any Serenytics datasource: another storage (i.e. an AWS Redshift table), any join, an XLS file, any SQL server table, Google Analytics…
  • Aggregate your data according to dimensions
  • Select one or several columns from the input data (i.e. the metrics)
  • Select computed columns defined in the input data (to enrich your data)
  • Add filters to select only a subset of the rows
  • Rename columns (just click on the preview column headers)

As in method 2, in the Execution tab, you can execute this step or schedule it.
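
To make the kind of transformation concrete, here is a pandas sketch of what a typical ETL step expresses: filter a subset of the rows, aggregate along a dimension, compute a metric, and rename the output columns. The column names are hypothetical; in Serenytics, the same result is configured graphically, without code.

    import pandas as pd

    # Hypothetical input data, as it could come from any Serenytics source.
    orders = pd.DataFrame({
        'country': ['FR', 'FR', 'DE'],
        'status':  ['paid', 'refunded', 'paid'],
        'amount':  [100.0, 40.0, 80.0],
    })

    result = (
        orders[orders['status'] == 'paid']       # keep only a subset of rows
        .groupby('country', as_index=False)      # aggregate along a dimension
        .agg(revenue=('amount', 'sum'))          # compute the metric
        .rename(columns={'country': 'Country'})  # rename the output columns
    )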

4. With the Serenytics Python client

In case the above methods do not fit your needs, you can load data into a Serenytics Redshift table using our Python client (and run your Python code directly within our platform). This can happen if:

  • you need to write your own connector. For example, this happens if you get your input data from a non-standard API (your Python code will query the API and load the result into Serenytics).
  • your data is the result of an advanced algorithm in Python (your Python script, run in Serenytics, typically gets data from an existing source, applies some computation, and stores the results in a new table).
  • you want to automatically push data to Serenytics (and do not want Serenytics to pull data). In this case, you need to run your Python code on your server.

We have several functions in our client to help you load data into a Serenytics storage (i.e. a Redshift table), depending on the format of your data:

  • reload_data(): loads a list of dict (each dict being a row)
  • reload_data_from_array(): loads a list of column names and a list of lists for the data rows
  • reload_data_from_dataframe(): loads a pandas dataframe
  • reload_data_from_file(): loads data from a local CSV file
  • update_data_from_file(): inserts new data from a file and replaces rows whose primary key matches one in the given file
  • push_data(): appends one row of data
  • batch(): runs insert/update/delete operations on the source using a single batch call

(full documentation about each function is available here: https://doc.serenytics.com/developer/api_reference/python_client/#datasource)

Here is a simple snippet to load data from a Pandas dataframe. It is a minimal sketch: the storage name is a placeholder, and it assumes (based on the API reference above) that a client created without arguments inside the platform finds its credentials automatically.
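
    import pandas as pd
    import serenytics

    # Create the client. When the script runs inside the Serenytics
    # platform, it is expected to find its credentials automatically
    # (assumption based on the API reference linked above).
    client = serenytics.Client()

    # A small dataframe to load (sample values).
    df = pd.DataFrame({
        'country': ['France', 'Germany', 'Spain'],
        'revenue': [1200, 950, 700],
    })

    # Retrieve the storage (i.e. the Redshift table) created beforehand;
    # the name is a placeholder.
    source = client.get_data_source_by_name('my_storage')

    # Replace the table content with the dataframe.
    source.reload_data_from_dataframe(df)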

To run it, create a new Automation of type Python script and paste the code above.

In the Execution tab, you can run your code or schedule it.

5. By pushing messages

This method is useful to store events (e.g. user clicks on buttons).

You first need to create an empty storage in the data-warehouse. In the configuration of the datasource, there is a URL field you can use to push messages with an API call.

You need to do a POST call on this URL to add one row in the data-warehouse. The call can be made from JavaScript in the browser, from a backend, or from any HTTP client.

Once the message is received by the Serenytics server, it may take between one and five minutes to become available within the storage. In the data preview, you can see that our server added two columns: a timestamp and a message uuid.

In Python, you can use the requests library to do the same thing. The snippet below is a minimal sketch; the URL is a placeholder for the one shown in the datasource configuration:
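
    import requests

    # URL copied from the datasource configuration (placeholder value).
    PUSH_URL = 'https://api.serenytics.com/api/data_source/<uuid>/push/'

    # Each POST call appends one row, sent as JSON, to the storage.
    row = {'event': 'button_click', 'user_id': 42}
    response = requests.post(PUSH_URL, json=row)
    response.raise_for_status()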

Conclusion

These 5 ways to load data into our embedded AWS Redshift are a good illustration of the Serenytics philosophy: most use cases are done in a few clicks, but when you need to, you can write code to customize.
