Re: dash — Python Data Source

Published in

Geek Culture

5 min readAug 18, 2021

Redash is an open-source free BI tool that users can easily set up on their machine using docker. It is very lightweight and easy to use for analyzing data across multiple data sources. Redash v10-preview version is out and you can try it using the below tutorial.

Re: dash — Open Source BI tool

Old days of querying the data on a black terminal or pulling it into an Excel sheet for analysis are long gone. Data…

medium.com

Redash comes out of the box with 20+ data sources, it can connect to any open source or cloud database easily by using the data source UI. However, there are some other data sources available in Redash that can be used to pull data from URLs or APIs. These data sources are not enabled by default and are also recommended to be used on personal systems only due to security reasons.

We will be discussing on How to use Python as a data source to achieve it in this article.

Supported Data Sources

Edit description

redash.io

Use Case

As engineers or analysts for ad-hoc data analysis, we often have to add a decoding layer on top of our data before sharing or presenting it on an end-user dashboard. For example, a city_id = 1 will never make sense to anyone viewing the dashboard or a complex visualization. Adding the mappings or dimensions data set to the back-end data source is not a good design option in such cases as the data set will no longer be needed after the analysis.

Secondly, there will be an extra overhead of adding and removing the data set from a production system which can take days to weeks depending on the schedule of your data team, who are most keen to avoid one-off requests :)

Python Data Source

In such a scenario, We need a data source that can read the data from an external data source and also has the flexibility to manipulate the result set to join with other data queries.

Python data source in redash gives the users this flexibility to query the external URL/APIs and pull the one-off decode files for ad-hoc analysis. The current data source does not support global python functions but it gives a wide range of functionality through some predefined functions, default print function is also available for easy debugging of the scripts.

How to enable?

By default python is not available as a data source, users need to update the redash.env file and add the below line while setting up the Redash instance. this option is only available in the self-hosted instances or users can check with their Redash admin team for enabling it.

REDASH_ADDITIONAL_QUERY_RUNNERS=redash.query_runner.python

If you have a redash instance already running, please restart your celery workers after setting the parameter.

How to Use?

Redash requires every query to return a dict type result object in python for visualizing the records on a table or graph format. All the built-in query runner classes for databases return a dictionary object for every query and the format looks like below.

Result object format :

result = {
    "rows": [
        {"id": 1, "value": "s2f1"},
    ],
    "columns": [
        {
            "name": "id",
            "friendly_name": "id",
            "type": TYPE_INTEGER,
        },
        {
            "name": "value",
            "friendly_name": "value",
            "type": TYPE_STRING,
        }
    ]
}

Every Result Object contains two keys

rows key contains the list of n number output records for the query. Each row element can contain n number of columns in the output. Each row element has to be in JSON format.
columns key contains the schema for the records. Each schema field has to have two values name & type of the field. Python query runner comes with predefined data types for the result object, mentioned below

# Supported data types
- TYPE_DATETIME                                   
- TYPE_BOOLEAN                                  
- TYPE_INTEGER                                   
- TYPE_STRING                                   
- TYPE_DATE                                   
- TYPE_FLOAT

So in the case of python data sources, users have to format the output of their python script and return a formatted result object.

Functions:

Python query runners also support a group of functions for reading, updating result objects from other queries. Users can read the output of other queries and can join the output from multiple queries using python.

## Return the result Object for a predefined query
get_query_result(<Redash Query ID>):                               ## Print Table Schema Information for a data source  
get_source_schema(<Redash Data Source Name or ID>)  
                                
## Execute a query and return result object against a data source in ## python
execute_query(<Redash Data Source Name or ID>, <SQL QUERY>)                                  ## Add new column to the result object
add_result_column(<result object>, <field_name>, <field_name_for_displauy>, <field_datatype>)## Add a new row to the result object
add_result_row(<result object>, <values>)

Example

We will pull a file from a URL and will create a data set in redash to use the file as a mapping file.

Create the Python Data Source & Add all the modules you need for your script.

Create a Python Script and pull the file from the git hub.

Execute the script and get the output. it will print all the logging information we added using the print function with timestamps and it can help you to capture the time taken on each step.

Functions Example

Getting results of an old query and adding a new record to the result set.

Python data source is not limited to pulling the data from URL or APIs only, it can be used in many ways. it will not feasible to list all of them under the scope of this article. Please try out all the in-built functions and let me know if you face any issues. you can get the code for the examples on the below link, I will try to add more stuff in future.

Thanks for reading, Have a nice time !!