Enabling Data Insights Through Slack and AWS
Introduction
Within the Data Technology space in Disney Media & Entertainment Distribution (DMED for short), new data sources are constantly being onboarded onto existing data ingestion processes. With this constant stream of new data comes a request for analysis of the key insights that can be gleaned from those datasets. While that analysis is valuable (after all, knowing the data inside and out is the only way to become an expert in the area), engineers on the team started to wonder whether the process could be automated.
Around the same time that brainstorming for a solution began, the team was also focused on a migration to Databricks. While Databricks is comparable to the previous tool in most areas, its ability to easily support a variety of visualization libraries puts it in a league of its own. Engineers quickly realized that they could take advantage of this and use those plots to present data insights in an easy-to-understand manner.
Solution
To automate this process, the team looked to Slack and AWS to handle all infrastructure needs. The architecture is as follows:
To summarize the figure above, the team schedules Databricks notebooks (backed by GitLab for easy collaboration) to run the necessary queries and uses the Plotly library in Python to format the results as graphs. The graphs are then saved to DBFS and exported to S3. From S3, bucket notifications are configured to publish to an SNS topic, which feeds into an SQS queue.
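As a rough sketch of what the notebook side of this might look like (the dataset, bucket, and key names below are placeholders, and the kaleido package is assumed to be available for static image export):

```python
# Runs inside a Databricks notebook, where `spark` is the provided SparkSession.
# Dataset, bucket, and key names are hypothetical placeholders.
import os
import boto3
import plotly.express as px

# Run the insight query and pull the (small) result set to the driver.
pdf = spark.sql(
    "SELECT event_date, COUNT(*) AS events FROM example_dataset GROUP BY event_date"
).toPandas()

# Format the result as a Plotly graph.
fig = px.line(pdf, x="event_date", y="events", title="Daily Event Volume")

# Save the rendered image to DBFS (mounted locally under /dbfs),
# then export it to S3 so the bucket notification (S3 -> SNS -> SQS) fires.
local_path = "/dbfs/tmp/insights/daily_event_volume.png"
os.makedirs(os.path.dirname(local_path), exist_ok=True)
fig.write_image(local_path)  # static export requires the kaleido package

boto3.client("s3").upload_file(
    local_path,
    "example-insights-bucket",
    "graphs/daily_event_volume/daily_event_volume.png",
)
```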
The SQS queue’s messages trigger a Lambda function, which determines the S3 folder the graph was saved to and sends it to the proper Slack channel via the Slack API (using an OAuth token stored in Secrets Manager). The Slack message is formatted from a JSON payload describing the query being visualized, with that payload stored in a separate location in the S3 bucket. The Lambda function is Python-based and deployed from an ECR image.
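A minimal sketch of such a handler is below. The secret name, key layout, and folder-to-channel mapping are all hypothetical, slack_sdk is assumed to be installed in the image, and the parsing assumes raw message delivery is disabled on the SNS-to-SQS subscription:

```python
# Sketch of the SQS-triggered Lambda handler (names and layout are illustrative).
import json
from urllib.parse import unquote_plus

import boto3
from slack_sdk import WebClient

s3 = boto3.client("s3")
secrets = boto3.client("secretsmanager")

# Hypothetical mapping from the S3 folder a graph lands in to a Slack channel ID.
FOLDER_TO_CHANNEL = {"daily_event_volume": "C0123456789"}


def handler(event, context):
    # The Slack OAuth token lives in Secrets Manager, not in the code.
    token = secrets.get_secret_value(SecretId="slack/insights-bot-token")["SecretString"]
    slack = WebClient(token=token)

    for record in event["Records"]:
        # Each SQS body wraps the SNS envelope, which in turn wraps the S3 event.
        s3_event = json.loads(json.loads(record["body"])["Message"])
        for s3_record in s3_event.get("Records", []):
            bucket = s3_record["s3"]["bucket"]["name"]
            key = unquote_plus(s3_record["s3"]["object"]["key"])
            folder = key.split("/")[1]  # e.g. graphs/<folder>/<file>.png

            # The JSON payload describing the query lives under a separate prefix.
            desc = json.loads(
                s3.get_object(Bucket=bucket, Key=f"descriptions/{folder}.json")["Body"].read()
            )

            # Download the graph and post it with its description.
            local_path = f"/tmp/{key.rsplit('/', 1)[-1]}"
            s3.download_file(bucket, key, local_path)
            slack.files_upload_v2(
                channel=FOLDER_TO_CHANNEL.get(folder, "C0123456789"),
                file=local_path,
                initial_comment=desc.get("description", ""),
            )
```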
Besides building out the infrastructure and writing the insights themselves, the only other work needed was creating a separate Databricks cluster for these jobs due to write-access restrictions on the day-to-day cluster. By default (and for good reason), the ad-hoc cluster can only write to S3 buckets in the Dev environment. Since Prod access was also needed in this case, a new Databricks cluster, backed by an instance profile and a federated role with the proper access, was built. Also worth noting is that the base Databricks image did not have Plotly and related libraries installed, so a custom Dockerfile was used to build the Databricks image for this work.
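For reference, a dedicated cluster like this can be created through the Databricks Clusters REST API. The sketch below is illustrative only; the workspace host, token, runtime and node versions, role ARN, and image URL are all placeholders rather than the team's actual values:

```python
# Sketch of creating the dedicated cluster via the Databricks REST API.
import requests

DATABRICKS_HOST = "https://example.cloud.databricks.com"
TOKEN = "<personal-access-token>"

cluster_spec = {
    "cluster_name": "insights-publisher",
    "spark_version": "11.3.x-scala2.12",
    "node_type_id": "i3.xlarge",
    "num_workers": 2,
    # Instance profile backing the federated role with Prod S3 write access.
    "aws_attributes": {
        "instance_profile_arn": "arn:aws:iam::123456789012:instance-profile/insights-writer"
    },
    # Custom image built from a Dockerfile that adds Plotly and related libraries.
    "docker_image": {
        "url": "123456789012.dkr.ecr.us-east-1.amazonaws.com/databricks-plotly:latest"
    },
}

resp = requests.post(
    f"{DATABRICKS_HOST}/api/2.0/clusters/create",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=cluster_spec,
)
resp.raise_for_status()
print(resp.json()["cluster_id"])
```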
Here’s an example of one of the plots this process would generate:
Next Steps
While this project is still in its early phases, the team is looking forward to the next steps of the process, which are:
- Brainstorming the insights to report for our datasets
- Putting checks into the Lambda function (or on the Databricks end) so that no unintended data is leaked into the output
- Finding a logical schedule for the Databricks jobs so as not to overload the Lambda function
Acknowledgments
Acknowledgments for this initiative go to Carmen Nigro and Phil Arena for helping with the implementation, as well as JP Marimuthu for bringing the idea to the table in the first place.