Most business intelligence projects comes to help management view summarized data from company’s sources (applications etc.) and make better decisions as managers and decision makers.
In this case data comes from several web/mobile medical applications, on different topics, languages, media types, countries, units, etc.
Special case warning: Data can’t be reached by the administrator as it holds sensitive private medical data, so it must be anonymized before published (a subject for a different post).
- Collect the data to AWS S3 bucket using cron job script (not AWS Kinesis-Firehose or Kinesis-Data-Streams).
- Invoke AWS server-less lambda for indexing metadata of the data-lake.
- Creating a new database using AWS GLUE Crawler.
- Use AWS Athena to query and add new tables to the database and analyze data in the lake.
- AWS Quicksight business intelligence (BI) application to perform data analysis and visualization.
Collect Data and Migrate to AWS S3
In this article we discuss a database that you cannot dynamically reach, due to private medical data it holds. This means that it’s not allowed to give external entities the DB credentials, moreover, we can’t POST our data to Amazon Kinesis streaming or Amazon Kinesis Firehose for capturing the data into the data lake set. The workaround for me was to build a passive, not event driven, cron job that triggers the applications servers (django in my case) to send an anonymized json file to AWS S3.
Meaning we had to anonymize DB and passively (cron job) send it to AWS S3.
Here below is a glimpse of the script, at first we json-ize the whole DB and then we send it to our S3 bucket.
Indexed Metadata Data-Lake.
At this point we have a json file exported from the servers and we want to build a data lake. How do we do that ?
My straight forward approach was to separate the DB tables to stand alone mode (no connections between tables) and rewrite the data into separate json files inside another bucket also in separate folders as AWS Glue likes it that way. For this task I am using a simple AWS Lambda that triggers every time a big DB file enters into the bucket and divides it into table-like json files in separate folders in another bucket.
AWS GLUE Crawler.
“AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores and data streams.” from AWS docs.
Glue fits like a glove, we got our json data, now we want to move it into a new DB schema that is open for us to do some BI queries.
Here is the Glue crawler I configured, its crawls on our S3 bucket and for each “folder/table.json” we created in the previous section using lambda we now create a database table.
This is real-life ‘checkbox’ programming as I call it. You just pick what you need and it works right away.
Here is the Crawler info:
Crawler info: Name DB-s3–2-athena
Data store S3
Include path s3://buckertname/
IAM role arn:aws:iam::#ID#:role/service-role/AWSGlueServiceRole-s3
Schedule Every 4 hours, Monday through Friday
It runs every 4 hours and the new DB created is updated with our fresh data and looks something like this (for privacy reasons it is blanked table names but I hope you get the point- creating new db with new tables).
AWS Athena, Query The Data Lake.
Amazon Athena is an interactive query service that makes it easy to analyze data. First, we need to connect to the crawler in AWS Glue in order to retrieve database information, this is done in the data source tab (see below).
Now we got all tables accumulated in the Athena database we built, as they were in the original exported from our servers (except for personal info we anonymized) and….
We can query the data now!
We may create new tables if we want (example below).
We can visualize the data with nice graphs.
The Athena queries should be holding all tables we declared in our s3 folder division schema, in a way that every folder in the S3 bucket creates a table in the Athena.
Let’s query data and create new data tables from these queries.
In the screenshot below most of the data is blanked out, but the point is that a new table created “my_new_table” from a simple sql JOIN between “survey” table and “user” table and ”hu” table.
As you can see the new table has been created and we can query it !
Migrate To Quicksight AWS BI tool.
“Amazon QuickSight is a fast, cloud-powered business intelligence service that makes it easy to deliver insights to everyone in your organization.”
After we have built our queries and built new tables respecting those queries, we can import them to Quicksight and view it graphically.
First we need to connect Quicksight to DB (Athena):
- Choose a new analysis, then choose a new dataset, next, is the “Data source name” the DB name we created in Glue and elaborated in Athena.
- Pick a table you wish to view graphically:
3. Choose “visualize”and go to the “new analyses”, proceed and easily build graphs and charts.
We just saw how to create a data lake and made first steps in analyzing the data for BI purpose. All these tools are powerful and are the basic for BI, AI and ML. A few years ago, who would have guessed it could be done without one real computer or one SW license.
Disclaimer (I did not discuss):
- AWS permissions
- Schedules / Dynamic data triggers.
- Medical data anonymization.