Serverless BI: A data-driven path to digital transformation
Adopting a Serverless approach simplifies an organisation’s path to data-driven Business Intelligence.
A full commitment to being data-driven can benefit organisations of all kinds and sizes. A data-driven approach allows large companies to achieve effective digital transformation, whilst allowing smaller companies to productize their valuable data and effectively leverage the promise of artificial intelligence (AI). The key benefit of being truly data-driven is that it empowers any organisation to make the right decisions at the right time.
Data-driven organisations need every stakeholder at every level of the organisation to have access to the data they need, when they need it. Sometimes access is as simple as knowing set metrics (e.g., number of orders, revenue or conversion), and other times it’s about quickly making discoveries from data with the right tooling.
Developing an effective data-driven strategy often requires Business Intelligence (BI) tooling. BI is not a new concept; its origins in trade go back centuries, and it crossed into the sphere of technology at IBM in the late 1950s. Today, Business Intelligence has become a widely recognized approach among management practitioners. That being said, according to NewVantage Partners’ 2019 Big Data and AI Executive Survey:
“69% [of executives] report that they have not created a data-driven organization”
The reasons many executives haven’t been able to transform their organisations into being data-driven are manifold. Some of these are, of course, organisational — larger companies often lack the agility required to transform the process and mindsets for their organisations to become data-driven. There are also many failed transformations due to technology, with existing IT teams unable to create the tooling needed to allow companies access to accurate, reliable and up-to-date data.
Cultural challenges often present a significant hurdle to organisations attempting a transformation to becoming data-driven. Ensuring a smooth BI technological integration can help mitigate cultural challenges — keeping costs low and minimizing business disruption will allow for proper investment into developing a data-driven culture. Further to this point, businesses need to increase flexibility in their business processes; it’s critical that the technology used keeps up with the changes made in those processes. As is the case with any software, the approach needs to be iterative, responding to the needs of the users to increase adoption.
Where does Serverless come in?
Serverless is a broad and polymorphic space, and is rapidly evolving. As a term, it’s widely used, but not well understood. For the purposes at hand, let’s think of Serverless as allowing applications and services to be run without having to manage the underlying infrastructure, as well as using a “buy-not-build” approach to services that have been commoditised.
In a Serverless approach, companies rarely build commoditised capabilities such as authentication themselves; instead, they consume them as a service (e.g., Amazon Cognito, Okta, Auth0). This use of third-party services, together with the pay-per-use model, makes Serverless an ideal fit for BI systems and data pipelines, as both the Total Cost of Ownership (TCO) and the time to market are reduced.
There are security benefits as well. The underlying infrastructure and many critical security functions are managed by the cloud provider, making critical data security simpler and easier to audit and manage.
Data governance concerns can also be simplified with a Serverless approach by employing a repeatable architecture that can easily be deployed across many regions. While Infrastructure as Code is not a new concept, Serverless architectures take it to the next level, as more of the application infrastructure is simply the orchestration of cloud resources. The abstraction provided by Serverless services and the increased usage of cloud-native solutions make multi-region deployments easier to manage and maintain.
As an example, while working at Theodo with a startup revolutionizing the digital therapeutics space for neurodegenerative disease, our team ensured the entire architecture could be deployed consistently across many regions. This made it simpler to meet the regulatory constraints on this data, and global data aggregation could be handled in a dedicated pipeline that removed personally identifiable information (PII) and other sensitive data. Further, many of the audits were simplified by the lack of self-managed infrastructure.
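To make the idea of a PII-removing aggregation step more concrete, here is a minimal sketch in Python. It is not the pipeline from that project: the field names, the `user_id` key and the secret are all hypothetical, and a real implementation would depend on the actual schema and regulatory advice. Note that replacing an identifier with a keyed hash is pseudonymisation rather than full anonymisation.

```python
import hashlib
import hmac

# Hypothetical field names; the real schema is not given in this article.
PII_FIELDS = {"name", "email", "date_of_birth", "address"}


def pseudonymise(user_id: str, secret: bytes) -> str:
    """Replace an identifier with a keyed hash so records can still be grouped."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()


def strip_pii(record: dict, secret: bytes) -> dict:
    """Return a copy of `record` with PII fields dropped and the user id hashed."""
    cleaned = {k: v for k, v in record.items() if k not in PII_FIELDS}
    if "user_id" in cleaned:
        cleaned["user_id"] = pseudonymise(str(cleaned["user_id"]), secret)
    return cleaned


if __name__ == "__main__":
    regional_record = {
        "user_id": "u-123",
        "email": "reader@example.com",
        "region": "eu-west-1",
        "session_minutes": 42,
    }
    # The secret here is a placeholder; in practice it would come from a secrets store.
    print(strip_pii(regional_record, secret=b"placeholder-secret"))
```

Each regional deployment could run a step like this before forwarding records to the global aggregation pipeline, so that raw identifiers never leave the region.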
Finally, pay-per-use is a natural fit for many BI needs, as reports are run periodically and access tends to be sporadic. This allows cost-efficient solutions for large organisations as well as affordability for startups.
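A rough way to reason about this is a break-even calculation between a pay-per-use query service and an always-on cluster. The sketch below is illustrative only: every price and volume in it is an assumed placeholder, not a quoted AWS rate, so substitute the current figures for your region before drawing conclusions.

```python
def pay_per_use_cost(queries_per_month: int, tb_scanned_per_query: float,
                     price_per_tb: float) -> float:
    """Monthly cost when billed only for the data each query scans."""
    return queries_per_month * tb_scanned_per_query * price_per_tb


def always_on_cost(hourly_rate: float, hours_per_month: int = 730) -> float:
    """Monthly cost of a cluster that runs whether or not anyone queries it."""
    return hourly_rate * hours_per_month


def break_even_queries(hourly_rate: float, tb_scanned_per_query: float,
                       price_per_tb: float, hours_per_month: int = 730) -> float:
    """Number of monthly queries at which both models cost the same."""
    cost_per_query = tb_scanned_per_query * price_per_tb
    return always_on_cost(hourly_rate, hours_per_month) / cost_per_query


if __name__ == "__main__":
    # All numbers below are assumptions for illustration.
    price_per_tb = 5.0           # assumed price per TB scanned
    tb_scanned_per_query = 0.01  # assumed 10 GB scanned per report
    hourly_rate = 0.25           # assumed hourly price of a small cluster

    print("pay-per-use, 200 reports:",
          pay_per_use_cost(200, tb_scanned_per_query, price_per_tb))
    print("always-on cluster:", always_on_cost(hourly_rate))
    print("break-even reports per month:",
          break_even_queries(hourly_rate, tb_scanned_per_query, price_per_tb))
```

With sporadic usage, the number of queries tends to sit far below the break-even point, which is where the pay-per-use model earns its keep.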
Practical Serverless BI — In 4 Steps
For the purpose of this exercise, let’s take a simple case of an e-commerce company selling books. This company’s existing application happens to have a 100% Serverless architecture on AWS, and until now, the only analytics have been provided by Google Analytics. Other than that, executives have had to hunt down developers to run manual data extracts from the live databases. Let’s call our company “BookLess”.
This example will focus on building a Serverless BI solution for a Serverless application, but similar techniques can be used for more traditional architectures in a hybrid approach. The benefits of Serverless can still be realised in the BI tooling and data pipelines.
Step 1: Storage — The Serverless Data Lake
Where the data is stored is the first concern of any BI strategy. Traditionally, many BI solutions work off relational databases using SQL as their lingua franca, in combination with other less structured sources. Sometimes data is aggregated in a data warehouse. Other times, it’s pulled out of a range of application-specific databases, or from a largely unstructured collection of data known as a “Data Lake”.
Our company BookLess has a completely Serverless application architecture; therefore, it will not be using a traditional relational database. Instead, all its application data will be stored in a number of DynamoDB tables (DynamoDB being a low-latency Serverless NoSQL database). As the team has years of data but has had only limited analytics so far, they want to get all this data into a data lake for usage by the new data team (that’s us!).
Amazon S3 (an object storage service) makes for a great data lake solution. It’s Serverless, almost infinitely scalable and has mature tooling around security, encryption and integration with other cloud services. One of the concerns of companies adopting newer technologies is the lack of mature tooling, yet services like S3, which happens to be Serverless, already have a mature tooling ecosystem.
🗂 Real World Case Study: FactSet automated DynamoDB data exporting to Amazon S3 Parquet to create a Serverless data analytic platform.
To get the DynamoDB data into S3 we will make use of DynamoDB streams, Lambda (Function-as-a-Service) and Kinesis Firehose (a data streaming service that can target data lakes, data stores and analytics tools).
Every time an item is modified in a DynamoDB table, a stream captures the change in a time-ordered sequence. These changes are then picked up by AWS Lambda, which can process them in batches, and sent to Kinesis Firehose, which writes them into S3. Using Kinesis instead of our own hosted Kafka again applies the tenet of third-party integration, reducing the TCO.
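The Lambda step in this flow can be sketched in Python roughly as follows. This is an assumption-laden outline rather than BookLess's actual code: the delivery stream name `bookless-orders` and the output row layout are hypothetical, the small `from_dynamodb` helper covers only the most common attribute types, and the Firehose client is injectable so the transformation can be exercised without AWS access.

```python
import json
from typing import Any

FIREHOSE_BATCH_LIMIT = 500  # PutRecordBatch accepts at most 500 records per call


def from_dynamodb(attr: dict) -> Any:
    """Convert one DynamoDB-typed attribute, e.g. {"N": "12"}, to a plain value."""
    (type_tag, value), = attr.items()
    if type_tag == "S":
        return value
    if type_tag == "N":
        try:
            return int(value)
        except ValueError:
            return float(value)
    if type_tag == "BOOL":
        return value
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [from_dynamodb(v) for v in value]
    if type_tag == "M":
        return {k: from_dynamodb(v) for k, v in value.items()}
    raise ValueError(f"unsupported DynamoDB type: {type_tag}")


def to_firehose_records(event: dict) -> list:
    """Turn a DynamoDB Streams event into newline-delimited JSON records."""
    records = []
    for rec in event.get("Records", []):
        change = rec["dynamodb"]
        # Which images are present depends on the stream view type.
        image = change.get("NewImage") or change.get("OldImage") or {}
        row = {
            "event_name": rec["eventName"],  # INSERT, MODIFY or REMOVE
            "keys": {k: from_dynamodb(v) for k, v in change["Keys"].items()},
            "item": {k: from_dynamodb(v) for k, v in image.items()},
        }
        records.append({"Data": (json.dumps(row) + "\n").encode("utf-8")})
    return records


def handler(event, context, firehose=None, stream_name="bookless-orders"):
    """Lambda entry point; `stream_name` is a hypothetical delivery stream."""
    if firehose is None:
        import boto3  # provided by the AWS Lambda Python runtime
        firehose = boto3.client("firehose")
    records = to_firehose_records(event)
    failed = 0
    # Respect the record-count limit; the per-call size limit is not handled here.
    for start in range(0, len(records), FIREHOSE_BATCH_LIMIT):
        response = firehose.put_record_batch(
            DeliveryStreamName=stream_name,
            Records=records[start:start + FIREHOSE_BATCH_LIMIT],
        )
        failed += response.get("FailedPutCount", 0)
    if failed:
        # Raising makes Lambda retry the whole batch, which may duplicate rows.
        raise RuntimeError(f"{failed} records were rejected by Firehose")
    return {"forwarded": len(records)}
```

Raising on partial failure is the simplest choice, but it trades duplicates for durability; a more careful version would resend only the rejected records.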
🗂 Real World Case Study: Woot.com built a cloud-native data warehousing solution, replacing a five-year-old legacy system in just 3 months.
BookLess also wants all their data in one place, and they want more control and flexibility over their frontend analytics (currently provided by Google Analytics). While they will keep Google Analytics for conversion funnel and demographic analysis, they want similar and other custom metrics from their data lake. Their frontend application, like most today, is a Single Page Application (SPA).
It’s quite a simple job to send their existing events to another analytics provider. Here again, we will use Kinesis Firehose into S3, but this time directly from the frontend client via the Amazon Kinesis Firehose Analytics Provider from the AWS Amplify Library. AWS Amplify provides a wide range of functionality, but its library can be used independently to simplify integration with AWS services from the frontend.
Step 2: Query — Getting Data From The Lake
We now have the data in our lake, and the business stakeholders are keen for insights. As the data is stored in an efficient format for query, thanks to Parquet (an open-source, performant columnar data format), we can begin to gather insights using Amazon Athena (an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL).
Typically, querying this data would have required a complex extract, transform, load (ETL) process into a data warehouse, which can involve complex infrastructure management and move us away from our Serverless target. Managed ETL services such as AWS Glue are fully managed and pay-per-use, but even then they require a more sophisticated skill set. For BookLess, we’ll use Athena, which allows simple access to the insights hidden in our data lake until we grow to need more advanced tooling.
Athena is also completely Serverless, meaning we’ve not added any complex infrastructure to manage, cost is per-use and typical SQL can be used by the application developers to run queries for the business. This is obviously not the data-driven dream we were aiming for, with everyone in the organisation empowered by access to insights, but it is a lot better than the previous scenario, and we’re closer to our goal.
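As a sketch of what running such a query from code might look like, the Python helper below starts an Athena query, polls until it finishes and returns the first page of results as dictionaries. The table name `order_changes`, the database and the S3 output location are hypothetical, and the client is passed in (for example `boto3.client("athena")`) so the function itself carries no AWS dependency.

```python
import time


def run_athena_query(athena, sql, database, output_location,
                     poll_seconds=1.0, timeout_seconds=60.0):
    """Run `sql` on Athena and return the first page of rows as dicts."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Athena is asynchronous, so poll until the query reaches a final state.
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            break
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"query {query_id} ended in state {state}")
        if time.monotonic() > deadline:
            raise TimeoutError(f"query {query_id} did not finish in time")
        time.sleep(poll_seconds)

    # Only the first page is read here; larger results need pagination.
    result = athena.get_query_results(QueryExecutionId=query_id)
    rows = [
        [col.get("VarCharValue") for col in row["Data"]]
        for row in result["ResultSet"]["Rows"]
    ]
    header, data = rows[0], rows[1:]  # for SELECT queries the first row is the header
    return [dict(zip(header, values)) for values in data]


# Hypothetical query against a hypothetical table of change events.
CHANGES_BY_TYPE = """
SELECT event_name, count(*) AS changes
FROM order_changes
GROUP BY event_name
"""
```

A call might look like `run_athena_query(boto3.client("athena"), CHANGES_BY_TYPE, "bookless", "s3://example-bucket/athena-results/")`, where the database and bucket are placeholders. All values come back as strings, so numeric columns need converting by the caller.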
As queries get more complex, it can make sense to bring in a data warehouse to allow more dynamic queries to be run. Amazon Redshift will integrate well with the data lake we’ve established, and Amazon Redshift Spectrum allows Redshift to query data directly from files on Amazon S3 (in a similar manner to Athena). The data is queried in situ rather than first being copied into the Redshift cluster.
The use of the term “cluster” should have set some alarm bells ringing. Amazon Redshift (with the exclusion of Spectrum) is, sadly, not Serverless. You need to choose your cluster type, and although pay-per-hour pricing is available, the cluster remains yours to size and manage. In exchange, it does provide more functionality. We will decide between the two and show how they can work together next.
Step 3: Insights and Analytics
So far, we’ve moved data around and formatted it for developers and data engineers to be able to do ad-hoc queries. Now, we need to put the data in the hands of the business.
Data-driven organisations need everyone at every level of the organisation to have access to the right data at the right time.
We need the ability for data to be queried by stakeholders without SQL skills, and we need to combine multiple data sources, along with data visualisation, dashboard creation, sharing and automated reports.
Amazon QuickSight provides the functionality to pull data from the data lake we’ve established, as well as from other data sources inside and outside of AWS. Users can interact with the data without needing to understand SQL, and interactive dashboards can be created and shared. Additionally, automated reports can be generated and sent to stakeholders’ inboxes, which will help cultivate a data-driven culture and allow everyone in the organisation to see the value of the data-driven approach.
QuickSight can pull directly from S3, as well as from Athena and Redshift. The exact configuration needed depends on the specific reporting needs of the organisation. Direct to S3 is the simplest, but most limited version, followed by Athena and then Redshift. Startups, and those beginning their journey to being data-driven, will get good results with Athena and QuickSight. As an organisation grows into its data-driven approach, Redshift can provide additional functionality.
Another useful aspect of Amazon QuickSight is the mobile app, which provides access to dashboards and offers the ability to interact with data while on the move.
Although it can take time to fully transition an organisation to a data-driven approach, getting stakeholders to engage with the app version rather than relying on auto-generated reports can be a big win in terms of security and data protection.
The app can authenticate with traditional methods as well as with FaceID, allowing users fast and secure access, which is a big security win over PDFs sent by email. This can also be a strong improvement to the GDPR, PII and security aspects of an organisation’s reporting process.
Step 4: Artificial Intelligence — Discovery and Future-proofing
Companies look to gather clean data because it not only aids their day-to-day reporting, but also helps them leverage AI and machine learning (ML) to stay ahead of competitors in the future.
Amazon QuickSight offers ML Insights to parse natural language, forecast trends and discover insights. This can be taken further with the (still in preview) integration with Amazon SageMaker, a cloud machine learning platform. Existing SageMaker models can be used inside of QuickSight to augment reporting, and we anticipate such integrations will develop further.
As companies become data-driven, data becomes a first-class citizen. This means a company’s day-to-day operations involve keeping data clean, consistent and up to date, as it directly impacts reporting and the ability to gain business insights. This has the added benefit of ensuring companies look after an asset of growing value: their data. Whether the aim is to commoditise the data at a later point, to learn from users or simply to maintain a data-driven culture, this resource will pay dividends in the not-too-distant future.
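Keeping data clean, consistent and up to date can be partly automated. The Python sketch below shows one possible shape for such checks, assuming a hypothetical order schema (`order_id`, `total`, `created_at`); the thresholds and field names are illustrative and not taken from any real BookLess dataset.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical schema for an order record.
REQUIRED_FIELDS = ("order_id", "total", "created_at")


def record_issues(record: dict) -> list:
    """List completeness and validity problems found in a single record."""
    issues = [f"missing {field}" for field in REQUIRED_FIELDS
              if record.get(field) in (None, "")]
    total = record.get("total")
    if isinstance(total, (int, float)) and total < 0:
        issues.append("negative total")
    return issues


def is_fresh(records: list, now: datetime, max_age: timedelta) -> bool:
    """Check that the newest record is recent enough for reporting."""
    timestamps = [r["created_at"] for r in records
                  if isinstance(r.get("created_at"), datetime)]
    newest = max(timestamps, default=None)
    return newest is not None and now - newest <= max_age


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    orders = [
        {"order_id": "o-1", "total": 9.99, "created_at": now - timedelta(hours=2)},
        {"order_id": "", "total": -1, "created_at": now - timedelta(days=3)},
    ]
    for order in orders:
        print(order["order_id"] or "<blank>", record_issues(order))
    print("fresh:", is_fresh(orders, now, max_age=timedelta(days=1)))
```

Checks like these could run on a schedule against the data lake, with failures surfaced on the same dashboards the business already uses.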
A Serverless BI solution ensures that investment is focused on getting the right data to the right people at the right time. The pay-per-use model ensures that large companies don’t waste money on licences for unused systems, and it lowers the barrier to entry for smaller companies to become data-driven, essentially democratising the data-driven approach to business intelligence.
The right tooling can help ensure success in the cultural changes needed to become data-driven. Once success is realised, the underlying data lakes and pipelines become assets for the future.
It’s important that AI and machine learning be considered, even if not immediately adopted. The data gathered today will be the data used to power future AI/ML systems. Ensuring day-to-day operational data quality by leveraging data-driven analytics today is essential for the future as well as the present.
By coupling the right architectural approach with the treatment of data as a first-class citizen, stored in a format that integrates easily with other tools, we can sow the seeds for businesses to succeed in the Serverless AI revolution around the corner.