Building a Cloud Data Application with Snowflake

Learn how to monetize data and build data-intensive applications with the Snowflake Data Cloud.

In September 2019 I shared "One success story of the cloud delivery", a brief description of a small data project being developed with cutting-edge cloud BI tools and technologies such as Snowflake, AWS, Dataiku, and Apache Superset, a project I have the pleasure of being part of. Today I have an update for you. We have not been sleeping during those 14 months but working intensively, with a focus on bringing more value to customers, making the solution more secure thanks to the latest available Snowflake features, and following best practices for the architecture of data apps in the cloud. Last but not least, we have been making the solution easier to maintain and extend. All of that still with a small team of just a few developers. In November 2020 we reached another important milestone: our data app became a PRODUCT. It is now offered to customers who can buy subscriptions that give them access to various dashboards and to important insights gained from their own data.

This is a story about the evolution of a small POC into a serious data application with strong business value for both parties, the data producer and the data consumer. It is a story about the monetization of enterprise data and about bringing a new revenue stream to life. It is a story showing why the latest cloud technologies are the right choice for modern data apps. It is a story about building trust between an enterprise and its customers. In this blog post, I would like to walk through the whole journey from different perspectives: starting with the project architecture and the technologies used, continuing with the business value behind the project and why you should always be thinking about how to monetize your data, and ending with future plans, because this is not all. Software development is an iterative process and we are not done yet. 🤟🏻

Architecture perspective

As I have mentioned many times, we have been developing a cloud data app. It runs in the AWS cloud and, as such, uses many AWS services, but the key building block is the Snowflake Data Cloud. I want to avoid detailed architecture schemas with tens of boxes and lines where you need a guide to read them properly. Instead, let's look at the architecture from different perspectives. I have already introduced the security-related architecture improvements in a post about data security in Snowflake.

It is worth starting with an overview of the technologies and tools we use, to emphasize that in today's decoupled world it is not about two or three tools but many more. Here we go. From AWS we use the following services:

Last but not least, we also use services which are not "visible" anywhere but are still necessary, like Amazon VPC, AWS Identity and Access Management, or Elastic Load Balancing. A pretty long list, isn't it?

Of course, we are not 100% AWS-only; there are several other tools that play a key role in the whole setup, such as the already mentioned Snowflake or Dataiku Data Science Studio, which is our tool for building and orchestrating data pipelines.

Technology stack

For the presentation layer we use open-source tools from the Apache family: Apache Superset together with Apache ECharts, which brings stunning visualizations. This is also a rather unique aspect, as we are trying to integrate an open-source reporting tool into the enterprise world. This "big corp" environment is used to tools like Tableau, Qlik, Power BI, or Looker, but we have taken a different path and are trying to keep costs at a reasonable level.

High-level architecture

Project technical progress

We have come a long way since the project started. We began as a small POC focused on fast delivery of the first use case to validate the idea behind it. We had 10 weeks at the beginning to deliver the first insights to customers: 10 weeks in which we had to build everything, from the infrastructure, through the data pipelines and database work, up to dashboard development, which goes hand in hand with building up the whole reporting platform. The main focus was on speed. There was no time for fancy things like long discussions about how to do DevOps or building strong automation behind the scenes.

But as time goes on, we have been moving toward a more automated solution with less maintenance, and we are getting there step by step. We have started to build automated test cases using Selenium, which help us speed up both deployment and testing. We have been adding more and more logging for auditing purposes. We have implemented auto-scaling to be prepared for thousands of customers. We have added a load balancer to eliminate application downtime. We are slowly moving toward a solution which will be more like a CI/CD pipeline than a traditional deployment model. There is still a long journey ahead of us.
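To give an idea of what those automated test cases look like, here is a minimal Selenium sketch in Python. The URL, credentials, and element locators are purely illustrative assumptions, not our real setup; the sketch only shows the pattern of logging in and verifying that a dashboard actually renders.

```python
# Minimal Selenium smoke test sketch for a Superset dashboard.
# The URL, credentials, and element locators below are illustrative
# assumptions only; adjust them to your own deployment.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

BASE_URL = "https://reporting.example.com"  # hypothetical app URL

driver = webdriver.Chrome()
try:
    # Log in through the standard Superset login form.
    driver.get(f"{BASE_URL}/login/")
    driver.find_element(By.ID, "username").send_keys("test_user")
    driver.find_element(By.ID, "password").send_keys("test_password")
    driver.find_element(By.CSS_SELECTOR, "input[type='submit']").click()

    # Open a dashboard and wait until at least one chart is rendered.
    driver.get(f"{BASE_URL}/superset/dashboard/invoices-overview/")
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located(
            (By.CSS_SELECTOR, ".dashboard-component-chart-holder")
        )
    )
    print("Smoke test passed: dashboard rendered at least one chart.")
finally:
    driver.quit()
```

Tests like this run as part of the deployment checks, so a broken login flow or an empty dashboard is caught before customers see it.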

Use case description

What kind of problem does this project solve? In general, it provides valuable data insights about invoices (timelines and trends, cost segmentation, services overview) and an overview of usage data, aggregated across various dimensions. All of that comes with many filters which make it even easier to find the proper data set for an analysis.

In the usage data area, the most valuable part is the roaming cost analysis, because in today's world, full of unlimited subscriptions for mobile services, roaming is one of the very few items on your invoice through which you can significantly affect your bill. Analysing roaming data to find outliers and subscribers on an incorrect price plan can bring serious cost savings.
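As an illustration of what such an outlier analysis can look like, here is a sketch using the Snowflake Python connector. The table and column names (roaming_usage, subscriber_id, roaming_cost) are hypothetical, not our actual data model; the idea is simply to flag subscribers whose monthly roaming cost is far above the average of their peers.

```python
# Sketch: flag subscribers whose monthly roaming cost is more than 3
# standard deviations above the average for that month. Table, column,
# and connection values are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder connection parameters
    user="my_user",
    password="my_password",
    warehouse="ANALYTICS_WH",
    database="TELCO",
    schema="USAGE",
)

OUTLIER_SQL = """
WITH monthly AS (
    SELECT subscriber_id,
           DATE_TRUNC('month', usage_date) AS month,
           SUM(roaming_cost)               AS roaming_cost
    FROM roaming_usage
    GROUP BY subscriber_id, DATE_TRUNC('month', usage_date)
),
stats AS (
    SELECT month,
           AVG(roaming_cost)    AS avg_cost,
           STDDEV(roaming_cost) AS std_cost
    FROM monthly
    GROUP BY month
)
SELECT m.subscriber_id, m.month, m.roaming_cost
FROM monthly m
JOIN stats s USING (month)
WHERE m.roaming_cost > s.avg_cost + 3 * s.std_cost
ORDER BY m.roaming_cost DESC
"""

with conn.cursor() as cur:
    for subscriber_id, month, cost in cur.execute(OUTLIER_SQL):
        print(f"{subscriber_id} spent {cost} on roaming in {month}")
conn.close()
```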

Why would any company make such data available to its customers, and even help them understand it correctly? This is about building a trusted partnership between both parties. An enterprise that appreciates its customers tries to provide the best service it can in order to keep them happy, and using the available data to improve its own services is the right move. On the other side, customers strengthen their trust in the company when they see the additional effort to tailor the provided service, even though it might end up lowering the service provider's revenue, because it brings much greater customer satisfaction and happiness. Then there is no reason to leave. 👉🏻 Win : Win

Data monetization is the new Holy Grail of data apps

Everyone speaks about it, everyone wants it, but in the end not many corporate projects actually achieve it. We are talking about data monetization. Many projects I have been part of over the last 6 or 7 years have had it in the pipeline as something nice to have, a point we wanted to reach once we finished this and that. We never reached that point.

Even though our use case is more than suitable for monetization, you might be saying that this is not your case; you might have data where the potential is not clearly visible. But I believe that data monetization is much easier in today's cloud world, especially with the features Snowflake offers to support it, such as Snowflake Data Marketplace or Snowflake Data Exchange. Maybe you do not have a cool project and unique data, but you definitely have something that is important for your partners or customers. Thanks to those features it is easier than ever to share your data, and it is now even possible to offer it publicly. There may be other organizations and projects that would consider your data interesting and that it could help to solve their own challenges. It could be just one of your dimensions, just a small dataset.
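To make the sharing part concrete, here is a minimal sketch of Snowflake Secure Data Sharing driven from the Python connector. The database, schema, table, share, and consumer account names are all hypothetical placeholders; the statements create a share, grant it read access to one table, and make it available to another Snowflake account.

```python
# Sketch: expose a single table to a partner account via Snowflake
# Secure Data Sharing. All object and account names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    role="ACCOUNTADMIN",  # a role with privileges to create shares
)

statements = [
    "CREATE SHARE IF NOT EXISTS roaming_insights_share",
    "GRANT USAGE ON DATABASE telco TO SHARE roaming_insights_share",
    "GRANT USAGE ON SCHEMA telco.usage TO SHARE roaming_insights_share",
    "GRANT SELECT ON TABLE telco.usage.roaming_monthly TO SHARE roaming_insights_share",
    # The consumer account identifier below is a placeholder.
    "ALTER SHARE roaming_insights_share ADD ACCOUNTS = partner_account",
]

with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)
conn.close()
```

The consumer then creates a read-only database from the share on their side; no data is copied or moved, which is what makes this model so attractive for monetization.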

Next Move

Over the last 14 months we have built a robust foundation which is prepared to serve thousands of customers. The solution is already well secured with custom data encryption and data separation thanks to Snowflake reader accounts. The whole architecture is ready to scale to meet customer demand, with scaling prepared in all layers: ECS auto-scaling for our application Docker image, and Snowflake's scaling possibilities, where we are ready to scale up (a bigger virtual warehouse) as well as scale out thanks to multi-cluster warehouses. But we are not done yet. There are plenty of things waiting for us.
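To show what "scale up" versus "scale out" means on the Snowflake side, here is a small sketch. The warehouse name and sizing values are illustrative only, and multi-cluster warehouses require the Enterprise edition or higher.

```python
# Sketch: scaling a Snowflake virtual warehouse.
# Warehouse name and sizing values are illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    role="SYSADMIN",
)

with conn.cursor() as cur:
    # Scale up: a bigger warehouse for heavier queries.
    cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'LARGE'")

    # Scale out: let Snowflake add clusters under concurrent load
    # (multi-cluster warehouses need Enterprise edition or higher).
    cur.execute(
        "ALTER WAREHOUSE reporting_wh SET "
        "MIN_CLUSTER_COUNT = 1 MAX_CLUSTER_COUNT = 4 SCALING_POLICY = 'STANDARD'"
    )
conn.close()
```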

More automation

We want to slowly move further into CI/CD and DataOps, which requires more automation on our end. We have already focused on automated testing, where we use Selenium and Jupyter Notebooks. On a daily basis we run defined test cases to validate the application functionality as well as data correctness. There is still room to improve our data quality verifications by leveraging the features Dataiku offers for this (metrics, checks, statistics) or by adding another specialized tool such as Deequ.
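As a simplified stand-in for what Dataiku checks or Deequ provide, here is a sketch of a daily data quality check in plain Python with pandas. The column names and thresholds are hypothetical; the point is to fail loudly when basic expectations about the loaded data are violated.

```python
# Sketch: basic daily data quality checks on a loaded invoice dataset.
# Column names and thresholds are hypothetical placeholders.
import pandas as pd

def run_quality_checks(df: pd.DataFrame) -> list[str]:
    """Return a list of human-readable failures; an empty list means OK."""
    failures = []
    if df.empty:
        failures.append("Dataset is empty")
    if df["invoice_id"].duplicated().any():
        failures.append("Duplicate invoice_id values found")
    if df["amount"].isna().mean() > 0.01:
        failures.append("More than 1% of amount values are missing")
    if (df["amount"] < 0).any():
        failures.append("Negative invoice amounts found")
    return failures

if __name__ == "__main__":
    # In the real pipeline the data would come from Snowflake or Dataiku.
    df = pd.DataFrame(
        {"invoice_id": [1, 2, 3], "amount": [100.0, 250.5, 80.0]}
    )
    problems = run_quality_checks(df)
    if problems:
        raise SystemExit("Data quality checks failed: " + "; ".join(problems))
    print("All data quality checks passed.")
```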

In terms of the deployment pipeline, we need to improve the deployment of Snowflake objects, which is currently heavily manual, and the situation is very similar with Dataiku projects.
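One possible direction, sketched here only as an assumption about how this could be automated (similar in spirit to tools like schemachange), is a small script that applies versioned SQL files to Snowflake in order. The folder layout and file naming are illustrative.

```python
# Sketch: apply versioned SQL migration files to Snowflake in order.
# The migrations/ folder layout and naming (V001__*.sql) are illustrative.
from pathlib import Path
import snowflake.connector

MIGRATIONS_DIR = Path("migrations")  # e.g. V001__create_schema.sql, V002__add_view.sql

conn = snowflake.connector.connect(
    account="my_account", user="my_user", password="my_password",
    role="SYSADMIN", database="TELCO",
)

with conn.cursor() as cur:
    for sql_file in sorted(MIGRATIONS_DIR.glob("V*__*.sql")):
        print(f"Applying {sql_file.name}")
        # Run each statement in the file separately.
        for statement in sql_file.read_text().split(";"):
            if statement.strip():
                cur.execute(statement)
conn.close()
```

Running such a script from the CI/CD pipeline would replace the manual clicking and make Snowflake deployments repeatable and reviewable.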

UI Improvements

There is always something to improve in the UI to make the app more usable. We are planning to work more on the filter area, where we could offer saved filters or more intuitive filtering. We also have plans for improvements in the app navigation and home page organization.

Data Science

We have one of the best tools for data science and machine learning as part of our architecture, but we have not been using its ML capabilities yet. There is a lot of room for ML use cases.

New dashboards

We are constantly getting customer feedback and requirements for new features. There will definitely be more dashboards.

Wrap-up

And that's it! I wanted to show how a small idea at the beginning can grow into a real product within a year, thanks mainly to cloud-based technologies, a decoupled architecture approach, and a DataOps way of working. I think we have been able to combine the right tools and services to create a great product in a reasonable amount of time. Starting with AWS, the industry leader in cloud infrastructure with hundreds of services to choose from. Then Snowflake, a simple, scalable, cost-effective DWH as a service with many great features and possibilities, where the scalability works like a charm! With Dataiku Data Science Studio we can create data pipelines quickly and easily. And finally Apache Superset, an open-source reporting platform with zero license cost where you can build whatever you want because it is open source. 🤟🏻

Last but not least, nothing would be possible without the great team behind the scenes! So thank you Ruchi, Gustavo, Arul and Jan. It is really a pleasure to be part of the team. 🙏 And thanks to Heine for his endless support, without which there would be no project at all.
