Photo credit: Phil Goerdt

No, the Cloud is not too expensive

You’re just doing it wrong

Phil Goerdt

--

I recently introduced my Damn Dram project to the world, right here on Medium, and I promised to walk through several aspects of this project from a technical perspective. (If you’re here just for the whiskey, I recommend you head over to the Damn Dram; there’s a lot more fun content that won’t sound like the water cooler near where all the IT guys and gals sit.)

But, if you’re here for the technical stuff, I’d like to walk through how I went about making this project a reality. Like I said in the first post, this series is about finding value, learning new things, and giving some real talk in a world that is full of filtered pictures, POCs that only work in a vacuum, and everyone presenting and blogging on the same error-free datasets from Kaggle.

In the beginning…

This project was born out of a few things coming together. The most influential of these was a whiskey tasting party that my good friend Gavin and I hosted in late 2017. We created scoring cards, invited some people over, sampled a ton of whiskies, and gave out prizes to the people who brought the crowd favorites. It was a great time, and the basis of the scoring system I use was created and tested that evening.

That’s my kind of party!

The other thing that pushed me to do this is that I had been trying new whiskies on a regular basis, but had begun to run into the problem of having tried a whiskey once, years ago, and remembering almost nothing about it. That’s a problem; no way am I going to risk my hard-earned cash on something when I can’t remember whether it was any good!

So, the idea to log every unique whiskey I tried in 2018 was born, and pretty soon it stopped being a quick tasting whenever I happened across a new whiskey and turned into a bit of an obsession. Those who know my competitive nature knew that once I had the idea in my head of rating at least 100 unique whiskies by the end of the year, there was no stopping me.

Back to basics

To capture the data, I created a Google Sheet and wrote tasting results into it. It’s easy to get to from both mobile and laptop, easy to enter data and use, and it’s free. Once I had a decent number of whiskies sampled, I created a Google DataStudio report that pointed to the Sheet. Easy peasy. I could take a look at what, where and who was scoring the best, whether I had any outliers, and whether there were any general trends.

Google Sheets to DataStudio. Super easy.

As you may imagine, this setup got more and more annoying as I got more and more data into the sheet. I have a terrible time typing anything right the first time around to begin with, so you can imagine my frustration using Google Sheets on my phone on a regular basis. Additionally, there were limits to what I could do in both Sheets and DataStudio. Ranking, complex groupings or subqueries, and data-cleansing activities are just not at home (or even possible) in either of those tools.

Time to think of something new.

Use the compute, Phil

Once I decided that I was going to try to do something a bit more robust with the whiskey data I was collecting, it’s easy to guess what I did next. It’s exactly what any GCP loving data person would do: put the data into BigQuery.

I’d be lying if I said I never did this before the Damn Dram project.

This initial pass of moving the data into BigQuery was relatively simple. BigQuery allows for externally sourced tables from places like Google Cloud Storage and Google Sheets, meaning that I can create a table in BQ simply by pointing it to the shareable URL of a Google Sheet. Simple and effective. And once the data was in BQ, it was easy enough to write some views to handle the use cases I was looking for: ranking ascending and descending, aggregations, and some general data cleanup. With the views created, plugging it all into DataStudio for reports and visualizations is easy.
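
To make the mechanics concrete, here’s a minimal sketch of that setup using the google-cloud-bigquery Python client: an external table pointed at the Sheet, plus one of the ranking views Sheets couldn’t do. The project, dataset, table, and column names are placeholders, not the actual Damn Dram schema.

```python
# Sketch: a Sheets-backed external table in BigQuery, plus a ranking view.
# All project/dataset/table/column names below are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-whiskey-project")  # hypothetical project ID
dataset_ref = bigquery.DatasetReference("my-whiskey-project", "whiskey")

# External table that reads straight from the shared Google Sheet.
external_config = bigquery.ExternalConfig("GOOGLE_SHEETS")
external_config.source_uris = [
    "https://docs.google.com/spreadsheets/d/<sheet-id>"  # shareable Sheet URL
]
external_config.autodetect = True
external_config.options.skip_leading_rows = 1  # skip the header row

sheet_table = bigquery.Table(dataset_ref.table("tastings_sheet"))
sheet_table.external_data_configuration = external_config
client.create_table(sheet_table, exists_ok=True)

# One of the views Sheets and DataStudio couldn't handle on their own: ranking.
ranked_view = bigquery.Table(dataset_ref.table("tastings_ranked"))
ranked_view.view_query = """
    SELECT
      whiskey_name,
      distillery,
      score,
      RANK() OVER (ORDER BY score DESC) AS overall_rank
    FROM `my-whiskey-project.whiskey.tastings_sheet`
"""
client.create_table(ranked_view, exists_ok=True)
```

(One gotcha: querying a Sheets-backed table needs credentials with Drive access in addition to BigQuery; in the console this just works with your own account.)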

While moving the data to BigQuery worked nicely (for the most part, and more on that in a bit), I still had an issue with data input. At this point I began to think that this project was going to continually evolve into something greater than I had initially planned, so I decided to take a step back and think through what my objectives were, put some guard rails around them, and come up with a plan to execute.

On Cloud 9?

When I took a step back, I decided that I wanted this to be something I could use to share my passion for whiskey with other people. Keeping that in mind, I set out several principles to abide by when planning the rest of this project.

  1. Keep it as simple as possible.
  2. Keep it as cheap as possible (with bonus points for keeping it free).
  3. Keep it as low maintenance and fault tolerant as possible.
  4. It must scale automatically, or at least be built in a scalable way.
  5. It must be able to be delivered quickly (by the end of 2018 if possible).
  6. Keep it all in GCP/G Suite, if possible.

Let’s stop and think about why putting some fences around this before moving forward was a good idea.

  1. This is a project that I am creating for myself (and a small audience), and out of my own wallet. Of course I want to keep it as cheap as possible.
  2. Since it is something that I will be maintaining by myself, I don’t want to worry about this solution being inflexible. No one likes opening their email to see automated error reports… well, at least I don’t. I like to avoid these types of situations, so building flexibly and in a scalable and fault tolerant way is key.
  3. I wanted to close out 2018 by finishing this project and making room for new goals in 2019. Keeping everything in one platform that I already know reduces complexity and increases productivity. Hence one of the (many) reasons to go with GCP.

The points above were my reasons for going the route that I did… but I think that these could be adapted for many situations. Pick the right tools for the job, think about design, scale and cost up front, and be willing to adapt.

With the above defined in late August of 2018, let’s think through a few ways we could go about this.

Option 1: A Custom Web App

Try this on for size: a custom app hosted in App Engine that publishes to PubSub with DataFlow as a subscriber writing to BigQuery tables. Obviously we would use DataStudio to visualize everything.

This definitely would work, and it definitely will scale. The trouble with this approach is that this project was a one-man show, and I didn’t want to waste a ton of time writing node.js or python apps for App Engine. (Admittedly, that’s not really my thing.) Also, while my data production and consumption would be low, this is too many moving parts for me to want to manage (even if it is, wait for it, serverless).

Maybe something like this?

Option 2: Hack it

The other thought that came to mind was hacking something together using products found in Google Apps/G Suite and GCP. This could look like using Google Forms for data input, feeding that data into Sheets, loading that data into BigQuery, and then viewing the data in DataStudio.

This approach takes away the need to develop a custom app for data input, and it removes some services from the architecture. With fewer services, and with Forms and Sheets both being free, I’m coming out ahead on this one. Plus, it’s also serverless.

The integration between all of these products is amazing.

Option 3: Any variation of other GCP products

It’s been said before, but there are many ways to skin a cat. There are plenty of other ways I could do this, but none seemed close to either Option 1 or 2 in terms of building something that fit within the parameters I set out to fulfill. Focusing on the objective and creating and sticking to SMART goals at the beginning allowed me to be successful. It’s easy to get bogged down in the weeds when brainstorming and too often we get attached to an idea even though it is the wrong solution. Take the time to define what success looks like; it will save you a lot of pain later.

Option 4: Don’t change anything

Of course, sticking with the status quo is always an option. (Well, not for me, but maybe for some!)

And…?

I ended up choosing Option 2. Some may be disappointed, but in reality, this is a side project that I worked on in the evenings and on the weekends. Spending a significant amount of time building something in App Engine when Forms was 1) free and 2) probably better than what I would come up with was out of the question. Additionally, using the additional services of PubSub and DataFlow would have been pure over-engineering for this project. However, since I went through this exercise, I’ve already identified a roadmap for scaling if and when the time comes. (Comment below if you’d like to see some blogs on that topic.)

Some may think that because my architecture does not contain a robust set of products, or because I didn’t write dozens and dozens of lines of code, it is irrelevant. I would argue the opposite. Just as data has become democratized in many organizations in the last 10 years, so has the general toolkit to make things happen. Any departmental team in any org (with access to G Suite and GCP) can set up something like I did and get tremendous value at little cost. So, why get upset about that?

This doesn’t mean that the story ends, though. The final architecture I came up with looks different from what I proposed above. That’s because we still need to talk about managing the data coming from Forms and Sheets into BigQuery, the ML models and data sets behind that, and hosting and sharing all of this glorious, delicious content. Many of these concepts will be covered in detail in later blogs, but here’s a summary of what I came up with.

Progressing the architecture

[App] Script it!

Some of you who have worked with external table sources in BigQuery may know that they can sometimes cause problems. In my case, I was seeing issues when creating views off of these externally sourced tables that prevented me from saving the view. If only there were an easy and cheap way to automate a job that could write those sheets into a BigQuery table…

Lucky for me, there is a way to do this. I wrote an Apps Script to query the Sheet-based BQ table and write those results into a new, internal table (which had no problems being used in a view). The cool thing about using Apps Script is that I can put these scripts and queries on a time-driven trigger to run on a schedule. Mine is set for once a day, but in theory I could have it run every minute or on set days if I wanted to.
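
My actual job is an Apps Script on that daily trigger, but the copy step it performs is essentially one query-and-materialize. Here’s the same idea sketched with the BigQuery Python client instead, using the same placeholder names as before.

```python
# Sketch of the daily materialization step: query the Sheets-backed external
# table and overwrite a native BigQuery table with the results. The real job
# is an Apps Script on a daily trigger; names here are hypothetical.
from google.cloud import bigquery

client = bigquery.Client(project="my-whiskey-project")

destination = bigquery.TableReference.from_string(
    "my-whiskey-project.whiskey.tastings"  # native (internal) table
)
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,  # full refresh each run
)

# Pull everything from the Sheets-backed table and land it in the internal one,
# which views can then use without complaint.
sql = "SELECT * FROM `my-whiskey-project.whiskey.tastings_sheet`"
client.query(sql, job_config=job_config).result()
```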

Learn ‘em good

Once all of my data truly lived in BigQuery, I decided it was time to see what I could learn from it. I used BigQuery ML to train some models on my data set. I chose BQML in keeping with the goals I set for myself above, but using Cloud ML Engine with TensorFlow models is a real possibility (and maybe in the works).
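
To give a flavor of what the training step looks like, here’s a rough BQML sketch run through the Python client: a simple linear regression over the tasting data. The model type, feature columns, and table names are stand-ins for illustration, not my actual models.

```python
# Sketch of a BigQuery ML training step, run as an ordinary query job.
# Model type, features, and table names are assumptions for illustration.
from google.cloud import bigquery

client = bigquery.Client(project="my-whiskey-project")

create_model_sql = """
CREATE OR REPLACE MODEL `my-whiskey-project.whiskey.score_model`
OPTIONS (model_type = 'linear_reg', input_label_cols = ['score']) AS
SELECT
  category,     -- e.g. bourbon, scotch, rye
  age_years,
  proof,
  price_usd,
  score         -- the label: my tasting score
FROM `my-whiskey-project.whiskey.tastings`
WHERE score IS NOT NULL
"""

client.query(create_model_sql).result()  # training runs entirely inside BigQuery
```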

To test these models, I had also written some scrapy web scrapers in python to gather data about whiskey from the web. In total, I was able to catalog around 4,000 unique whiskies to put my ML models to the test. The scrapy jobs were run locally, mostly because 1) I’m cheap (as we established before) and 2) collecting this data set was a one-time activity. After collecting the data, it was loaded into Cloud Storage, run through Cloud DataPrep for some (one-time) cleanup, and then written into BigQuery.
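
For the curious, the spiders themselves were nothing fancy. Here’s a bare-bones sketch of the shape they took; the URL and CSS selectors below are placeholders rather than any real catalog site.

```python
# Sketch of a scrapy spider for building the whiskey test set. The URL and
# selectors are hypothetical; the real spiders targeted actual catalog pages.
import scrapy


class WhiskySpider(scrapy.Spider):
    name = "whisky"
    start_urls = ["https://example.com/whiskies"]  # placeholder catalog page

    def parse(self, response):
        # One record per whiskey listing on the page.
        for listing in response.css("div.whiskey-listing"):
            yield {
                "name": listing.css("h2.name::text").get(),
                "distillery": listing.css("span.distillery::text").get(),
                "price": listing.css("span.price::text").get(),
            }

        # Follow pagination until the catalog runs out.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```

Run locally with something like `scrapy runspider whisky_spider.py -o whiskies.csv`, the output was then copied up to a Cloud Storage bucket for that one-time DataPrep pass.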

Simple. Elegant. Powerful.

If it were going to be a recurring activity, I would deploy the code and run the scrapers from Cloud Compute Engine and automatically write to Cloud Storage. From there it could go a variety of different routes to end up in BigQuery, but most likely it would be via PubSub and DataFlow jobs.

Oh, what could have been…

After the models had been trained and the testing set was determined, I ran the models to generate prediction scores, wrote those to new tables, and created visualizations in DataStudio on those findings.
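
The prediction pass is the same pattern as training: a single BQML statement that scores the scraped whiskies with ML.PREDICT and lands the results in a new table for DataStudio. Same placeholder names as above.

```python
# Sketch of the scoring step: run ML.PREDICT over the scraped test set and
# materialize predictions into a new table for DataStudio. Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my-whiskey-project")

predict_sql = """
CREATE OR REPLACE TABLE `my-whiskey-project.whiskey.predicted_scores` AS
SELECT
  name,
  distillery,
  predicted_score      -- linear_reg output column is predicted_<label>
FROM ML.PREDICT(
  MODEL `my-whiskey-project.whiskey.score_model`,
  (SELECT * FROM `my-whiskey-project.whiskey.scraped_whiskies`)
)
"""

client.query(predict_sql).result()
```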

The end? Not quite.

Here is the final architecture diagram from a conceptual perspective.

You may think that there is no way the above solution works. But believe me, it does. And now, the only thing I worry about is filling out ratings of whiskies, as opposed to worrying about having to update things and run jobs. However, once you start a project like this, there are always enhancements… right?

Some of you may be wondering how cost effective this was as well. I’ll review the GCP-related costs here and ignore other items like the domain name, SquareSpace subscription, developer time, etc. Developing all of this and getting it to what is basically a prod state cost less than $33. $33! How cool is that?!

We’ll see how much the costs fluctuate, and perhaps I’ll post some updates on that in the future, but $33 in cloud costs to develop a pretty simple app that requires almost no maintenance is basically free. Heck, it is free when you factor in the $300 sign-up credit from GCP that covered those $33. Not a bad deal.

The cloud is maligned by some for being the same cost as, or more expensive than, on-premises solutions. That can be the case, but my guess is that the technical problem at hand hasn’t been adapted to the new world of cloud. The cloud isn’t “just someone else’s computer” as the memes would have you believe. It can be someone else’s computer that is automated and where you only pay for what you use. But… achieving that state takes work, planning and knowledge of what is possible.

For me, the Damn Dram project would be unfeasible if I didn’t have the power of the cloud to back me up. And if the cloud can help one guy keep track of which whiskies he likes and what to try for (almost) free, what do you think it is capable of if you really give it a chance in your org?

I stand by the title. It isn’t that the cloud is too expensive that’s keeping you from taking the leap. And if it is, you’re doing it wrong.

Phil Goerdt is an independent consultant specializing in cloud architecture and data engineering.

--
