Did you know that a new feature was recently rolled out for Apache Beam that lets you execute SQL directly in your pipeline? No? Well, don’t worry, folks, because I missed it too. It’s called Beam SQL, and it looks pretty darn interesting.
In this article, I’ll dive into this new feature of Beam and see how it works by building a pipeline that reads a data file from GCS, transforms it, and then performs a basic calculation on the values contained in the file. Far from a complex pipeline, I agree, but you’ve got to start somewhere, right?
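To give you a flavour of what that looks like, here’s a minimal sketch using the Python SDK’s SqlTransform (the original walkthrough may well use the Java SDK instead; the bucket path, column names and query below are made up for illustration, and the cross-language SQL transform needs a recent Beam release with Java available at runtime):

```python
import apache_beam as beam
from apache_beam.transforms.sql import SqlTransform

with beam.Pipeline() as pipeline:
    rows = (
        pipeline
        # Hypothetical CSV of "product,amount" lines sitting in a GCS bucket.
        | "ReadFromGCS" >> beam.io.ReadFromText("gs://my-bucket/sales.csv")
        # Turn each line into a schema'd Row so SQL knows the column names/types.
        | "ToRows" >> beam.Map(
            lambda line: beam.Row(
                product=line.split(",")[0],
                amount=float(line.split(",")[1]),
            )
        )
    )

    # Run SQL directly against the PCollection (exposed as PCOLLECTION).
    totals = rows | "SumPerProduct" >> SqlTransform(
        "SELECT product, SUM(amount) AS total "
        "FROM PCOLLECTION GROUP BY product"
    )

    totals | "Print" >> beam.Map(print)
```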
It’s no secret that I’m something of a BigQuery fanboy. I’ve been using it for over five years now, and IMHO it’s still the best tool on the GCP stack. Don’t let those Kubernetes/GKE folks tell you otherwise. Pfft.
With the start of the new year, I thought it would be a good idea to look back at the year that was for BigQuery, and compile a list of what I think were the most prominent and exciting releases and updates throughout 2018.
Note: each date/heading in the list below provides a link to the release notes for…
Update 28.10.2018: I’ve changed the solution so that it now automatically detects and reads the schema of the source table. This means you don’t need to bother specifying the schema in the YAML config anymore. However, it doesn’t support complex schemas with nested fields yet.
Update 05.11.2018: I’ve added the ability to automatically create the target dataset in BigQuery in the correct location/region.
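For illustration, here’s roughly what those two changes boil down to with the BigQuery Python client. This is a sketch rather than the actual code from the solution, and the project, dataset, table and location names are hypothetical:

```python
from google.cloud import bigquery

client = bigquery.Client()

# Look up the source table's schema instead of requiring it in the YAML config.
# (Hypothetical project/dataset/table names.)
source_table = client.get_table("my-project.source_dataset.sales")
schema = source_table.schema  # list of SchemaField objects

# Create the target dataset in the correct location/region if it doesn't exist yet.
target_dataset = bigquery.Dataset("my-project.target_dataset_eu")
target_dataset.location = "EU"  # must match where the copied data should live
client.create_dataset(target_dataset, exists_ok=True)
```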
A few weeks ago this article popped up on the Google Cloud blog. It describes a solution that uses Cloud Composer (Google’s fully-managed Apache Airflow service) to copy tables in BigQuery between different locations.
You see, currently…
“This story was based on fact. Any similarity with fictitious events or characters was purely coincidental.” — Richard Linklater
On a recent GCP project, a customer did say,
“Can you spin up a pipeline by the end of the day?”
One simple requirement, easily understood,
Analyse live tweets with BigQuery under the hood.
Now I know what y’all thinking, “just use GKE!”,
But spinning up Kubernetes wasn’t to be.
Anything but PaaS frankly wasn’t allowed,
Never to tell why, is something I’ve vowed.
It could never be done, they did nervously claim,
“Nonsense, challenge accepted!”, I did joyfully exclaim. …
A lot of the projects that we work on are focussed on data ingestion and analytics on Google Cloud Platform (GCP). Being lazy (or smart?), I always try to use as many of the PaaS and SaaS offerings on the Google stack as possible to make my life easier.
This saves lots of time, and allows me to focus on the problem rather than toiling in the muddy fields of infrastructure. Who the hell wants to be spinning up VMs in this day and age anyway, huh?
In this post, I’ll describe how you can use Google’s Cloud Build tool…
Some time ago I tweeted my tips for using BigQuery in the enterprise. For some reason people liked it, and it got quite a lot of attention. So, in the interest of posterity, I wanted to make these tips easier to find, hence this article.
I’ll continue to do this in future for other stuff too. Also, if you’ve got some of your own tips to add to this list, then please feel free to comment, and I’ll add them — but only if they’re as bad as mine.
✅ Export logs from all…
The last time I attended Google’s annual cloud conference was in 2014. Back then it was actually called something completely different, ran for just one day, and had just 250 (invite-only) attendees. Ah, the good auld days of yonder years!
Anyway, this year it spanned a whopping three days (excluding peripheral events like the Partner and Community Summits), and the estimated number of attendees was 20,000. Yeah, I think it’s safe to say that this Google Cloud thing is kind of a big deal now!
In this article, I’d like to share my highlights and key takeaways…