Google Cloud Next ’19: an attendee’s perspective

Matthias Baetens
Google Developer Experts
13 min read · Apr 28, 2019

Two weeks ago, more than 35,000 people gathered in San Francisco for Google Cloud Next 2019, the yearly conference about the latest in Google Cloud.

As a Google Developer Expert, I had the pleasure of attending, mingling with loads of interesting people (including the other Google Developer Experts, the Google Developer Relations team, Googlers and lots of awesome open-source people) and being present at the keynotes where new products and services were announced, as well as a whole bunch of breakout sessions to learn what Google and other companies do with all this technology.

In this blogpost I’ll give an overview of my highlights of the conference, together with my takeaways and links to extra resources, structured as follows:

  • Keynote announcements
  • Breakout session summaries
  • Extracurricular activities

I hope people can use this blogpost to find out what is going on in Google Cloud, to learn more about GCP, and to be inspired to get involved or build something themselves.

Keynotes

What I loved about the keynote was how central open source was, starting with the announcement of Anthos.

Anthos allows you to deploy, run and manage your applications on-premises and in the cloud, and not only on Google Cloud: the demo in the keynote was actually running on AWS, which makes the ‘develop once, run anywhere’ dream come true.

Another open-source highlight was the strategic partnerships with Confluent, MongoDB, Elastic, Neo4j, Redis Labs, InfluxData and DataStax, whose offerings are now tightly integrated into GCP in terms of running, billing and support.

Other interesting announcements:

  • Cloud Run: serverless containers built on Knative!
  • BigQuery BI Engine: sub-second query response times, with a fully managed, in-memory, high-concurrency system for data visualisation.
  • Data Fusion: a fully managed, codeless data integration service based on the open-source project CDAP.
  • AutoML Tables: automated predictive insights on structured data.
Key take-away from the Developer Keynote

A full list of the announcements made can be found on the Google Cloud blog.

Sessions

Next to attending the keynotes, there were the breakout sessions, where further announcements were made, product demos were given, best practices were shared and other companies talked about their use of Google Cloud Platform. In this section, I share the highlights and my takeaways from some of them.

Where available, I included links to the recording, the speakers’ Twitter pages and useful external resources.

DevOps vs SRE: Competing Standard or Friends?

This was the first session I attended and it remained one of my favourites even by the end of the conference. It was delivered by the great Seth Vargo, part of the Developer Relations team at Google.

The session started by answering the questions ‘What is Dev?’ and ‘What is Ops?’. Dev is developing code, writing software and pushing new features, with an emphasis on agility, while Ops keeps the production systems that run that software in a desirable state (“it is running, please don’t touch”).

Historically, developers sat closer to the business and, as the business wants, kept pushing out code and throwing it ‘over the wall’ to the Ops people, who sit further away from the business (e.g. in data centres) but are the ones that get paged when something fails (not only hardware, but also software bugs). It is thus in their interest to have well-written, tested and properly deployed code.

The practice of DevOps aims to break down this ‘wall’ and the ‘throwing code over’ mentality with a few concrete concepts: reduce organisational silos, accept failure as normal, implement gradual change, leverage tools & automation, and measure everything.

SRE (Site Reliability Engineering) is a concrete and prescriptive form of DevOps that came out of Google’s experience running products like Search and Gmail in production.

class SRE implements DevOps

It makes the concepts of DevOps concrete with practices you can apply:

  • Reduce organisational silos: have engineering, SRE, and product working together.
  • Accept failure as normal: implement SLOs; how available and reliable should the system be? (It depends on the job at hand.)
  • Implement gradual change: reduce the cost of failure (it is easier to debug and roll back 10 lines of code than 100,000).
  • Leverage tools & automation to make things repeatable: get rid of toil and bring value to the system in the long run.
  • Measure everything: first and foremost tangible things relevant to customers like response time, but also things like CPU usage etc. in case you need to find the root cause.
Concepts

When developing a new service, you will want to decide on the following few things:

  • SLI (service level indicator): a binary value; is the criterion for a specific service met or not?
  • SLO (service level objective): your target for the SLI over time; what portion of the time do you want the SLI to be met?
  • SLA (service level agreement): the business agreement on top of the previously established metrics.

Each of these concepts involves different people from the organisation:

  • SLIs involve software engineers, site reliability engineers and product managers
  • SLOs involve site reliability engineers and product managers
  • SLAs involve sales and the customers

Naturally, things go wrong from time to time. This is where the concept of an ‘error budget’ comes in: how many failures can we still accept within the SLO for our service? This depends on the risk you can accept, which in turn determines your SLO (how many nines you want to offer depends on how critical the application is, how time-sensitive delivery is, …). Once you go over your error budget, the development effort needs to shift from delivering new features to improving reliability and availability, until your error budget is replenished.
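To make these concepts a bit more tangible, here is a minimal sketch (in Python, with made-up numbers and a hypothetical availability SLI) of how a measured SLI and the remaining error budget could be derived from request counts:

```python
# Minimal sketch: availability SLI, SLO and error budget (hypothetical numbers).
total_requests = 1_000_000     # requests served this month
failed_requests = 350          # requests that did not meet the SLI criterion

slo_target = 0.999             # SLO: 99.9% of requests should meet the SLI

sli = (total_requests - failed_requests) / total_requests  # measured SLI
allowed_failures = total_requests * (1 - slo_target)       # error budget in requests
budget_left = allowed_failures - failed_requests           # remaining budget

print(f"SLI: {sli:.4%}, error budget left: {budget_left:.0f} requests")
if budget_left < 0:
    print("Error budget exhausted: focus on reliability instead of new features.")
```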

The end of the session went into ‘toil’: manual, repetitive work that is devoid of long-term value and highly automatable, and when automating it is a good or a bad idea. In general, you want to automate as much as possible, but toil can also have advantages, and when thinking of automating something you should look at the ROI (e.g. automating a job that needs to be done once a year, takes 15 minutes and would take 20 hours to automate is not a good ROI; in that case you’d rather document the job and share the knowledge).
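As a back-of-the-envelope check, the ROI argument boils down to comparing the one-off automation effort with the recurring manual effort; a tiny sketch using the numbers from the example above:

```python
# Break-even check for automating a yearly 15-minute task (numbers from the example above).
automation_cost_hours = 20          # one-off effort to automate the job
manual_cost_hours_per_year = 0.25   # 15 minutes, once a year

years_to_break_even = automation_cost_hours / manual_cost_hours_per_year
print(f"Break-even after {years_to_break_even:.0f} years")  # 80 years: document it instead
```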

Find more information about SRE at google.com/sre or read the free books.

BigQuery GIS — A GeoVisual Exploration

This session was held in the DevZone and covered the datatypes and query capabilities in BigQuery related to geospatial data.

BigQuery GIS demo

Using latitude and longitude, you can describe a point on earth, but for describing more complex shapes like a polygon, BigQuery has a `GEOGRAPHY` datatype, supporting GeoJSON, WKT (well-known text) and WKB (well-known binary).

Combining this with the power of BigQuery (being able to cope with huge amounts of data in a massively parallel way) enables you to join and analyse data from a geospatial point of view, e.g. ask how many data points you have within a certain range of another data point, using native geography functions in the WHERE clause of your SQL query.
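As an illustration, here is a minimal sketch of such a query run from Python; the project, dataset, table and coordinates are made up, while ST_GEOGPOINT and ST_DWITHIN are BigQuery GIS functions:

```python
# Minimal sketch: count points within 1 km of a reference point (hypothetical table).
from google.cloud import bigquery

client = bigquery.Client()
query = """
SELECT COUNT(*) AS nearby_points
FROM `my-project.my_dataset.my_points`  -- hypothetical table with a GEOGRAPHY column `location`
WHERE ST_DWITHIN(location, ST_GEOGPOINT(-122.39, 37.79), 1000)  -- within 1,000 metres
"""
for row in client.query(query).result():
    print(row.nearby_points)
```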

BigQuery Geo Viz enables you to quickly visualise the results of your query on a map, making the results more intuitive and business decisions easier to take.

A quick walkthrough can be found here.

Securing serverless by breaking in

This session was held in #DevZone as well.

Securing serverless

The session started with an introduction to serverless, framing it against other architectures in the cloud: monolithic (the cloud handles the hardware), containers (the cloud handles the VM), serverless (the cloud handles the container, …).

Even when deploying just a tiny app of 200 lines to a serverless runtime, you have to realise that, with all its dependencies, your app is probably a lot bigger in terms of lines of code. And this raises the question: are all those lines actually secure as well?

Key takeaways were:

  • Check vulnerabilities within your dependencies: your code might be fine, but someone else’s might not be! (See the sketch after this list.)
  • Deploy granular functions and permissions.
  • Don’t rely on function ordering: you need to secure every function separately and not only the ones that get exposed.
  • Worry about all functions.
  • Don’t rely on immutability: assume servers can be reused.
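For that first point, here is a minimal sketch of what a dependency check in CI could look like for a Python function, using pip-audit (the tool choice is mine, not something prescribed in the session):

```python
# Minimal sketch: fail the build if any installed dependency has a known vulnerability.
import subprocess
import sys

result = subprocess.run(["pip-audit"], capture_output=True, text=True)
print(result.stdout)
if result.returncode != 0:
    sys.exit("Vulnerable dependencies found: fix them before deploying.")
```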

Full slides here.

Meet the Authors — Go language

In this session (again in the DevZone; yes, I hung out there a lot), the panel consisted of the people behind Go. The session was mostly structured as a Q&A with both prepared questions and questions from the audience, which made it hard to capture a lot of information, and unfortunately I haven’t found the recording.

Meet the Authors — Go language

What I was able to capture:

  • Why use Go? Easier to manage, easy to learn, very performant, good to use on cloud (Kubernetes is written in it). To sum it up: fast & fun.
  • What is new? Warming up for moving to Go 2! Listening to community input and contributions, both additions to and removals from the language, improvements to dependency management, checking and responding to errors, deciding on the inclusion of generics or not, and fewer things in the standard library.
  • What was the motivation for making Go? Do better than Java and C++, make a compact, performant language in which concurrency is easy to do.
  • Biggest challenge in Go? Saying no (to new features).

For a small intro to Go, check this article by Hackernoon on Medium.

How Twitter replicates Petabytes of Data to Google Cloud Storage

In this session, Lohit, Senior Staff Software Engineer at Twitter, went into the architecture of the data infrastructure for analytics at Twitter. This infrastructure is mostly based on Hadoop clusters and records over 1.5 trillion events every day. Several components were introduced, among other things the FileSystem abstraction, the Data Access Layer (DAL, containing metadata) and a front-end for exploring data sets (Eagle Eye).

GCP at Twitter

On top of ViewFS, Twitter built a replication service where the destination is responsible for replicating and syncing with the source. They decided to extend this service with replication to Google Cloud Storage in order to leverage Google Cloud’s data processing capabilities like BigQuery. In the process they relied heavily on the Cloud Storage connector for Hadoop. The total move involved over 300 PB of storage.
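Once the data sits in Cloud Storage, services like BigQuery can pick it up directly. A minimal sketch (bucket, path and table names are hypothetical, and the file format is an assumption) of loading replicated files into a BigQuery table from Python:

```python
# Minimal sketch: load replicated Avro files from Cloud Storage into BigQuery.
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.AVRO)

load_job = client.load_table_from_uri(
    "gs://example-replication-bucket/events/2019/04/*.avro",  # hypothetical path
    "my_dataset.events",                                      # hypothetical destination table
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish
print(client.get_table("my_dataset.events").num_rows, "rows in the table")
```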

More info on the move on this page.

Chaos: Breaking your systems to make the unbreakable

This session was about chaos engineering and what it is about: systems are in a constant state of partial failure (failure is not binary), the best way to avoid failure is to fail constantly, and failure is there to learn from.

So how do you practice chaos (without being a jerk)? You need to establish rules to keep it fun and educative.

1. Keep it short: 90 minutes should be enough.

Spend 30 minutes on planning:

  • Schedule it (when are you going to do it?)
  • Pick tests (what are you going to break?)
  • Write down what you expect to happen (what should happen?)
  • What will you do when things go wrong (what is the fallback plan?)
  • Share the document with the engineering organisation

50 minutes are allocated to playing (fun part — break things and see what happens):

  • Start in staging, run it in production (off-peak), later run it in production (primetime).
  • Announce that you will start in group chat.
  • Maintain discussion in group chat.
  • Monitor for outages.
  • Run your tests and take notes.

Add on 10 minutes for reporting:

  • Create tickets to track issues that need work
  • Write a summary & key lessons
  • E-mail to engineering
  • CELEBRATE!

2. Have a small team (usually 2 people)

  • A subject matter expert: the person that built the service
  • An SRE: the real expert on keeping things up and running
  • (Optional: junior engineer or developer for mentoring and a fresh view on the systems)

3. What are the levels you want to play at — there are different options:

  • Level 0: Terminate service. Block access to 1 dependency
  • Level 1: Block all dependencies
  • Level 2: Terminate host
  • Level 3: Degrade environment (e.g. network slow, dropping packets, malformed information)
  • Level 4: Spike traffic (DDOS yourself)
  • Level 5: Terminate region/cloud: failover to other cloud or on-prem

The session ended with a demo of a chaos experiment in which the above concepts were applied — check it out here.
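To give a flavour of what a level-0 experiment from the list above could look like in practice, here is a minimal, hypothetical sketch (the service name, health endpoint and the way the dependency is stopped are made up and will differ per environment):

```python
# Minimal sketch of a level-0 chaos test: stop one dependency, observe the service, restore.
import subprocess
import time
import urllib.request

SERVICE_URL = "https://staging.example.com/healthz"  # hypothetical health endpoint
DEPENDENCY = "recommendations"                       # hypothetical dependency container

def service_healthy() -> bool:
    try:
        return urllib.request.urlopen(SERVICE_URL, timeout=5).status == 200
    except Exception:
        return False

print(f"Expectation: the service stays healthy without '{DEPENDENCY}'")
subprocess.run(["docker", "stop", DEPENDENCY], check=True)   # break one dependency
time.sleep(30)                                               # give the system time to react
print("Observed:", "healthy" if service_healthy() else "UNHEALTHY, roll back and take notes")
subprocess.run(["docker", "start", DEPENDENCY], check=True)  # restore the environment
```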

Advances in Stream Analytics

I covered this session quite extensively on Twitter: lots of exciting announcements from the Google Cloud Dataflow team and learnings from an Apache Beam deployment at Lyft.

Towards Zero Trust at GitLab.com

This session dealt with a topic that has been grabbing my attention lately and that I was keen to learn more about: how modern companies do security.

Traditional companies have a ‘hard on the outside, soft on the inside’ approach to security. As an industry, we know this does not work, but it is still how most businesses are set up.

Zero Trust: all devices and users that are trying to access an endpoint need to be authorized and authenticated to do so. All the decisions involved in this process are dynamic and risk-based.

It is not a product but a process; it is not new, and it is often not implemented before a major breach happens (only ~20% of cloud-native companies have implemented or started implementing zero trust).
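To illustrate the idea that every request is authenticated, authorised and evaluated on risk, here is a toy sketch of the general principle (not GitLab’s implementation; all checks and names are made up):

```python
# Toy sketch of a zero-trust style check: nothing is trusted just because it
# originates from the 'inside' network; every request is evaluated.
from dataclasses import dataclass

@dataclass
class Request:
    user: str
    role: str
    mfa_passed: bool
    device_compliant: bool    # e.g. disk encrypted, OS patched (hypothetical posture check)
    data_classification: str  # classification of the data being accessed

def allow(req: Request) -> bool:
    # Identity and device posture are verified on every request.
    if not (req.mfa_passed and req.device_compliant):
        return False
    # Access depends on role and data classification, not on network location.
    allowed = {"engineer": {"public", "internal"}, "sre": {"public", "internal", "restricted"}}
    return req.data_classification in allowed.get(req.role, set())

print(allow(Request("alice", "sre", True, True, "restricted")))    # True
print(allow(Request("bob", "engineer", True, False, "internal")))  # False: non-compliant device
```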

What are the benefits?

  • Lateral movement is much harder (services are separate perimeters)
  • Stolen credentials are less valuable
  • Known vulnerabilities that are easy to exploit will be rarer
  • Non-targeted attacks have less value (resulting in higher cost for the attacker)

Before GitLab.com embarked on their zero-trust journey, they already had a few things in place: a data classification policy, GCP security guidelines (enforced by Forseti), an internal acceptable use policy so they don’t have to rely on good intentions, and an HR system to know who sits where in the organisation in order to grant the appropriate access.

They then went into the 3 problems they solved on their journey to zero trust:

1. Managing User Identity and Access: answering a series of questions:

  • How do you verify endpoint integrity?
  • Is the person accessing data appropriate to the role?
  • How do you streamline onboarding/offboarding?
  • How do we minimize credential theft?
  • How are we enforcing our data classification policy?

2. Securing our applications:

  • Shift security to the left in the pipeline and merge requests by educating developers and scanning every commit.
  • Applying Binary Authorization: only deploying trusted container images, removing the human from the deploy process, and signing and annotating images during the CI phase.
  • Key management service
  • User and entity behaviour analytics

3. Securing our Infrastructure:

  • Vulnerability management: deploying patches in a timely manner
  • Who owns what asset: answered by the asset database
  • How to mitigate abusive activities?
  • How to make it harder for the attacker to move laterally?
  • Apply Google’s security best practices to GitLab.com
  • Enforcing policies in order to avoid having to rely on best intentions

How was the journey organised? Bucketise the different parts:

  • GitLab.com: infrastructure that handles customer data (centrally-managed)
  • Endpoints: user and employee laptops (individually-managed)
  • Backend infrastructure: 3rd party applications

Implementation across these buckets was done in parallel.

Lessons learned? It is an ongoing implementation; ordering matters (some implementations will facilitate others); UX is important, as people need to be able to get their work done; automation is key to scale; and zero trust is personal to your company and your requirements.

Slides can be found here and I have also enjoyed reading through the papers Google has put on their website here.

Extracurricular activities

Of course, the conference was a great opportunity to meet, learn from and hang out with people from all over the planet. A couple of highlights, best captured in pictures:

GDE karaoke at the Community dinner
Beam Summit organisation in the after hours
Hanging out with (old and new) open-source friends!
Hanging out with Gwen Stefani
Visiting the Google San Francisco offices and making more Apache Beam friends
Beam meetup in the Community Corner at DevZone

Wrapping up

In case you are interested in learning more about other sessions and Google Cloud Platform in general, these are good resources to check out:

  • Day-per-day wrap-ups on Google Cloud blog: Day 1, Day 2 & Day 3
  • Twitter: Start following the Google Cloud and Google Cloud Platform handles on Twitter.
  • You can check the playlist of Google Cloud Next ’19 sessions on YouTube.
  • Google Blog nicely sums up the 122+ announcements made at Google Cloud Next ‘19.
  • Alexis Moussine-Pouchkine, part of the DevRel team at Google, writes up weekly blogposts called TWiGCP (this week in Google Cloud Platform), and he did one for Next ’19 as well: click.
