Cloud-native, of course. Photo by C Dustin on Unsplash

Platform thinking: do’s and don’ts when designing a cloud-native platform from scratch

Floor Eigenhuis
Published in incentro
8 min read · May 21, 2021


Fellow engineers: how does the following (usually too good to be true) tale sound to you?

You are a software engineer. You get the opportunity to design a platform that has the potential to reach millions of users. The platform plays perfectly into CSR (corporate social responsibility). You have no legacy software to work with and complete control over how you design the platform. The only requirements are that the platform is built in a future-proof way and that security and privacy are built in from the start. How does that sound to you? Well, if someone said this to me, I’d say there’s no such thing. As a software engineer, 95% of the time you have to build on whatever has been built already. Nonetheless, this is exactly the opportunity we had and I’m going to tell you aaaaaalll about it in this blog. Let’s go! 🚀

The idea.

Working at an agency that helps its customers with their digital transformations, we rarely have startups as customers. Of course, hiring external consultants costs money, which startups rarely have to spare. But once in a blue moon, there’s a startup with a really innovative idea that has managed to get some funding but has no IT knowledge. The name of that startup is The Social Handshake, and that is where we came in 😎.

We designed and built a platform called LoonGift, which makes doing good easy by enabling employees to donate part of their salary directly to a charity of their choice. The amount you want to donate is deducted directly from your pay slip, and your employer has no insight into which charities you donate to. At the end of the fiscal year, you get tax benefits on the amount you donated to charity. It’s easy for participants to sign up and to change their donations. Another benefit is that employers pay for the use of the platform, so the employee’s entire donation goes to the charity. Also, the charity cannot ask you for more money or send you any other information you haven’t signed up for, and charities pay no retention or acquisition costs.

The concept behind LoonGift is called payroll giving. Payroll giving is already popular in several European countries, such as Germany and England. Nothing like it existed in the Netherlands, however.

The how.

So how did we go about this? Given that nothing like this existed in the Netherlands yet, and that the concept had proved itself in other European countries, we had a unique opportunity: designing a potentially huge platform from scratch! Truly cloud-native. ☁️ Since we (Incentro) are a Google Cloud partner and Google Cloud Platform is AWESOME, we built everything on GCP.

We also used some techniques that were new to us. Usually, we help our customers with their digital transformation, which means working with whatever the customer already uses in their IT landscape, and that usually consists of REST APIs. In this project, we had the freedom to try something different: a gRPC backend over HTTP/2. gRPC is especially useful when microservices have to communicate with each other: it lets the backend microservices interact as if they were calling functions on a local object. Together with protocol buffers, payloads are also drastically smaller. On the downside, gRPC does not work well with frontend applications directly, so requests from the frontend had to be routed through an ESPv2 proxy, in combination with Cloud Endpoints.
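To make this concrete, here is a minimal sketch of what such a gRPC contract can look like. All service, package, and message names here are invented for illustration and not taken from the actual LoonGift codebase:

```protobuf
syntax = "proto3";

package donations.v1;  // hypothetical package name

// Generates Java classes for our services today; a future service in
// another language could run protoc on this same file to get its own
// classes from the identical contract.
option java_package = "nl.example.donations.v1";  // illustrative

// Backend services call these RPCs as if they were local methods.
service DonationService {
  rpc CreateDonation (CreateDonationRequest) returns (Donation);
}

message CreateDonationRequest {
  string employee_id = 1;
  string charity_id = 2;
  int64 amount_cents = 3;  // euro cents, avoiding float rounding
}

message Donation {
  string id = 1;
  string employee_id = 2;
  string charity_id = 3;
  int64 amount_cents = 4;
}
```

Because browsers cannot speak gRPC directly, a proxy such as ESPv2 with Cloud Endpoints translates the frontend’s HTTP requests into calls on contracts like this one.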

It’s a constant tradeoff between ‘good-enough-for-now’ and ‘future-proof’

Unfortunately, the fairytale has to end somewhere: we did not have unlimited time and resources to build this amazing platform 🥲. Like in all software projects, we had to make choices about where to focus our time and energy and which features could be delayed. Building such a platform from scratch is tricky because it’s a constant tradeoff between ‘good-enough-for-now’ and ‘future-proof’. Luckily for you, I’ll share our tips, tricks and struggles here, so you don’t have to find them out the hard way 😉.

Containers eeeeeverywhere. Photo by OSG Containers on Unsplash

Platform thinking

So, you want to design a platform that is future-proof. How do you go about this?

Make sure that the technical debt you’re creating is as replaceable as possible.

  1. Think ahead
    It’s impossible to know everything you’ll want in the future. There are multiple known unknowns, so you inherently build in some technical debt from the start. This does not mean you can just start building and see where you end up! You have to have an idea of what you’re designing for. Together with our UX and UI designers, we first made a prototype of ‘the dream’: a prototype of the complete platform, validated by potential end users. With ‘the dream’ in mind, we knew what to work towards and how functionalities should work in the future. After prototyping ‘the dream’, we could decide what our MLP (Minimum Lovable Product) was going to look like. With ‘the dream’ done, we could also make architectural and technical decisions that were perhaps not strictly necessary for the MLP but are crucial to LoonGift’s long-term success.
  2. Make the inevitable technical debt replaceable
    Even when designing with the future in mind, it’s impossible to know everything beforehand. So, given these known unknowns, make sure that the technical debt you’re creating is as replaceable as possible. You might want to replace some functionality later, maybe even in another programming language. So, like most modern applications, adopt a microservice design. Make sure that every microservice is responsible for its own data and its own data only; modifying another service’s data is never allowed. This way, you can relatively easily replace a service if needed. Keep your API documentation in one place: define your objects and endpoint definitions once, and make sure that every service ‘knows’ the objects that exist in your landscape. Ideally, choose a way to structure your data that lets multiple programming languages work with the same object specification. Protobuf is a great way to specify and serialize your objects, and it is available in many languages. Our services were mostly written in Java, but if we decided tomorrow to write a service in Python, that would not be a problem: the new Python service can generate its classes from our existing proto files.
  3. Security by design
    Security, user-friendliness, time, and budget can be a bit of a trade-off, but security should never suffer. Especially if you work with sensitive data, think about the unhappy flows upfront. Whatever realistically can go wrong, will go wrong, and there are always edge cases you have not thought about yet, I promise. Make sure that the edge cases you can think of are covered. In cases like this, with a limited budget and high ambitions, I like to use managed services as much as possible: managed services on cloud platforms come with a lot of security built in, and when resources are limited they also win on ease of use. For example, on Google Cloud Platform, you can use containers with Kubernetes (obviously). Its managed sibling is called Cloud Run, and it takes care of a lot of the things you don’t want to worry about. For example, it’s super simple to close off your Cloud Run instance and disallow unauthenticated invocations (and it’s super scalable out of the box!). Firebase Authentication is a great way to handle logins without having to store passwords yourself.
  4. Prepare for ‘smart data things’
    Machine learning is hot right now (obviously). Even if you don’t have a specific use case in mind yet, prepare for the future! Make data flow freely through your microservice architecture: have every service produce data, and have other services pick that data up. Every service that has subscribed to the data from the producing service (and, of course, is allowed to see it) can do whatever it wants with it. This is a perfect architecture for combining data from different sources to perform machine learning or large-scale data analysis on! My colleague Hayo wrote an excellent blog about this; you can find his article here.
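As a sketch of that last idea (message, package, and field names are invented for illustration), such a free-flowing event can be defined in Protobuf and published to a topic; any authorized subscriber can then deserialize it and use it for analytics or machine learning:

```protobuf
syntax = "proto3";

package events.v1;  // hypothetical package name

// Published by the producing service; analytics or ML services
// that subscribe to the topic consume it independently, without
// the producer knowing or caring who is listening.
message DonationCreated {
  string donation_id = 1;
  string charity_id = 2;
  int64 amount_cents = 3;
  int64 created_at = 4;  // Unix epoch seconds
}
```

Because the event is just a serialized message on a topic, adding a new consumer later (say, a reporting service) requires no change to the producing service at all.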

The takeaways.

Most of the things I told you in this blog, we found out the hard way. We were so excited to design a new platform from scratch that we wanted to use the newest technologies. And by the newest, I mean really the newest: several products and managed services we used were still in beta. For example, we deployed our microservices on Cloud Run in combination with gRPC, Cloud Endpoints and an ESPv2 proxy, which was still in beta when we started on the platform. It worked well until… it didn’t. Running into issues with timeouts and CORS is never fun, but especially not when you have strict deadlines to work with. I shouldn’t need to tell you that using beta products can cause trouble in production environments, especially if resources are limited. We pulled it off, but for the sake of your sanity, I would advise against it 😄.
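For reference, deploying a service to Cloud Run and locking it down to authenticated callers (as recommended in the security tip above) comes down to a single flag at deploy time. The service, project, and region names below are placeholders, not the real LoonGift setup:

```shell
# Deploy a container image to Cloud Run and reject any
# unauthenticated invocations of the service.
gcloud run deploy donation-service \
  --image gcr.io/my-project/donation-service \
  --region europe-west4 \
  --no-allow-unauthenticated
```

With `--no-allow-unauthenticated`, only identities granted the Cloud Run invoker role can call the service, so service-to-service traffic stays closed off from the public internet.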

Also, try to challenge the stakeholders. We wrote the gRPC microservices in Java because that’s what our customer wanted. However, in hindsight, Python might have been better. Especially if you work with decoupled, really micro microservices, think about the start time of your containers. Cloud Run scales down to zero, and if your microservices communicate with each other, think about the impact your design and programming language have on container start time. Java is notoriously slow to start up, so slow that we had to deal with timeouts from the ESPv2 proxy. Of course, there are ways to work around this, but the question remains whether it wouldn’t have been easier to just write everything in Python from the start.

Pages turned and lessons learned, the result is really something to be proud of: a scalable, cloud-native, future-proof platform with no infrastructural maintenance (thank you, managed services! 🙏). Go check it out at platform.loongift.nl and, if you’re in the Netherlands, see whether your employer can join too!


Floor Eigenhuis
incentro

Machine learning engineer and enthusiast. Interested in the technical and ethical sides of things.