Getting to know the cloud: Year 1

Sid Shankar
10 min read · Sep 29, 2019


When I was 6 years old, I remember asking my father what he did when he routinely vanished every day, for a good part of the day, to this mysterious place called “the office”. I remember him telling me that he worked with “software”. That word seemed so funny at the time. The 6-year-old me heard it as “soft wear” and immediately thought: “Hmm… my dad makes clothes for a living”.

Odd and obscure anecdotes aside, when I first started building software as a hobby, and then professionally, I had no idea where the journey would take me. I just assumed that all software engineers eventually became as good as Kernighan and Ritchie, able to create new programming languages at will, and that they had read every page of the “Introduction to Algorithms” book, with absolute mastery of the algorithms and data structures defined in there.

Yes, I grew up in the age of the internet, but from my experience with it in the early 2000s, I didn’t consider statically served pages and a smattering of online shopping portals to be “real software”. My ideas about how Google and Amazon worked were naive, and I had given little to no thought to the complexities these services had to deal with, and how neatly it was all hidden away behind the simple pages I ultimately saw in my browser window.

I spent the first 8–10 years of my career as a software engineer building desktop-bound applications. I worked on device drivers for data acquisition and instrument control, built compiler features for model-based design tools, and developed user interfaces and workflows for software that was destined to be installed and operated on a desktop.

On this journey, I learned the tools and techniques to develop, maintain and debug software of this nature, and found ways to write code that would treat processor and memory resources on the host computer with care, like the scarce resources that they were.

Over those same 10 years of my career, Software as a Service (SaaS) has exploded, both in the number of software solutions and services deployed that way, and in the complexity and sophistication of those solutions. There’s probably enough material to write a book (or a few) about that. As I see it, some of the catalysts for this are:

  • Massive increases in network bandwidth, and how easy internet access has become in most parts of the world. Many of us are fortunate enough to be able to access the internet as easily as turning on a tap.
  • The precipitous drop in the cost per unit of processing power, storage, and memory in the cloud, and the ubiquity of cloud service providers.
  • An explosion of open-source software tools and frameworks that allow applications and services to be built rapidly, making it far more accessible to quickly build out an idea and offer it as a service. In other words, you don’t need to be an accomplished software engineer with a four-year degree to approach this work anymore. A lot of the complexity is hidden away (for better or worse), allowing high-level thoughts and ideas to be easily and quickly expressed, and then made available equally quickly because of the above-mentioned factors. Anyway, I digress.

After a few years of desktop application building, I had this nagging feeling that I was missing out on entire categories of software engineering challenges by not working on large-scale SaaS. Building a few small applications for my own learning and experimenting outside of work is one thing. However, I was convinced that immersing myself in problems like this as part of my full-time job was the way to go, if I really wanted to learn how to build and operate software that was resilient and malleable, and that played and orchestrated well with other services.

It’s been a year since I started working on software in the cloud, and I wanted to share some of the differences (and similarities) I have observed, when developing software for the desktop (I’ll refer to this as “desktop software”) vs. a service or application hosted in the cloud (I’ll refer to this as “SaaS”).

A few notes:

  1. This isn’t meant to be an exhaustive list of differences between developing desktop applications vs. engineering SaaS. These are just my personal (and subjective) experiences with them.
  2. This isn’t about espousing one way of doing things, or another. It’s just a compilation of observations in the short time that I have been building things on the “other side of the fence”. :)

Debugging application logic

From a software engineering perspective, debugging the applications I built using C++, Java, MATLAB, etc., especially applications where I either owned or had access to the source code, was as simple as:

  1. Set a breakpoint in the source code.
  2. Attach to the process (if already running), or run the application in debug mode.
  3. Land at the breakpoint by executing the appropriate functionality, and start investigating.

(Image: Debugging, because who doesn’t love a good Heisenbug? Source: https://thedailyomnivore.net/2012/11/30/heisenbug/)

For SaaS, the first time I had to investigate a bug in the JavaScript-based frontend locally, I spent hours trying to figure out (on my own, and by asking colleagues) how to set up and launch all the right services locally, since the bug was in a part of the application that could only be reached through a workflow requiring the interaction of all those microservices. At least in this case, the saving grace was that browsers come with developer tools to inspect and set breakpoints on JavaScript code.

A few days later, I had to investigate a bug in Python source code in a Django-based system that was containerized, and soon had the sinking realization that there was no out-of-the-box “set a breakpoint in an IDE and hit Run” workflow for stopping at a breakpoint and inspecting things. Yes, it is possible, and I figured it out, but it still takes more setup and preparation than I was used to.
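
For the curious, here’s a rough sketch of the kind of setup I ended up with, assuming the container runs Django and has the debugpy package installed; the port and environment variable name are just illustrative choices, not anything prescribed by Django:

```python
# Somewhere early in start-up, e.g. at the top of manage.py or wsgi.py.
# Opt in via an environment variable so this never runs in production.
import os

if os.environ.get("ENABLE_DEBUGPY") == "1":
    import debugpy

    # Listen on all interfaces so an IDE running on the host machine can
    # attach; the container also needs to publish port 5678.
    debugpy.listen(("0.0.0.0", 5678))
    print("debugpy: waiting for a debugger to attach on port 5678...")
    debugpy.wait_for_client()  # block start-up until the IDE connects
```

With that in place, VS Code (or any debugpy-aware IDE) can attach to the published port, and the familiar “stop at a breakpoint and poke around” workflow comes back, albeit with more ceremony than a desktop app ever needed.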

The positive side of this is that I got much better at eyeballing code to find possible issues, and at instrumenting the code I wrote with logs and metrics, so that any bugs I introduced could be found a lot faster by following the presence or absence of messages in our logging and metric collection services.
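
As a concrete (if simplified) illustration of what I mean by instrumenting code, here’s a sketch using Python’s standard logging module and a StatsD-style client; the metric names and the process_upload() workflow are made up for illustration:

```python
import logging

from statsd import StatsClient  # assumes the `statsd` package is installed

logger = logging.getLogger(__name__)
statsd = StatsClient(host="localhost", port=8125, prefix="myapp")


def process_upload(upload):
    """Hypothetical workflow, instrumented with logs and metrics."""
    statsd.incr("uploads.received")
    logger.info("processing upload id=%s size=%d", upload.id, upload.size)
    try:
        with statsd.timer("uploads.processing_time"):
            result = handle_upload(upload)  # hypothetical helper doing the real work
    except Exception:
        statsd.incr("uploads.failed")
        logger.exception("upload %s failed", upload.id)
        raise
    statsd.incr("uploads.succeeded")
    return result
```

A missing uploads.succeeded counter, or a spike in uploads.failed, tells you something is wrong long before anyone files a ticket.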

Determining usage and error issues via metrics and logging

Speaking of logging usage and errors, enterprise software running on a desktop needs to phone home, at least intermittently, to transmit anonymized usage statistics, crash reports or error logs. That is, assuming the user of that software isn’t behind a network and firewall that prevents this information from being sent back, and actually gave you permission to phone home in the first place.

For SaaS, my experience so far has been that, at least for services and applications that aren’t in a private cloud environment, the provider of the service is free to make calls to logging and metrics services to continually log information, as long as it pertains to the functioning of that application.

Why is the ability to log errors and usage metrics over the network so important? In desktop applications that are missing this functionality, especially in major or critical enterprise software, the only way you find out about major issues and bugs that were missed by multiple rounds of QA is through a sudden influx of support tickets to your frontline support team. Ideally, you want to be finding out about these issues as they happen, and have a fix ready (or a team working on it) before the first support call even comes in.

Leaving aside errors and crashes for a moment, in the absence of metrics, it’s hard to see how your user base is using new and existing features, assuming they have even discovered those features you spent months working on.

In the SaaS world, services like Sentry, Datadog, New Relic, etc. make tracking logs and metrics a joy. Similarly, a service like Optimizely integrated into your web application lets you run and track A/B experiments, enable or disable them, and react to user issues in real time. The tight feedback loops of SaaS can be hard or impossible to replicate in desktop software, cutting you off from quick user feedback about how features are actually used.
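
As an example of how little ceremony this takes, here’s roughly what wiring Sentry into a Django project looks like; the DSN is a placeholder, and the sample rate is an arbitrary choice on my part:

```python
# Typically placed in settings.py; assumes the sentry-sdk package is installed.
import sentry_sdk
from sentry_sdk.integrations.django import DjangoIntegration

sentry_sdk.init(
    dsn="https://examplePublicKey@o0.ingest.sentry.io/0",  # placeholder DSN
    integrations=[DjangoIntegration()],
    traces_sample_rate=0.1,  # sample 10% of transactions for performance data
)
```

From that point on, unhandled exceptions in views are reported automatically, complete with stack traces and request context, which is exactly the kind of feedback loop that desktop software has to work much harder for.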

Continuous Integration & Continuous Deployment (CI / CD)

Of course, those tight feedback “loops” I just referred to wouldn’t be possible without CI/CD. When I worked on desktop applications, producing a “release” was usually a major cross-team effort within the organization.

Planned releases often occurred months apart. There may be no process in place for “unplanned” releases, which means you’re forced to offer inconvenient workarounds for issues, or no solution if a workaround doesn’t exist. Even if there is a process for unplanned releases, it can take days, or even weeks to get the pieces in place necessary to produce a new release that fixes a critical bug, or an issue that’s causing a major inconvenience.

While some of this is a function of organizational processes, size, and culture, my experience was that for desktop software, continuous integration was the best you could do. You could of course “continuously deploy” internally for QA or other internal uses, but without being able to continually deploy new bug fixes, enhancements and features to end users, you didn’t reap the benefits of true CD.

Scalability in the face of unpredictable compute loads

I spent a great deal of time squeezing every bit of CPU and memory performance I could out of applications I worked on. Processor clock speeds have more or less plateaued, and there’s a limit to the number of cores you can squeeze onto a processor.

Besides, your software needs to be built to take advantage of the parallelism that multi-core processors offer, and that often isn’t the case. If you want performant software that doesn’t bog down a system, then you can’t play fast and loose with CPU and memory consumption. Therefore, when developing desktop applications, benchmarking for performance is particularly important.

In the SaaS world, you’re free to spin up multiple instances of your services when faced with a surge in load, and you can run select CPU or memory intensive services on compute resources with superior processor and memory capabilities.

Even so, performance benchmarking shouldn’t lose its importance for SaaS. Once these services are in production, the automatic scaling (up or down) of compute resources often masks runtime speed and memory consumption issues. I have seen memory leaks go undetected for far longer than they would have if those same services had been running on a desktop, simply because killing and spinning up new instances happened so automatically and easily.

It does take more discipline to design and implement for low resource utilization, especially for non-critical applications with no *hard* latency requirements, where it may not feel necessary.
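
To make that concrete, here’s the kind of lightweight check that would have surfaced those leaks sooner: periodically emit the process’s resident memory as a gauge, so a steady upward slope is visible on a dashboard even while instances are being recycled underneath you. This sketch assumes psutil and a StatsD-style client; the metric name and interval are arbitrary:

```python
import time

import psutil  # assumes the psutil package is installed
from statsd import StatsClient

statsd = StatsClient(host="localhost", port=8125, prefix="myapp")
process = psutil.Process()  # the current process


def report_memory_forever(interval_seconds: int = 60) -> None:
    """Emit resident set size (in MB) every interval_seconds; run in a background thread."""
    while True:
        rss_mb = process.memory_info().rss / (1024 * 1024)
        statsd.gauge("process.rss_mb", rss_mb)
        time.sleep(interval_seconds)
```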

Data Considerations

The storage and security of data had never been a concern for me until now. Of course, that was specific to the applications I was building. If I were working on, say, the desktop version of software like TurboTax, which works with sensitive personal and financial information, having solid access controls in place would be necessary. In that case, protecting data access from malicious actors (both human, and other programs) would be the primary challenge.

In the cloud, many other degrees of freedom (to borrow a term from mathematics) open up when it comes to data. Take, for example, data residency: the requirement for data to be localized to a specific geographical region. Systems and infrastructure need to be set up and configured to ensure these residency requirements are met, and failing to meet them can result in serious compliance violations.

Similarly, for software solutions of a certain type, or in certain industries, clients may insist on physical separation or isolation of their data. There may be concerns around data leakage, or a desire to protect themselves from malicious tenants sharing the same physical infrastructure. This needs to be accounted for when designing the system.

Managing data access feels exponentially harder. At least when your software resided completely on the end user’s machine, apart from the due diligence access controls your application needed, the rest was up to the end user. On the other hand, for your software solution hosted in the cloud, all it takes is the production database or application credentials falling into the wrong hands. A rogue (ex?) employee, maybe a careless one, or the misfortune of being targeted by a sophisticated criminal could all lead to serious problems. In addition to all the technical steps you take to secure your data, there is a need for strong processes and a system of audits, checks and balances to ensure the security and integrity of data.

Wrapping up

There’s been a ton of learning over the past year. To be clear, I’m not “picking sides” or preaching one way of building software over the other. I chose to move over to building software in the cloud because I wanted to experience the engineering challenges that came with building resilient and scalable applications that were served this way. As an engineer who loves building user-centered software products, I’ve been enjoying the quick iteration cycles and tight feedback loops that seem to come with the territory on this side of the fence. There’s SO much more to learn, and I’m excited to see where that takes me.


Sid Shankar

Senior Engineering Manager @ GitHub ~ opinions are my own and do not represent the views of my employer.