Scale your application and software process from 1 user to 1 billion users

~5000 words ~200 slides ~30 minutes

sasa.kovacevic
THREE DAYS AWAKE

--

Level 300, advanced material. Requires an in-depth understanding of features in a real-world environment, and strong coding skills. Provides a detailed technical overview of a subset of product/technology features, covering architecture, performance, migration, deployment, and development.

Before you dig in, here is some quick info on how to reach me — Twitter | LinkedIn.

Let me know what you loved, what you hated, what left you unimpressed and what you want to know more about.

If you’d prefer to listen to great music while reading, check out SomaFm.com
(no affiliation, not sponsored, just greatly enjoyed).

Why this topic?

In today’s fast-paced modern development cycle you simply cannot be left behind, or you may never recover. We are living in a different world now.

As Scott Hanselman succinctly puts it:

Peter Sankauskas clarifies:

Ignore this at your peril. You won’t be able to reach the best talent in the market, and consequently you won’t be able to reach the full potential of your users and delight them.

You need to do everything in your power to be where your users are,
as Satya Nadella muses:

Can you really achieve these goals with your current development and operation practices?

Well, let’s find out…

But before we do — I’ll talk (write) about a lot of topics here, but don’t worry: most of them will be mentioned only in passing, as I’ll go into more detail on them some other time. Just be ready to think fast and pick up various tidbits here and there, so that you can dig deeper into a particular topic when the need arises.

Also, the best practices described here do not necessarily invalidate your own thoughts or patterns, but you had better have a very good reason for deviating from them. Sometimes it is absolutely necessary, but the end goal is to embrace the common patterns and get rid of the case-by-case variations that are hard to handle in the long run.

On-premise

Let’s consider the current, typical, on-premise application and the software process.

This is not a good way to start if you are building an application today that you know will need to scale, but suppose you don’t know. Perhaps you are evaluating your options. Fine.

So you start with a small team and a few requirements, and try to build a proof-of-concept application. You build it with a few developers, not really following the waterfall methodology as specified, but when the app is ready it turns out that, inadvertently, that is what you ended up using. Perhaps (may God help you) some sort of waterfall is even mandated in the company.

For this application any methodology will do. Just get the app jump-started and bring it as quickly as possible to minimum-viable-product status.

A Minimum Viable Product (MVP) is: “[the] version of a new product which allows a team to collect the maximum amount of validated learning about customers with the least effort” — Eric Ries.

You will deploy this application on your own: in the basement, on the top floor, or simply to an old computer you have just sitting next to your feet.

The entire application is hosted on a single computer in your enterprise building — web, database and all the other components. You allow a few users to access it — internal, partners, external — it doesn't matter.

You do have a firewall (at the very least) between the application and the world, right?

If you do < this instead of this > you are beyond any help I can provide, except — don’t, please don’t.

I am planning to go into more depth on privacy and security in the future, but for now — I’ll just mention security related issues from time to time so we won’t forget about this important aspect of application design.

Hosting provider

So, your application is getting at least some traction? Great. Let’s move it from your old computer on the floor, to a reputable(!) hosting provider.

Everything changes, everything stays the same.

What you’ve done is simply take your existing application and replace your own computer with one you rent from a hosting provider. The web application and the database are still on the same machine. But now, if the machine dies, your hosting provider will replace it.

I am assuming you are not using a shared plan, of course… Privacy, security and application design are quite different if you are on shared infrastructure… and we’ll soon go into more details on this.

You did remember to set up your own backup, didn’t you? You are not just depending on the backups of your hosting provider? Right? Right? Again, just some common sense.

Cloud

Before going through the steps of moving your application to the cloud, let’s just go through a few things first — to familiarize ourselves with the cloud.

Azure will be in the focus, but AWS (Amazon Web Services) is just fine as well. Anything other than these two and you better have a damn good reason — partnerships, cross promotion, focus on the same core business, or anything really compelling (and you better be ready to defend your decision during every step of the way).

Cloud provider: Microsoft Azure. IaaS (Infrastructure as a Service).

We’ll start with an IaaS solution first, but just as a reminder, consider the differences between Infrastructure, Platform (PaaS) and Software as a Service (SaaS).

IaaS, PaaS, SaaS solutions contrasted with the on-premise solution.

As you move more to the right (not the political right, but the right side of the slide — not insinuating anything here) you give up more and more control, but with the benefit of having more and more focus on your core business.

With IaaS you give up on managing underlying networking, storage, servers and virtualization solutions. With PaaS you also give up the underlying operating system, database and security and integration management. And finally with the SaaS you also give up control of the applications and even some of the data, that is now managed for you.

And all of that is good. Great actually! Only if you have special needs that the vendor cannot fulfill should you ever consider managing networks, storage, servers, virtualization, operating systems, databases, and security and integration (not talking about your application, you still have to manage security and integration, only talking here about the underlying infrastructure).

Why the cloud?

Why the cloud? On-premise versus cloud capacity.

I do not want to go into too much detail here on why cloud and not on-premise (or hosting, shared or otherwise), except to give you one reason (the usually cited one, but bear with me — I have a more compelling way of showing it) why the cloud wins hands down. Even if it were not for all the other reasons, this one alone should convince you to go to the cloud for your application scaling and development needs.

Consider the following… You must estimate the hardware needed for your newly created application. Let us assume you have done this many times before and you are really good at it. But the risk is still there: will you under-provision? Over-provision?

Well, here is why even if you are 100% correct in your estimation — you are still — WRONG.

But first, you — overestimated.

As you can see, the area of computing power you have is so much greater than what you will ever use, so much so that you’d be hard pressed to justify your decisions. You will be paying for all this extra capacity that — you — will — never — use.

OK. But what if there is an even worse situation that can occur? According to Murphy’s law:

Now you underestimated the capacity needed. And it’s even worse than before. Now you are unable to meet the demand — at — the — most — critical — time, when you are making the bulk of your money from the application.

And, now — the perfect estimate.

I hate to be the one to tell you, but look at that area. Just — look at it. You did everything right, you estimated perfectly, and yet — you are still — wrong. You are now spot-on at the peak, but all that wasted computing power during the rest of the year is surely not the best solution, is it?

Can we do better together? Yes we can! By using the power of — the cloud.

Why is the cloud solution superior to any other? No waste! It can handle any demand you throw at it! We can even always ‘over-provision’ slightly, just to have the best application experience at 1% load as well as at 99% load. Everyone gets the best possible experience.

The application works and handles the same for 1 user as well as for 1 billion users. This is our goal!

Are you convinced? Of course, you’d have to be insane not to be.

But —

Q: But what if we need so many servers Azure runs out?

Ah, there is always one… one who cannot really believe that this sort of computing power is possible, let alone within one’s grasp — and for cents on the dollar/euro when compared to on-premise solutions.

Q: But what if we need so many servers Azure runs out? A: You won’t.

Consider the game Titanfall. They host the gaming sessions in Azure and are using more than 100 thousand virtual machines (VMs) to achieve this.

Will you be needing more than 100 thousand machines for your application? I doubt it. Because otherwise — Microsoft would have you on speed dial.

So forget about that worry altogether, and instead worry about how you will delight your users with your application. Now that is something you need to figure out — before needing 100 thousand VMs.

Infrastructure as a Service

Let’s get back to your application and move it to the cloud.

You’ve moved to the cloud and are now hosted on Azure.

You’d do well to revise your original development plan.

I am using the Visual Studio icon when talking about development and operations. It is the best IDE (integrated development environment) in existence today (for me at least), and I’ve worked with them all — Eclipse, Xamarin Studio, NetBeans, SublimeText, VIM, Atom, CodeAnywhere, etc.

But here, we are IDE agnostic. Choose the one you love, enjoy using and that does the job for you.

You might consider going from an unorganized or waterfall approach to scrum (or agile in general). The operations team is not the same team as the development team. Requirements now come from project stakeholders.

You also won’t invest any money in support, because for now you are just testing the waters. Any questions you have, you will have to address through internet searches. Not great, but one must learn somehow, and learning by doing in the cloud is simply the best.

This is your typical development workflow:

You have a plan. There are people in the dev team. There is also QA. And there is the final act of release.

However, bugs in any part of the application delay release of the entire solution. QA always has a lot of tests to perform.

You will know that you are reaching the limits of this approach when QA starts delaying releases because they haven’t had the time to properly test the application. The wrong solution is to slow down releases to QA; in fact, it should be the other way around — release to QA faster and more often.

How will this help? With faster releases to QA they will have less to test, less to analyze, less to troubleshoot. And your actual time to release will improve.

But still, your site is just one VM. We’ll have to consider how to scale it when the users come swarming.

One solution is to scale up.

Scaling up is great at this point, but only in so far as it gives you room to maneuver and prepare your application for scaling out.

Scaling up?

Scaling up means beefing up the machine. Not enough memory? Increase it! Not enough IOs? Increase them. And so on. However, this cannot go on indefinitely. Firstly — beefier machines are comparatively expensive, and secondly — there is always a limit.

Here are some of your options for scaling up:

When 32 cores, 448GB of RAM and 6TB of HDD are not enough… what then? And that is just one of your many problems — you haven’t yet even considered fail-over and redundancy.

Well, Dorothy, it’s time to leave Kansas, and never ever look back.

Get rid of virtual machines. Let Azure manage services.

Platform as a Service

A quick reminder… IaaS, PaaS, SaaS solutions contrasted with the on-premise solution.

Here is the application now:

Finally, we make the transition from a single machine holding everything to a services architecture. The database is now an Azure service. The web application is another Azure service.

You might have noticed that ever since we first switched to Azure, a load balancer has been present, but it was pretty useless until now. Now the load balancer will allow your application to scale — for the first time — out.

Before going into that, consider the development best practices at this point in time. There are quite a few more developers, as you are working harder now to satisfy and delight the users. You have started paying for support, because you need to be able to report problems and have them handled quickly.

If you haven’t so far, now you really must have proper logging procedures.
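To make “proper logging” concrete, here is a minimal sketch of structured logging, in Python purely for illustration (the field names and the ‘component’ label are my own invention): every log line is a JSON object, so a log aggregator can filter and chart by field instead of grepping free-form text.

```python
import json
import logging

# Every log line becomes a JSON object; aggregators can then query by
# field. The field names here are illustrative, not a standard.
class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "time": self.formatTime(record),
            "message": record.getMessage(),
            "component": getattr(record, "component", "web"),
        })

handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("app")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("order placed", extra={"component": "checkout"})
```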

Before we scale out we have to consider the way the application is built. This means talking about stateful versus stateless protocols.

A stateless protocol does not require the server to retain session information or status about each communications partner for the duration of multiple requests. In contrast, a protocol which requires keeping of the internal state on the server is known as a stateful protocol.

Stateful doesn’t scale as well as stateless does.

Consider the same example from the stateful and the stateless perspective.

Stateful
Animation

The issue with stateful protocols (as illustrated by the animation above) is that the load cannot be distributed well across the different instances: because the server must retain information across requests, a connection to one instance is stuck there until the entire workflow is completed.

And if one such workflow either goes rogue or is heavier than expected, it impacts not only that workflow but all the other workflows on that instance.

The instance can be scaled up, but only up to the limit — and when the instance breaks under load, then the problem simply escalates to the next instance. Technically a single rogue workflow can bring down an entire fleet of instances.

The stateless situation allows us to better defend against this.

Stateless
Animation

To handle the load with stateless protocols, we can immediately distribute the workflow across instances, since each call is a separate entity in its own right.

Then, when issues arise with a rogue workflow or a heavy workflow, we can manage them by simply increasing the number of instances.

This is a simplistic view — if there is a rogue workflow we want to limit its impact through other means — but if it is just a heavier workflow that we wish to handle, with stateless protocols we can.

This does not mean that you cannot have session information or other information shared across calls to different instances; it just means that you keep and handle this information — elsewhere.

Where exactly?

The usual suspects in this case are the database and/or a cache (Redis or Memcached are commonly used).

Consider the case of a shopping cart. With a stateless protocol you can allow each request the user makes while browsing to be distributed to a different instance, while you keep the items the user has added to the cart in the cache and/or persisted to the database, which all instances can ‘share’ and use at the same time.
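As a sketch of what that can look like in practice, assuming a reachable Redis instance and the redis-py client (the host, port, key names and session_id parameter are all illustrative):

```python
import redis

# Any instance can serve any request, because the cart lives in Redis,
# not in the web server's memory.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def add_to_cart(session_id: str, item_id: str, quantity: int) -> None:
    # HINCRBY is atomic, so concurrent requests landing on different
    # instances cannot lose updates.
    r.hincrby(f"cart:{session_id}", item_id, quantity)
    r.expire(f"cart:{session_id}", 60 * 60 * 24)  # drop stale carts after a day

def get_cart(session_id: str) -> dict:
    return r.hgetall(f"cart:{session_id}")
```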

Now would also be a good time to scale the web app, the database and the cache separately.

Caching patterns. Circuit breaker, background data push, cache aside.

Popular cache population strategies are circuit breaker, background data push and cache aside (or on-demand cache, as it is also called).

More information on those can be found on the asp.net website. Here are a few excerpts from that link.

Circuit breaker: “The application normally communicates directly with the persistent data store, but when the persistent data store has availability problems, the application retrieves data from cache. Data may have been put in cache using either the cache aside or background data push strategy. This is a fault handling strategy rather than a performance enhancing strategy.”

Background data push: “Background services push data into the cache on a regular schedule, and the app always pulls from the cache. This approach works great with high latency data sources that don’t require you always return the latest data.”

Cache aside: “The application tries to retrieve data from cache, and when the cache doesn’t have the data (a “miss”), the application stores the data in the cache so that it will be available the next time. The next time the application tries to get the same data, it finds what it’s looking for in the cache (a “hit”). To prevent fetching cached data that has changed on the database, you invalidate the cache when making changes to the data store.”
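Here is what cache aside can look like in code — a minimal sketch in Python, assuming the redis-py client; `db` stands in for whatever data-access layer you use, and the key names and TTL are illustrative.

```python
import json
import redis

r = redis.Redis(decode_responses=True)
TTL_SECONDS = 300  # illustrative time-to-live

def get_product(product_id: str, db) -> dict:
    # Cache aside: try the cache first...
    cached = r.get(f"product:{product_id}")
    if cached is not None:
        return json.loads(cached)          # hit
    product = db.load_product(product_id)  # miss: go to the data store
    r.setex(f"product:{product_id}", TTL_SECONDS, json.dumps(product))
    return product

def update_product(product_id: str, fields: dict, db) -> None:
    db.save_product(product_id, fields)
    # Invalidate so the next read repopulates the cache with fresh data.
    r.delete(f"product:{product_id}")
```

The invalidation on write is what keeps the cache from serving stale data, exactly as the excerpt above describes.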

Now, let’s return to the application… As with the VMs, the web app service also has a ‘few’ size options.

This allows you to choose the correct type for the job. Need more processing power, more memory, more IO, more storage — there is a type available.

That takes care of caching and the web app type. But what about the database? Which type of database to use also depends on the job you need it to perform.

Gone are the days of having a one-size-fits-all database for the project. Just as some parts of the application are better suited to a traditional SQL database, and some to a NoSQL database, there are also very specific workloads that are best served by something other than a database.

Always remember, the cloud provider has many services that you can use, at a fraction of the cost, so it is worth investigating all the options. More about this later on, but for now the question is — SQL versus NoSQL. Or — is it?

Managing the database load. SQL and NoSQL.

SQL databases have certain advantages: they are long-established and very well tested, there are decades of patterns available to use, and they handle well-structured data really well; depending on the structure, they can generally work for 1–10 million users. NoSQL databases have advantages too: they handle highly non-relational, unstructured data, and massive amounts of it, with rapid ingest of millions of records, and they enable the low-latency needs of the application.

So, let’s not pick one or the other — let’s pick the right tools for the right data and loads.

At this point in the run-up to handling a billion users, the operations and development teams must be very closely connected, and must cooperate to handle the massive amounts of traffic ahead of us.

Redundancy and speed become more of an issue, as users from all over the world connect to the application.

We need to now transform from < this to this >.

The application needs to be redundant and fast for the users. Achieving this is hard, but there are ways. Start by moving all the static content to a CDN (content delivery network), and in some cases some dynamic content as well. Make the database globally redundant and the cache locally available. One location/data center is no longer enough. Go where your users are. EU, Brazil, US, Asia? For a truly global presence, be everywhere.

Why is this important?

If a site doesn’t load within 3 seconds, 57% of visitors abandon it and 80% never return.

Next we need to simplify development and operations. How?

Scaling front-end separately from the back-end is now the next goal, so switching from web app to web and worker roles is one possibility. We’ll discuss other possibilities later on.

At this point if you are not running multiple teams with multiple projects (solutions) you are doing something very wrong.

Requirements used to come from you, then from your stakeholders, then from the users themselves. While you still need to listen to your users, your decisions from now on must be guided by raw data. Users may tell you they like or hate a certain option, but only raw data can tell you whether that is true, and why.

You need to have 24/7 support available to you, with a response time of under 2 hours. You must collect every possible analytic. One that usually gets neglected, and is very important, is which feature is used by how many users, how frequently, and how it is performing. Others are errors, slowdowns, availability, statuses, etc.

You must know there is a problem before your users do — and then push this further, so that you know there is going to be a problem before there is a problem.

Monitoring and health-checks are mandatory. Dashboards, counters and diagrams are going to be your best friends.

From the security standpoint, you must really push for penetration testing: you are now so big that you will be targeted, if you haven’t been already, and you have so many users that even accidental security is a concern (security issues that your own users may stumble into). Security, privacy and compliance teams are now a part of your life.

Since we have switched to multiple data centers, this is your last chance to get your infrastructure under control (if you haven’t already). Version-control it, and test it regularly. Nothing should be added to or removed from your deployment (not infrastructure and not services) unless it is done through version-controlled scripts.

Now for some more information about the data that will drive your application and your business going forward.

Decision cycle.

Observe to find the customer pain point, analyze the data you are collecting, decide quickly on a course of action, implement the proposed solution, then observe to find the next customer pain point…

Continuous integration.

Monitoring and analyzing the situation 24/7 is a must. Agile development all the way — sprints, stories, tasks, bugs. One-click builds. Automatic tests. Automatic incremental deploys, with automatic rollback if needed.
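As an illustration of “automatic rollback if needed”, here is a deliberately simple sketch in Python. The deploy.sh and rollback.sh scripts and the health URL are placeholders for whatever your pipeline actually calls.

```python
import subprocess
import sys
import time
import urllib.request

HEALTH_URL = "https://staging.example.com/health"  # illustrative

def healthy(url: str, attempts: int = 5) -> bool:
    # Poll the health endpoint a few times before giving up.
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass  # connection refused / HTTP error: try again
        time.sleep(10)
    return False

# deploy.sh / rollback.sh stand in for whatever your pipeline invokes.
subprocess.run(["./deploy.sh"], check=True)
if not healthy(HEALTH_URL):
    subprocess.run(["./rollback.sh"], check=True)
    sys.exit("deploy failed the health check; rolled back")
```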

Finally DevOps

When using cloud resources, optimize for speed, not cost: if you optimize for speed and speed up your devops enough, it will end up costing you less. Not to mention fewer meetings, fewer developers needed, fewer delays and less technical debt, making it easier for new people to join the development effort.

How to achieve this? Number one: stop developing monolithic applications. Rather, think about decoupled and idempotent applications, meaning (in short, and simplifying somewhat) that the application components are decoupled from one another as much as possible, and that the same action invoked with the same parameters will give the same result every single time.
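A tiny sketch of what idempotency buys you, in Python, with an in-memory set standing in for what would be a durable store of processed operation ids:

```python
processed: set[str] = set()  # in production: a durable store

def apply_payment(operation_id: str, account: dict, amount: int) -> None:
    # Idempotent: replaying the same operation_id (after a retry or a
    # redelivered queue message) changes nothing.
    if operation_id in processed:
        return
    account["balance"] += amount
    processed.add(operation_id)

acct = {"balance": 100}
apply_payment("op-42", acct, 50)
apply_payment("op-42", acct, 50)  # duplicate delivery: no double charge
assert acct["balance"] == 150
```

Retries and redelivered messages become harmless, which is exactly what you want once decoupled components are talking to each other over a network.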

< Microservices or containers >.

In computing, microservices are small, independent processes that communicate with each other to form complex applications which utilize language-agnostic APIs. These services are small building blocks, highly decoupled and focused on doing a small task, facilitating a modular approach to system-building.

Docker containers wrap up a piece of software in a complete filesystem that contains everything it needs to run: code, runtime, system tools, system libraries — anything you can install on a server. This guarantees that it will always run the same, regardless of the environment it is running in.

You can even combine the two.

Microservices, when done right, are by their very nature the epitome of decoupling and idempotency. They allow different teams to develop and deliver at the same time, independently of each other, while still allowing for quick and easy integration into the overall architecture.

Containers allow fast development and ensure that no matter where you deploy (on-premise, hosting, Azure, AWS, etc.) you will have the same environment for your services to run on, further abstracting away the underlying operating system and its services.

More info: a large number of Docker containers live for 0–1 minutes, which means scenarios such as (quick) test, build and process are now the new black. Mesosphere provides high availability, redundancy, scaling, analysis, …

The software process now needs to change quite a bit.

You need to have support available within the hour.

You need to improve on the metrics you already have and start collecting everything you possibly can, as you will be unable to diagnose anything without data. And how can you know which data will help you in the diagnostic process? You can’t. Which is why you collect everything!

Your services need to be able to gracefully degrade if there is an issue somewhere in the system, and the services need to be self-healing.

Long gone are the days of data living in plain text. You really should — nay, need to — encrypt everything, if you aren’t already. And this does not mean just from the outside to your network; it also means within your network (yes, even the parts ‘not visible’ from the outside).

Ensure that if a service has issues, it does not impact the overall system (the other services).
E.g. if search does not work, fall back to the cache or to top results, or remove the search option temporarily. If an external service has issues, make sure you have an alternative, e.g. for social sign-in (Facebook, Twitter, Google); otherwise your users will be unable to sign in due to the failure of a service you have no control over.
Make sure not to D/DoS (Distributed/Denial of Service) your own services. E.g. if one fails, don’t allow the others to hammer it with requests so hard that it can never recover. Check for service restoration with an increasing and randomized interval (see the sketch after this list).
If you offer an SLA (service-level agreement), check the SLA dependencies on external services. E.g. social login.
Encrypt everything, in transit and at rest. There exist patterns for accessing/browsing/searching encrypted data.
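Here is a minimal sketch of that increasing, randomized retry interval (exponential backoff with jitter) in Python; the exception type and the 30-second cap are illustrative choices:

```python
import random
import time

def call_with_backoff(operation, max_attempts: int = 6):
    # Increasing, randomized waits: the failing dependency gets room to
    # recover, and callers don't stampede back in unison.
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise
            # Exponential backoff capped at 30s, with full jitter.
            time.sleep(random.uniform(0, min(30, 2 ** attempt)))
```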

As the development team is now the operations team — devops is finally here.

Data determines the plan to be implemented. Each team delivers independently and no longer do bugs in one component delay all others.

As the ownership for issues has changed from the operations team (previously) to the devops team (now), each team is now responsible for their code end to end, from analyzing data to implementing features/fixes to making sure the code runs in production with no problems.

This has the added benefit of bringing code quality up, because if anything goes wrong, the developer (and their devops team) will be called upon to handle the issue.

Escalation procedures ensure that the responsible devops team is the first to receive an alert from production. In case they are unable to find or fix the cause, automatic escalation ensures that other parts of the organization can be called upon to jump in and assist if needed, or at least be warned about an issue that may impact their components as well.

Communication is key. Internally. And externally.

You can take my word for it or not, but if you are a developer, you will work that much harder to avoid being a part of the call depicted in the slide above. And you will make sure to improve your code quality to have no issues (ideally) or at least decrease them as much as possible (realistically).

This will also contribute to test driven development culture being enthusiastically embraced by the developers.

Today we built Shopify 500 times, deployed to prod 22 times, peaked at 700 build agents, spun 50k docker containers in test and 25k in prod.
Amazon.com deploys every 11.6 seconds (weekday). ~ 0.001% of deployments caused an outage.

A simplistic overview of what your version control should look like/allow:

You need to have a master branch (live in production), a staging branch (final testing) and a development branch (for feature integration from the many feature branches).

Hotfixes are applied to staging and then incrementally rolled out to production. When testing, the environment you set up needs to be identical to the production environment. And when deploying to production, you slowly allow a portion of production requests to come in to the staging environment and check that it handles them well. Then keep increasing the percentage of requests until 100% are handled by the new code.
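A sketch of that incremental roll-out decision in Python; in reality this logic usually lives in your load balancer or router, and the percentage would come from configuration rather than a constant:

```python
import random

CANARY_PERCENT = 5  # start small, then ramp toward 100

def pick_backend(canary_percent: int = CANARY_PERCENT) -> str:
    # Route a small slice of live traffic to the new code; watch the
    # metrics, then increase the percentage until it takes everything.
    if random.uniform(0, 100) < canary_percent:
        return "staging"   # new code
    return "production"    # current code
```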

Software as a Service

A quick reminder… IaaS, PaaS, SaaS solutions contrasted with the on-premise solution.
Stand on the shoulders of giants — it will allow you to focus on your core product.

Database options

You need to choose the right tool for the job. In the case of the database, you have many options. Remember, don’t try to shoehorn all your data into a one-size-fits-all option; find the right combination that works for the data you need to store and process.

Relational database: SQL DB, Oracle, MySQL, Postgres, …
Key/value: Azure Table Storage, Redis, Memcached, …
Column: Cassandra, HBase (built on the Hadoop framework), …
Document: DocumentDB, MongoDB, RavenDB, CouchDB, …
Graph: Neo4j, …

At this scale you will need to partition your data as well, either vertically (separate the data — metadata in the DB, blobs in blob storage) or horizontally (data by name — a hash of the name, actually! — or users from different countries into different databases/database tables).
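A minimal sketch of the horizontal case in Python: the shard is chosen from a hash of the key, not the key itself (the shard names are illustrative):

```python
import hashlib

SHARDS = ["users-db-0", "users-db-1", "users-db-2", "users-db-3"]

def shard_for(key: str) -> str:
    # Hash of the name, not the name itself: hashing spreads keys
    # evenly, while raw names (or countries) can hot-spot one shard.
    digest = hashlib.md5(key.encode()).hexdigest()
    return SHARDS[int(digest, 16) % len(SHARDS)]

print(shard_for("alice@example.com"))
```

Note that simple modulo sharding makes adding shards painful later; consistent hashing is the usual next step, but the principle is the same.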

Plan for failure

Double down on data centers in each location. Test your graceful degradation, then test it some more. Be the first to know when your SLA is exceeded, and communicate properly with your users. Common initial patterns: a Twitter feed of health status changes, and a health page hosted elsewhere, e.g. on Tumblr (as long as it is outside your own system).

Monitoring

Use all the tools you can find — the Azure management portal, the NewRelic service, … Be able to turn DEBUG/TRACE on with a click. Other analytics: always on, async, to a queue. Automatically detect anomalies in the system (e.g. database calls taking longer than expected). Use alerting tools such as PagerDuty to simplify your on-call alerting.
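As a sketch of “database calls taking longer than expected”, here is one naive way to flag latency anomalies in Python. The window size and the three-sigma threshold are illustrative, and in production the alert would go to your paging tool, not stdout:

```python
from collections import deque
from statistics import mean, stdev

window: deque = deque(maxlen=500)  # recent DB call latencies, in ms

def record_latency(ms: float) -> None:
    window.append(ms)
    if len(window) >= 30:  # wait for a usable baseline
        mu, sigma = mean(window), stdev(window)
        # Flag calls far outside the recent norm.
        if ms > mu + 3 * sigma:
            print(f"ANOMALY: db call took {ms:.0f}ms (mean {mu:.0f}ms)")
```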

Caching

Cache more. Then more again. Then everything you possibly can. Then consider what you previously thought impossible to cache.

Queuing

A queue is one of the best ways to decouple services.
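A toy illustration in Python, using an in-process queue and a worker thread; in production this would be a durable queue service, but the decoupling principle is the same — the handler only enqueues, and the worker drains at its own pace:

```python
import queue
import threading

# If the worker slows down, requests still get accepted (up to the
# queue bound) instead of failing.
jobs: queue.Queue = queue.Queue(maxsize=1000)

def handle_request(order_id: str) -> None:
    jobs.put({"order_id": order_id})  # fast: just hand off

def worker() -> None:
    while True:
        job = jobs.get()
        print(f"processing {job['order_id']}")  # the slow part
        jobs.task_done()

threading.Thread(target=worker, daemon=True).start()
handle_request("order-1")
jobs.join()  # wait for the worker to drain the queue
```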

Communication

Use the best tools to communicate within your team (Slack, Visual Studio Services (project rooms)), as well as between your business and users (Twilio).

Additional services

Azure marketplace. Other services on the internet (e.g. PowerBI).

Compliance, Security

Use a key vault (such as Azure Key Vault) to keep your keys and other secrets safe and secure.
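One concrete habit that follows from this (and from “do not check in secrets” in the list below): code reads secrets from the environment, where they are injected at deploy time, e.g. after being resolved from the vault. A minimal sketch in Python; the variable name is illustrative:

```python
import os

def get_secret(name: str) -> str:
    # Secrets are injected into the environment at deploy time (e.g.
    # resolved from a key vault); they never live in source control.
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name} is not configured")
    return value

db_conn = get_secret("DB_CONNECTION_STRING")  # illustrative name
```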

Starting a new application? Here are the (generalized) best practices to allow you to scale to 1 billion users!

Agile business and agile software.
Do not reinvent the wheel.
Independent component releases.
Devops.
Automate everything.
Version control your infrastructure.
Do not check in “secrets”.
Collect everything.
Analyze everything.
Graceful degradation.
Multiple regions.
CDN and caching.
Microservices and/or containers.
Encrypt everything.
Administration through VPN only.
Stop unknown persons on company premises.
Mind the service level agreements.

Are you still here?

You can reach me on — Twitter | LinkedIn, as well as right here on Medium.

Let me know what you loved, what you hated, what left you unimpressed and what you want to know more about.

Until next time…

~ S

Kovacevic, Sasa
Solution Architect — implementing mobile, cloud and enterprise solutions.

“I’m open to all brief and to-the-point communications, especially those communiques that solicit my playful spirit, love of change and innovation, grand visions and ballsy ideas. I love learning from others and teaching others; writing is my passion and I am not afraid of public speaking.”

--
