The Cloud Landscape

Nikola Toshev
Published in Sciant · Oct 2, 2019

Co-author: Konstantin Vassilev

There is a big shift of IT deployments to the cloud going on. Almost all(!) startups start on the cloud. Established IT companies migrate existing applications to the cloud. And yet clouds as a whole are poorly understood. It doesn’t help that they are complex beasts with countless services. This article aims to give you an overview of the general landscape of cloud services and to point you to where to research for your specific needs.

The cloud can mean several different things:

  • Infrastructure-as-a-Service (IaaS), meaning getting virtualized computational resources under your control
  • Platform-as-a-Service (PaaS), where you only write your application, and you get a managed runtime (such as Heroku or Google App Engine)
  • Software-as-a-Service (SaaS), where you only connect to an application, such as Office 365.

In this article we’re going to talk about IaaS.

Ultimately, clouds can be thought of as computational resources available on demand. These resources can be basic, such as CPU cores, or complex, such as managed databases. They are programmable, which means that your system can manage itself: ask for resources for specific tasks, and deallocate those resources after the tasks are complete.
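
To make this concrete, here is a minimal sketch using boto3, the AWS SDK for Python (the AMI ID is a placeholder and error handling is omitted): allocate a VM for a task, then release it so that billing stops.

```python
# Minimal sketch with boto3, the AWS SDK for Python.
# The AMI ID is a placeholder; error handling is omitted.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Allocate a VM for a specific task...
result = ec2.run_instances(
    ImageId="ami-12345678",   # placeholder image
    InstanceType="t3.micro",
    MinCount=1,
    MaxCount=1,
)
instance_id = result["Instances"][0]["InstanceId"]

# ...do the work, then release the resources so that billing stops.
ec2.terminate_instances(InstanceIds=[instance_id])
```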

Economics is a very important aspect of clouds. If you are computing in the cloud, your efficiency is measured in units (transactions) per dollar. Clouds have a reputation for being expensive, but this is not necessarily true. In practice there is a trade-off between cost and flexibility, and most companies deliberately choose flexibility. They do not put effort into optimizing their bills, and that’s how these bills become huge (see the 2019 tech IPOs: for example, every ride on Lyft costs $0.14 in AWS expenses).

Players

The three major cloud providers are, in order of popularity:

  • Amazon Web Services (AWS)
  • Microsoft Azure
  • Google Cloud Platform (GCP)

There are other clouds that want to compete with the big three, such as Alibaba, IBM, and Oracle. A wide range of hosting providers like Hetzner or DigitalOcean have also started adding cloud features.

Each provider has a wide range of services. Services overlap greatly between providers, but naming is inconsistent. You can find rough mappings between the names of the same or similar services on different clouds in the providers’ own documentation (Azure and GCP, for example, both publish comparison guides mapping their services to AWS).

Elements of Clouds

The basic elements of clouds are:

  • Virtual machines (VMs)
  • Unbundled VM features. You can get network block storage devices to use instead of attached disks, VPNs instead of physical networks, or CPU+RAM bundles without persistent storage in the case of serverless functions (more on this later).
  • Kubernetes is becoming a common abstraction layer across public clouds and on-premises deployments, functioning like a cloud OS. It is built on containers, which can in turn be hosted in VMs or on bare metal. Managed Kubernetes is free on most clouds (you only pay for the resources the containers use). It is widely recommended that you build new, sufficiently complex projects on Docker and Kubernetes instead of raw VMs.
  • Hosted services. These are sometimes well-known open source projects hosted and operated by the cloud provider for you; more often the software itself is proprietary (sometimes a fork of open source software). To use these, you often need to pay for full-time dedicated VMs, even if your usage is minimal (e.g. a hosted Postgres database), which makes them unsuitable if your usage is low. More advanced offerings that are priced by the units consumed are called “serverless”. For example, services like S3 object storage or Amazon DynamoDB are charged by the number of requests and the amount of data you read/write, which is much more granular billing.

A cloud deployment can be managed in several ways:

  • Web interface
  • Command-line interface (CLI)
  • API

All providers support all three (the CLI is arguably trivial given an API). An API is absolutely necessary in order to automate infrastructure. You should find yourself using the web interface only initially, when you are trying out different tasks manually, and automating them shortly afterwards. Your entire cloud environment should be reproducible at the click of a button.

VMs and associated services

Clouds are more than VMs, but they start with VMs. The difference between on-premises and cloud VMs is mostly fast and unlimited scalability. When developing cloud-native applications, you can scale up (get a bigger machine) or scale out (get more of the same machines) within minutes. This means you should develop and test on the lowest-spec VMs possible, benchmark different instances, and deploy full-sized VMs. In traditional data centers there are weeks between ordering new machines and actually getting them, so you need to overprovision a lot. Cloud deployments are flexible.

VMs may or may not have local disks attached. Network disks are used more often than local disks: they are mounted just like hard disks but reside on the network, and they are still fast and more reliable than local disks.

Stopping a VM with local disks does not stop the platform from charging you, as you still have not released the hardware for use by others. You have to terminate the VM, losing the contents of the local disks.

VM types

Clouds have predefined VM types with a specific balance between CPU, RAM and local disks. Here is a table with standard general-purpose (m5) instance types on AWS (on-demand, us-east-1, 2019 prices):

  m5.large     2 vCPU     8 GB RAM    $0.096/hour
  m5.xlarge    4 vCPU    16 GB RAM    $0.192/hour
  m5.2xlarge   8 vCPU    32 GB RAM    $0.384/hour
  m5.4xlarge  16 vCPU    64 GB RAM    $0.768/hour

As you can see, doubling the resources doubles the price, so vertical scalability is a cost-effective option up to the largest available instances. The balance between CPU and RAM is fixed within an instance type, but there are instance families with different resource ratios: on AWS, for example, c5 (compute optimized) gives 2 GB of RAM per vCPU, m5 (general purpose) 4 GB, and r5 (memory optimized) 8 GB.

If you are optimizing cost, choose a resource ratio suitable for your workload. GCP even allows custom resource ratios, within limits.

Block storage

Some VMs come with local disks that are fast but ephemeral (their contents are lost when you terminate/stop paying for the VM). Usually you start with cloud-provided block devices for convenience and may switch to local disks as an optimization.

These block drives look like local disks but are mounted over the network, so accessing them involves higher latency and network bandwidth usage. They have higher durability and availability than regular drives: they are backed by replicated storage, so you’re unlikely to lose them to a hardware failure (concrete guarantees vary, e.g. AWS EBS volumes are designed for a 0.1–0.2% annual failure rate). Their performance is also unlike a regular drive: the more storage you provision, in GB, the better the performance you get, in terms of both bandwidth and IOPS (“Persistent disk performance is predictable and scales linearly with provisioned capacity until the limits for an instance’s provisioned vCPUs are reached”). This is because you are using a shared device behind the scenes.

Block storage provides better management capabilities in the form of snapshots that can be taken and restored. While you can take a point-in-time snapshot of a running instance, the consistency of the data on disk depends on the software running on that instance and usually requires stopping all writes or unmounting the disk.
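
For illustration, here is a minimal boto3 sketch of taking a snapshot and later restoring it into a new volume (the volume ID and availability zone are placeholders):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Point-in-time snapshot of a network block device (placeholder ID).
# For consistent data, stop writes or unmount the disk first.
snapshot = ec2.create_snapshot(
    VolumeId="vol-0123456789abcdef0",
    Description="nightly backup",
)

# Restoring means creating a fresh volume from the snapshot.
ec2.create_volume(
    SnapshotId=snapshot["SnapshotId"],
    AvailabilityZone="us-east-1a",
)
```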

Networking

Clouds are very flexible in defining network options for their instances. A typical setup has user-facing load balancers (or VMs) with public IPs, and all VMs of a given deployment talking to each other in a private network behind the scenes (Virtual Private Cloud (VPC) on AWS and GCP, Virtual Network on Azure, private network on DigitalOcean, just Network on Hetzner). The development/ops team can then connect to the VPN, then to individual instances, and perform management activities. For security reasons, do not allow your database or other non-public-facing VMs to get public IPs!

Provisioning VMs

The recommended approach to provisioning VMs with your specific application is to do it from a fully prepared image, produced by the build process. Use Packer to create these VM images from configuration file descriptions across virtualization providers. An alternative approach is to install the application on a newly started instance with Ansible or Chef. This is usually not recommended, because it means instances take longer to start and scale.

Terraform has emerged as an important industry standard for provisioning across cloud providers. It deploys not just the VMs but also all kinds of accompanying services, like DynamoDB or DNS hosted by Namecheap. Avoid provider-specific tools like AWS CloudFormation or Azure Resource Manager in favor of Terraform. Terraform configuration files are not reusable across cloud providers, but the tool itself is.

Terraform enables the practice of immutable infrastructure. The term refers to the recommended practice of not modifying your servers via ssh — instead, if you need to make a change, deploy a new VM/container and destroy the old one to avoid configuration drift: the small differences in the configuration of deployed infrastructure that accumulate over time and cause trouble.

Scaling

Most differences in developing applications for the cloud stem from the ability to rapidly start and stop using resources, and to pay only for the duration of their usage. Many cloud customers do this and their cumulative usage averages out, hence the economies of scale that cloud providers achieve.

The resource requirements of typical services grow gradually over time. Traditional provisioning requires you to provide enough resources for peak usage; to do this you need to estimate that peak and add a healthy safety margin on top. On the cloud you start using resources gradually and pay for what you use. You still need to estimate peak and typical usage under load in order to estimate the cost, but you don’t pay each month for the full capacity.

To do automated scaling, you need to define a metric which determines how many VMs of a particular type to utilize. This is often harder than it seems. You could use a business metric (e.g. transactions per second per instance) or a resource metric (CPU utilization). Resource metrics are easier, but business metrics avoid trouble when resource usage profile changes. For example, the workload might shift from CPU-bound to I/O-bound. Business metrics are more resilient against such changes.

The cloud provider will spin up more instances to keep the desired metric within specified boundaries when the load increases, and shut down instances when they are underutilized. Note that this is not an instant correction: deploying and decommissioning instances take time on the order of minutes, so you need a safety margin, or your service may become unavailable or degraded for some customers.
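
For example, on AWS a target-tracking policy delegates this control loop to the provider. Here is a minimal boto3 sketch (the group name is a placeholder) that keeps average CPU utilization around 50%:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Ask the provider to add/remove instances in the "web-workers" group
# (a placeholder name) to keep average CPU utilization near 50%.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="web-workers",
    PolicyName="keep-cpu-at-50",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 50.0,
    },
)
```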

Functions-as-a-Service (serverless)

Functions as a service allow you to run computational tasks without keeping a VM running. These are paid in GB-seconds for the running time of the function; AWS Lambda, for example, charges $0.0000166667 per GB-second plus $0.20 per million requests (2019 prices).

CPU and other resources scale proportionally to the RAM; for example, on AWS a function with 1,792 MB of RAM gets exactly 1 vCPU. FaaS are implemented on top of containers, which determines a number of their characteristics:

  • The cloud provider defines several runtimes you can use in FaaS — usually at least node.js, Python, and Java. These are the stacks for which the provider has implemented container packaging. Currently only AWS allows custom runtimes, where you provide the packaging.
  • You pay nothing for a FaaS that is not in use, and you get practically infinite scalability (a large number of functions can run simultaneously).
  • These functions don’t share any state in RAM. Any shared state must be stored in a separate database. This could be object storage like S3 or a serverless database like Aurora, Azure CosmosDB or Google Cloud Firestore — a VM-based database is also fine, of course, but you will be paying the baseline cost of a full-time VM running the DB.
  • Starting a function for the first time involves higher latency (10–200 ms for the container, and potentially JVM/CLR startup if your code runs on them). Your function will be cached on a specific machine for subsequent runs. If you need to scale up and run a second function in parallel, you will incur the cold start latency again.
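
As an illustration of these characteristics, here is a minimal sketch of a Python AWS Lambda handler that keeps its shared state in DynamoDB (the "visit-counts" table name and the event shape are assumptions made for this example):

```python
# handler.py: AWS Lambda calls handler() on each invocation.
import boto3

# Created at cold start; may be reused on warm starts, but never
# rely on in-memory state surviving between invocations.
table = boto3.resource("dynamodb").Table("visit-counts")

def handler(event, context):
    page = event.get("page", "/")
    # All shared state lives in the external database.
    table.update_item(
        Key={"page": page},
        UpdateExpression="ADD visits :one",
        ExpressionAttributeValues={":one": 1},
    )
    return {"statusCode": 200}
```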

Understanding the cost of FaaS can be tricky. People expect it to magically cost less than VMs, which is not true:

  1. The cost depends on the amount of usage. Rarely running Lambdas cost much less than a full-time VM with the same CPU/RAM resources; FaaS that run full time cost more than a VM with equivalent CPU/RAM (see the back-of-the-envelope numbers after this list).
  2. I/O-bound workloads are not cost-effective in FaaS. If you run a web crawler, with each page fetch being a single Lambda call, the cost would be astronomical compared to the same job running in a VM. This is because each Lambda function will spend most of its time waiting for a response, and you pay for the time each Lambda waits. A VM will wait for multiple responses simultaneously, allocating and paying for CPU/RAM just once for all calls that are waiting.
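
Some back-of-the-envelope arithmetic makes the trade-off visible. The prices below are 2019 us-east-1 list prices, used here purely for illustration:

```python
# Back-of-the-envelope only; 2019 us-east-1 list prices.
LAMBDA_GB_SECOND = 0.0000166667        # USD per GB-second
SECONDS_PER_MONTH = 30 * 24 * 3600

# A 1792 MB (~1 vCPU) function running 100% of the time:
full_time = 1.75 * SECONDS_PER_MONTH * LAMBDA_GB_SECOND
print(f"full-time Lambda: ${full_time:.2f}/month")    # ~$75.60

# An on-demand m5.large VM (2 vCPU, 8 GB) at ~$0.096/hour:
vm = 0.096 * 24 * 30
print(f"m5.large VM:      ${vm:.2f}/month")           # ~$69.12

# The same function invoked 100,000 times/month for 200 ms each:
rare = 1.75 * 100_000 * 0.2 * LAMBDA_GB_SECOND
print(f"rare Lambda:      ${rare:.2f}/month")         # ~$0.58
```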

Managed services

Clouds claim to mostly eliminate the need for operations teams that administer your running software. When you are starting out and have low volumes, it makes a lot of sense to use the pay-per-use managed services. Cost optimization can come later.

When Amazon has a team managing, say, MariaDB databases, they build expertise in the general use cases and the whole range of availability problems that may arise. If your workload is an edge case, don’t expect your database to be performance-optimized for it.

The big cloud providers have marketplaces where you can buy managed services delivered by companies other than the cloud provider itself. For example, this can be an Elasticsearch cluster managed by Elastic, the company behind Elasticsearch. You can also rent an Elasticsearch cluster managed for you by AWS. People generally prefer a running service over installing and running software themselves, and default to the cloud provider’s offering because that’s easier. This has caused several open source companies to alter their licenses in an attempt to restrict AWS and other cloud providers from offering a managed service based on their open source code (e.g. Redis Labs, Elastic, MongoDB, Confluent). This strategy has generally not been successful, as cloud providers can and do still offer services based on forks of versions from before the license change.

Relational and open source databases

Cloud providers offer managed versions of popular databases. These can be open source, like Postgres and MySQL, or commercial, like MS SQL Server and Oracle. You can also get MongoDB-compatible and Redis-compatible solutions (see the licensing debate above). The official MongoDB and Redis are available in the marketplaces from their developers as well. In all cases, the offerings are typically highly reliable replicated configurations with managed backups (the price reflects running multiple VMs for replication and availability).

For relational databases you are still responsible for filling some of the DBA role (performance monitoring and index creation, for example).

Cloud-native databases

Google, Amazon and Microsoft have all built large-scale databases themselves and offer them in the cloud. When selecting a database, the usual complex trade-offs are involved. On GCP alone, the choices range from Cloud SQL (managed relational) through Cloud Spanner (globally replicated relational) and Firestore (document) to Bigtable (wide column) and BigQuery (analytics).

Google is a pioneer in big data management and processing systems. A big chunk of the industry is based on open source implementations of its research papers. Many of the original systems are now available only for rent on Google’s cloud.

Here are a few of the choices from the other cloud providers that are less known from the pre-cloud world:

  • Amazon DynamoDB (a classic eventual-consistency database: simple, low latency, and scalable, but not strongly consistent by default and without support for JOINs)
  • Amazon Aurora (a modern scalable engine with highly replicated storage and compatibility layers with Postgres/MySQL)
  • Azure CosmosDB (a modern general purpose cloud database by Microsoft)
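
As a small illustration of the key-value model, here is a hedged boto3 sketch against a hypothetical "orders" table in DynamoDB. Because there are no JOINs, related data is denormalized into the item itself:

```python
import boto3

# Hypothetical "orders" table; key-value access only, no JOINs, so
# related data is denormalized into the item itself.
table = boto3.resource("dynamodb").Table("orders")

table.put_item(Item={
    "order_id": "o-1001",
    "user_id": "u-42",
    "items": [{"sku": "sku-7", "qty": 2}],
})

response = table.get_item(Key={"order_id": "o-1001"})
print(response["Item"])
```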

File/object storage and CDN

Cloud providers offer distributed replicated storage of objects: AWS Simple Storage Service (S3), Azure Blob Storage, GCP Cloud Storage, DigitalOcean Spaces, and storage-only providers like Backblaze B2. The sweet spot for these services is storing files bigger than a megabyte. They are accessed via REST APIs and can be directly referenced in a web page (static assets of a site are routinely stored in S3 and referenced directly, making S3 operate as serverless web hosting). As an extension of this use case, cloud providers integrate Content Delivery Networks (CDNs) that automatically cache the S3 objects close to the end user and facilitate fast loading in the browser — e.g. AWS CloudFront.
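
For example, here is a minimal sketch of the static hosting use case with boto3 (the bucket name is a placeholder, and the bucket must be configured to allow public reads):

```python
import boto3

s3 = boto3.client("s3")

# Upload a static page to a (placeholder) bucket, then reference it
# directly from the browser: S3 acting as serverless web hosting.
s3.upload_file(
    "index.html", "my-site-bucket", "index.html",
    ExtraArgs={"ContentType": "text/html", "ACL": "public-read"},
)
# The object is then addressable at, e.g.:
# https://my-site-bucket.s3.amazonaws.com/index.html
```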

Alternative use cases are:

  • Archival/backup storage. Object stores provide options to balance the replication factor against cost, or access readiness against cost (nearline storage is for rarely accessed files; cold storage is for files that once written are almost never accessed, such as backups). Versioning and fine-grained access control are available. For example, it is possible to create an S3 bucket whose objects transition to Glacier (the cold storage tier) after 60 days and expire after 365 days (a sketch of such a lifecycle configuration follows this list). The backup process uploading the data could have permissions only to create new versions of objects, but not to touch older versions, removing the risk of an attacker compromising old backups.
  • Raw data storage. S3-like services are often used to store raw unprocessed data, so that processing can be run reliably and repeatably on them. Some cloud services provide capabilities to process data in object storage directly, for example Amazon Athena or Google BigQuery.
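
Here is a minimal sketch of the lifecycle policy described above, using boto3 (the bucket name is a placeholder):

```python
import boto3

s3 = boto3.client("s3")

# Objects move to Glacier after 60 days and are deleted after 365.
s3.put_bucket_lifecycle_configuration(
    Bucket="backups-bucket",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "archive-then-expire",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},   # apply to all objects
            "Transitions": [{"Days": 60, "StorageClass": "GLACIER"}],
            "Expiration": {"Days": 365},
        }]
    },
)
```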

Cloudflare, which is primarily a CDN provider, offers interesting FaaS functionality in addition to its main service. It has fast and cheap service workers written in JavaScript, and a key-value store, deployed on its edge servers — meaning they are accessible with low latency from your users around the world, which makes them good for use cases like API management, A/B testing, etc.

Logging and Monitoring

Cloud applications are often a combination of multiple VMs/containers, data stores and other managed services. With more complex architectures it becomes impractical to ssh/remote into individual machines or services to figure out if something is wrong. Diagnosing system-wide problems by looking at logs and metrics spread across multiple machines or services is often very time-consuming.

Two ideas make monitoring and finding bugs easier: Centralized Logging and Structured Logging.

Centralized Logging

Having all your application logs being sent to the same system helps a lot with understanding how your application behaves, identifying bottlenecks and finding bugs. Usually a centralized logging solution consists of three parts:

  1. Ingestion layer — takes care of receiving and buffering the logs, maintaining order and sequence numbers.
  2. Storage — persistent and durable storage of log data.
  3. Visualization and analytics — built on the log storage, allowing visualization, filtering and searching of logs, as well as creating graphs based on log data.

One of the most popular stacks for centralized logging is the so-called ELK stack — consisting of Logstash (ingestion), Elasticsearch (storage and search) and Kibana (visualization).

Cloud providers also have built-in ingestion and storage layers that are already integrated with their other managed services: CloudWatch on AWS, Stackdriver on GCP, and Azure Monitor on Azure.

So, for example, if you use an AWS RDS database, its logs and metrics automatically land in CloudWatch.

Structured Logging

With many services and VMs logging to a centralized location, making sense of the logs becomes a challenge if they are simply plain text strings. Structured logging is designed to address that by adding structure to log messages, facilitating easier filtering and analytics based on log data.

The JSON format is commonly (but not exclusively) used for structured logging, so we will use it as an example. Here is a minimal Python sketch that emits JSON log messages (the field names are illustrative):
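
```python
import json, logging, sys, time

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Serialize each log record as one JSON object per line.
        return json.dumps({
            "timestamp": int(time.time() * 1000),
            "level": record.levelname,
            "service": "checkout",                       # illustrative
            "endpoint": getattr(record, "endpoint", None),
            "status": getattr(record, "status", None),
            "user_id": getattr(record, "user_id", None),
            "message": record.getMessage(),
        })

handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
log = logging.getLogger("app")
log.addHandler(handler)
log.setLevel(logging.INFO)

log.error("payment failed", extra={
    "endpoint": "/api/checkout", "status": 500, "user_id": "u-42",
})
# {"timestamp": 1570000000000, "level": "ERROR", "service": "checkout",
#  "endpoint": "/api/checkout", "status": 500, "user_id": "u-42",
#  "message": "payment failed"}
```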

When structured log data is indexed (e.g. in Elasticsearch), it becomes very easy to run queries against the log data, such as:

  1. Give me all logs for user X in the last 12 hours
  2. Give me all logs with status 500 on a given endpoint
  3. Give me all ERR logs from a certain region or service
  4. Show me the percentage of 500 status codes over 200 status codes

Adding centralized and structured logging to your app will help greatly with finding problems long before the users do. Good structured logs can in many cases also be used to provide support to users, if support staff have access and are trained to use the analytics features of the logging system.

Regions and Availability zones

All cloud providers have data centers in several geographic regions across the world. Use the one closest to you for reduced network latency, or select a location for legislative reasons.

Cloud providers have multiple availability zones in a single region. Theoretically, system failures in one availability zone should not impact other availability zones in the region; this is not always true, but it holds in most cases. You may choose to have replicas of your VMs in different zones, behind a load balancer. This increases availability; however, you will be charged for network traffic between the zones, and requests will have higher latency.

Cloud costs

Cloud costs are often non-obvious and therefore need to be monitored. The most important thing is to define a budget with your expected costs, along with alerts for whenever those costs are about to be surpassed.
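
On AWS, budgets and alerts can themselves be automated. A minimal boto3 sketch (the account ID and email address are placeholders):

```python
import boto3

budgets = boto3.client("budgets")

# Email an alert when actual spend crosses 80% of a $1,000 monthly
# budget. Account ID and address are placeholders.
budgets.create_budget(
    AccountId="123456789012",
    Budget={
        "BudgetName": "monthly-cloud-budget",
        "BudgetLimit": {"Amount": "1000", "Unit": "USD"},
        "TimeUnit": "MONTHLY",
        "BudgetType": "COST",
    },
    NotificationsWithSubscribers=[{
        "Notification": {
            "NotificationType": "ACTUAL",
            "ComparisonOperator": "GREATER_THAN",
            "Threshold": 80.0,
            "ThresholdType": "PERCENTAGE",
        },
        "Subscribers": [{
            "SubscriptionType": "EMAIL",
            "Address": "dev-team@example.com",
        }],
    }],
)
```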

The default cloud configurations tend to make running costs opaque to everyone except the account owners, while you generally want cost to be transparent to the people doing the actual resource allocation. Configure your accounts so that developers can see the cost of the services they consume.

Separate different projects into different accounts using AWS Organizations, and set different budgets and warnings for each.

The normal price for VMs is “on demand” — instances you can start and stop at any time. Cloud providers give you a way to get a discount if you run an instance full time: you can use reserved instances, or, in the case of GCP, an automatic sustained usage discount (though you can still reserve, and that gives a better price). Cloud providers also offer their spare VMs cheaply (e.g. at 20% of the cost of on-demand VMs), on the condition that they may be terminated at any time: when there is high demand for the compute capacity you are using, your machine will be stopped by the cloud provider. These machines are tricky to use, and it makes sense to look into them only after you have started to use significant resources. Companies use them for large batch computations with flexible deadlines, or in a mix with on-demand instances in a Kubernetes cluster (where Kubernetes compensates for the killed VMs by automatically requesting on-demand instances).
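
For illustration, on AWS requesting a spot instance differs from an on-demand request by only one extra parameter. A minimal boto3 sketch (the AMI ID is a placeholder):

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Same call as an on-demand instance, plus the spot market option.
# The instance may be reclaimed by the provider at any time when
# demand for this capacity is high.
ec2.run_instances(
    ImageId="ami-12345678",   # placeholder image
    InstanceType="c5.large",
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {"SpotInstanceType": "one-time"},
    },
)
```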

Keep in mind that in the cloud you pay for allocated resources even if you’re not using them. This can be something relatively obvious, like a VM with local storage that is stopped but not terminated (therefore preserving the local storage), or more obscure, like reserved DynamoDB read/write capacity that you’re not actually using. Use automatic deployment with Terraform to set up and tear down your entire cluster, as recommended above — it is much harder to leave unused but allocated resources around if you start and stop the whole cluster at once. If you do forget resources and get a huge AWS bill, contact their support — AWS is known to waive such bills.

Conclusion

Infrastructure as a service is the modern way to build services from scratch and to run some of the older services. You can gain a lot from the cloud, but it also requires a bit of a shift in thinking.
