The Road to Cloud Native: Best Practices for Designing and Building Cloud Native Applications

DevOps changed the way we develop, build, deploy, secure, and monitor software. However, there is no magic wand that solves all modern IT problems. There is also no single way to approach DevOps and implement it within an organization: it depends on many factors, such as legacy systems (a burden for most medium and large companies), business strategy, business priorities, and culture.

The Power of IaC

Continuous development, testing, integration, and delivery are certainly among the important pillars of DevOps, but if you want to create a healthy DevOps strategy, one of the necessary and vital things to do is to use the power of Infrastructure as Code (IaC) and cloud computing.

In other words, becoming “Cloud Native”. This is what we are going to see in the following analysis.

IaC is simply managing infrastructure (servers, networks, and other resources) using code instead of manual processes.

IaC is not cloud computing; the two concepts have different meanings, even if they are deeply linked.

The elasticity of cloud resources and the disposability of cloud machines are what make IaC meaningful. With a single press of a button, you can create hundreds of provisioned machines; the same task would have required a team of system administrators only a few years ago.

Using the cloud does not necessarily mean using AWS, GCP, or any other cloud provider: you can build your own cloud, or you can simply use one of these providers.

If we look into the simplest concepts in DevOps like continuous integration and delivery, we automatically think of delivering small chunks of software to be built, tested and deployed.

Tests are run against an evolved version of the software that is not really very different from the previous one: “the diff is not huge”.

Some companies with advanced levels of technical maturity tend to deploy hundreds of times a week.

GitHub deploys dozens of times a day.

SlideShare deploys up to 20 times a day.

Other companies, like Amazon, make more than 20k deployments per day (source: ITRevolution.com).

That means a new deployment roughly every 4 seconds.

Don’t be upset when you compare your company’s deployment frequency to the numbers above; it is not about maximizing the number of daily deployments, and it’s not a race.

Whether you deploy 10 times a day or 100 times a second, cloud computing is a must because it can be driven entirely by code.

It allows you to “rent” the compute, storage, networking, and other resources needed to run ephemeral testing and staging environments that disappear when you no longer need them. You can interface with your cloud provider using your favorite language (or DSL).
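As a hedged illustration of what “interfacing with your cloud provider using code” can look like, here is a minimal Python sketch using boto3 (the AMI ID, region, and tags are placeholders, and it assumes AWS credentials are already configured):

```python
# A minimal IaC-style provisioning sketch with boto3.
import boto3

ec2 = boto3.resource("ec2", region_name="eu-west-1")

# Spin up an ephemeral test environment: a handful of identical machines.
instances = ec2.create_instances(
    ImageId="ami-00000000000000000",   # placeholder AMI
    InstanceType="t3.micro",
    MinCount=3,
    MaxCount=3,
    TagSpecifications=[{
        "ResourceType": "instance",
        "Tags": [{"Key": "environment", "Value": "ephemeral-staging"}],
    }],
)

# ... run your tests against these machines ...

# Dispose of the environment when you no longer need it.
for instance in instances:
    instance.terminate()
```

The same few lines can be run from a CI pipeline to create a short-lived test environment and destroy it when the job is done.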

Welcome to Fight Club

The Things You Own End up Owning You ~ Tyler Durden

I don’t know if you have watched Fight Club, but this quote is one of the things that struck me in that great movie.

We are not talking about “having versus being” or seeking the paths of wisdom here, but if you apply the same quote to your on-premise infrastructure, you will understand that

The Server You Own Ends up Owning You

The advantage of cloud computing is that we never own the machines; we provision resources on demand.

The 6 R’s of Cloud Migration

If your company is already running its workloads in the cloud, make sure you are “Cloud Native” so that you benefit from the full power of the cloud. If, like most medium and large companies, you are still in the migration phase, the preliminary step is identifying your migration strategy.

In 2011, Gartner identified 5 migration strategies; AWS later extended the list to 6:

  1. Rehost
  2. Replatform
  3. Repurchase
  4. Refactor
  5. Retire
  6. Retain

Rehosting (Lift and Shift)

Rehosting is the redeployment of applications and data to the cloud without modification (IaaS). Re-architecting the application is the next step here.

Your organization will develop better skills in using your cloud provider and will be able to re-architect your application more easily later; however, you will not be able to leverage the full power of the cloud.

Replatforming (Optimize, Lift and Shift)

Replatforming is moving to the cloud with a small amount of up-versioning. It is done by upgrading your existing on-premise applications and implementing cloud paradigms and best practices such as multi-tenancy, auto-scaling, and the twelve-factor app methodology.

Replatforming is faster than refactoring; it allows you to take immediate, but modest, advantage of the cloud.

Repurchasing

Repurchasing is replacing the legacy on-premise application with a commercial SaaS platform.

This solution reduces the development efforts but has some disadvantages like vendor lock-in and interoperability issues.

Refactoring

This is when you re-architect your legacy applications and replace them with modern cloud-native apps.

Refactoring allows companies to benefit from the full power of the cloud but this approach is riskier and requires more knowledge and work.

Retiring

By looking at the whole IT system of an organization, we usually realize that some applications are either no longer used or of low business value.

Retiring is phasing out these applications.

Retaining

In some cases, moving to the cloud is not the best solution to your problem. Not moving to the cloud and retaining is also a strategy.

DevOps Needs Standards. Standardize Before You Automate

The main role of a “DevOps engineer” or a “DevOps team” is implementing DevOps within an organization until it becomes second nature.

Since DevOps is also a philosophy, implementations differ depending on one’s understanding and interpretation.

When implementing a DevOps strategy, we usually think about automation; however, we might forget that automating a broken process only multiplies its faults. For that very reason, thinking about the process should come before any implementation.

To find the best process, you must identify the development, production, and deployment bottlenecks and address them at the start of your process.

On the other hand, because DevOps is not a framework or a set of instructions to follow, and because keeping things at a purely philosophical level means everyone will have a different opinion and implementations will diverge in several directions, we need standards.

There have been several initiatives to create standardized technical approaches, such as Webscale, the Twelve-Factor App, the 13 Factor App, the Reactive Manifesto, and the work done by the OCI and the CNCF.
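To take one small, concrete example of such a standard: factor III of the Twelve-Factor App says configuration belongs in the environment rather than in the code. A minimal sketch, with purely illustrative variable names:

```python
# Twelve-Factor style configuration (factor III): read settings from the
# environment instead of hard-coding them. Variable names are illustrative.
import os

DATABASE_URL = os.environ["DATABASE_URL"]        # fail fast if it is missing
BROKER_URL = os.environ.get("BROKER_URL", "")    # optional dependency
DEBUG = os.environ.get("DEBUG", "false").lower() == "true"
```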

The Status Quo. Deprogramming Programmers

Driven by passion, some developers and engineers think of code as the goal, however, it is not.

The code is the tool and the goal is solving business and real-world problems.

Unless you are coding a side project for fun or for learning purposes, coding should not be treated as a goal.

Adding more code is always adding entropy to your code, which means more tests, more maintenance, and most probably more bugs, so think before you code.

Code Less, Think More

As soon as there is a new project or a new problem to solve, some will jump in, create a REST API or a CRUD application, and deploy it to production; then the first incident happens and the product’s limitations start to appear: it is not scalable, it is vulnerable, everybody thought about deploying and nobody thought about rolling back, etc.

Sometimes, when you question traditional thinking and revise old processes, you will see positive change.

Only Entropy Comes Easy

In this universe, there are laws that can be noticed everywhere … like entropy.

Entropy is “the measurement of disorder”.

The entropy of the universe increases with time: the more time passes, the more disorder there is compared to the initial state.

The heat of a coffee left on the table dissipates into the cup and the surrounding air: we say that the entropy of the system increases.

There are plenty of manifestations of entropy around us, but what matters here is entropy in software and IT systems.

Software entropy refers to the tendency for software, over time, to become difficult and costly to maintain. A software system that undergoes continuous change, such as having new functionality added to its original design, will eventually become more complex and can become disorganized as it grows, losing its original design structure.
In theory, it may be better to redesign the software in order to support the changes rather than building on the existing program, but redesigning the software is more work because redesigning the existing software will introduce new bugs and problems. ~ source: Webopedia

We can outline two important concepts here:

  1. Any system/software that is used will be modified
  2. When a codebase is modified, its complexity will increase

Immutability and self-healing platforms are among the new ways of thinking about and designing applications that address this problem. These paradigms have shown their strength in creating stable and scalable applications at a faster pace.

Immutable: Cloud virtual machines and/or containers are never changed after deployments. When there is a new deployment, everything is re-created.
Self-healing: The system can identify its problems and then resolve them, usually by returning to a known initial state.
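As a conceptual sketch only (the helper functions are hypothetical placeholders, not a real orchestrator API), a self-healing loop built on immutable instances could look like this:

```python
# A conceptual self-healing loop, similar in spirit to an orchestrator's
# controller. is_healthy(), destroy() and recreate_from_image() are
# hypothetical placeholders, not a real API.
import time

def reconcile(instances, desired_count, is_healthy, destroy, recreate_from_image):
    while True:
        # Immutability: unhealthy instances are removed, never patched in place.
        for instance in [i for i in instances if not is_healthy(i)]:
            destroy(instance)
            instances.remove(instance)

        # Self-healing: recreate from the known-good image to return to the
        # initial state until the desired count is reached again.
        while len(instances) < desired_count:
            instances.append(recreate_from_image())

        time.sleep(10)
```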

Microservices, containers, and orchestration are the technical tools to implement these paradigms.

They also have other advantages: for example, containers coupled with microservices allow deploying a single service instead of redeploying the whole monolithic application, and container orchestration frameworks allow autoscaling a single microservice under load instead of scaling the whole monolith.

When we approach containers, microservices, and orchestration, there is always an important aspect common to all three concepts: event-driven approaches.

Microservices, containers, and orchestration react to change and act according to events.

The Reactive Manifesto

Cloud native is reactive; that’s why we will look at what the Reactive Manifesto is.

The philosophy of the “Reactive Manifesto” can be summarized as follows:

The goal of the “Reactive Manifesto” is creating responsive applications.

Regardless of the conditions, a responsive application must always have the same behavior.

To achieve this responsiveness, the application must be elastic and resilient.

Without elasticity and resilience, we cannot speak of a responsive application.

Finally, building message-driven applications is the key to elasticity and resiliency.

Data Must Flow

There is something at the center of everything we develop: data. There is always data input or output.

Databases went a long way toward solving the data problem, but they created a lot of other problems, in particular around performance and scalability.

To optimize our use of databases, we tried several solutions, such as code optimization, which is limited.

We tried caching techniques, which are also limited because we only ever cache a small part of the whole data; this falls short when data demand is dynamic.

We tried materialized views to solve the caching problems, but they add load to the database.

We tried to use cloud databases, and they are expensive.

We tried vertical scaling and replication, and they are also expensive and have some performance limits.

We tried NoSQL databases which are not suitable for all scenarios and use cases. They may also be expensive.

Message-driven applications solve many of the above problems. In short, they are based on message exchanges. To dive into this concept, let’s get back to the history of databases.

If we take one of the most used databases, MySQL (or an alternative technology like MariaDB), we notice that the transaction log is the heart of the database.

It records all the events that change the state of the data: delete, create, update, insert, etc.

The database tables hold the current-state representation of these events, replication is a replay of the transaction log, and even the cache layer is a partial copy of this data.

The transaction log can be seen as a proof of concept of a message-driven system.
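A small sketch makes this concrete: a table’s rows are just the current-state projection obtained by replaying the log (the event format below is purely illustrative):

```python
# Replaying a transaction-log-style series of events rebuilds the current
# state, which is exactly what tables, replicas, and caches hold.
events = [
    {"op": "insert", "id": 1, "row": {"name": "Jane"}},
    {"op": "insert", "id": 2, "row": {"name": "John"}},
    {"op": "update", "id": 1, "row": {"name": "Jane D."}},
    {"op": "delete", "id": 2},
]

table = {}  # current state, rebuilt by replaying the log
for event in events:
    if event["op"] in ("insert", "update"):
        table[event["id"]] = event["row"]
    elif event["op"] == "delete":
        table.pop(event["id"], None)

print(table)  # {1: {'name': 'Jane D.'}} -- replication replays the same log
```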

If this log’s events are streamed by publisher components and consumed by consumer components of the same application, we can create optimized polyglot persistence stores for each data consumer. Each service can use the most suitable storage technology (SQL, NoSQL, SQLite, raw files, a graph database, etc.).

We can also avoid storing unnecessary data and adding load to the central database.

Polyglot Microservices

The good thing about message-driven applications is that they fit the microservices development, production, and data models perfectly.

Let’s take the example of Uber: do you think the same database model and technology are suitable for managing passengers, drivers, and trips alike?

Uber uses Cassandra, MySQL, PostgreSQL, and other home-grown solutions to manage their data, and by using microservices they can choose the right database and data model to manage passengers without impacting how drivers’ and trips’ data is managed.

To maintain the reactivity, and especially the coherence, of a system composed of microservices, one could use transaction logs to inform other services of changes made in a particular service.

What Happens When You Update Your Facebook Profile

Event sourcing is another practice found in applications based on message exchange.

When you update your Facebook profile from your smartphone, there are several building blocks that need to be notified.

The monitoring application needs to know about it to detect fraud.

There is at least one database that needs to store the changes that happened.

The indexing and search datastores (like Solr and ElasticSearch) also need to know about it.

The newsfeed service needs to publish your update so it should be notified.

In short, there are multiple data stores and services that need to be notified about a change in your profile.

Imagine the case where the mobile application must notify all of these applications as soon as you change a single letter in your first name.

The Facebook mobile application would probably take a few minutes to update all of these services directly.

The solution is to write the update into a message and send it once to a streaming system like Kafka. Every other service then consumes the data that interests it.

A change is published once and consumed several times.
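As a hedged sketch of this fan-out with the kafka-python library (it assumes a broker on localhost and an illustrative “profile-updates” topic), the change is published once and every interested service consumes it independently through its own consumer group:

```python
# Publish once, consume many times: each downstream service reads the same
# topic through its own consumer group. Topic and broker are illustrative.
import json
from kafka import KafkaProducer, KafkaConsumer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)
# Published once by the mobile backend.
producer.send("profile-updates", {"user_id": 42, "field": "first_name", "value": "Jane"})
producer.flush()

# Consumed several times: the search indexer, the news feed, fraud detection...
# each one uses a different group_id, so all of them receive every update.
search_indexer = KafkaConsumer(
    "profile-updates",
    bootstrap_servers="localhost:9092",
    group_id="search-indexer",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
for message in search_indexer:
    print("re-indexing user", message.value["user_id"])
```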

Back to The Cloud

We have already seen that one of the advantages of cloud computing is providing the IaC tools and environments that you can use in your cloud-native journey. The other advantage is the reliability of its services: most cloud services have an SLA of 99.99%.

We have also seen how polyglot microservices, reactive programming, and message-based services fit together perfectly like puzzle pieces. However, there are some disadvantages in relying on such architectures.

Everything will be down and dysfunctional when the streaming system is down; this is where cloud computing is a savior.

Since all of your application’s building blocks rely on streaming data, using a highly available streaming system is advisable.


For example, an Amazon Kinesis stream can be used for event sourcing.
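As a minimal, hedged sketch (the stream name and payload are illustrative, and AWS credentials are assumed to be configured), publishing an event to a Kinesis stream with boto3 looks like this:

```python
# Publish a single event to an Amazon Kinesis stream.
import json
import boto3

kinesis = boto3.client("kinesis", region_name="eu-west-1")

kinesis.put_record(
    StreamName="profile-events",   # hypothetical stream name
    Data=json.dumps({"user_id": 42, "event": "profile_updated"}).encode("utf-8"),
    PartitionKey="42",             # keeps a given user's events ordered
)
```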

Architecture Patterns

Let’s Make Cheese Pancakes

John and Jane are preparing some cheese pancakes. These are the steps John followed:

  1. Pour the flour, the eggs, and the milk
  2. Let the pancake batter rest, wait for one hour
  3. Go out and buy cheese
  4. Bake your pancakes
  5. Add cheese
  6. Eat

..and this is what Jane did:

  1. Pour the flour, the eggs, and the milk
  2. Let the pancake batter rest
  3. Go out and buy cheese while the batter rests
  4. Bake your pancakes
  5. Add cheese
  6. Eat

There is a difference between the two ways of getting things done: the second is faster because of its asynchronicity.

To create cloud native applications, we should eliminate all synchronous inter-component communication.

Event streaming helps not only in creating message-driven architectures but also in implementing asynchronicity within your application.
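Jane’s approach translates naturally into asynchronous code. A playful sketch with asyncio (the durations are made up):

```python
# Letting the batter rest and buying cheese happen concurrently, so the total
# preparation time shrinks.
import asyncio

async def let_batter_rest():
    await asyncio.sleep(6)   # stands in for "one hour" of resting
    return "rested batter"

async def buy_cheese():
    await asyncio.sleep(2)   # shopping happens while the batter rests
    return "cheese"

async def make_pancakes():
    batter, cheese = await asyncio.gather(let_batter_rest(), buy_cheese())
    print(f"baking pancakes with {batter}, then adding {cheese}")

asyncio.run(make_pancakes())
```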

Event Streaming

Event streaming is the basic pattern. We always have a producer and a consumer.

The producer publishes messages, and the consumer subscribes to a topic to receive them.

It is advisable to use a fully managed streaming service.

When you leave a message on the streaming system, you do not have to wait for the application that consumes that message. Therefore, the communication is asynchronous.

Event Sourcing

Event sourcing uses the first pattern (event streaming) except that there is a database that plays a role in this model.

When we talk about event sourcing, we mean the process of transforming a series of events from our streaming system into a persistent data store.

There are 2 approaches to doing this.

Event First: The publisher sends its message to the event stream; the consumer reacts to this event and records it in the data store.

Database First: The user first records the state in the database, then the database propagates this change to the rest of the application. This can be done with databases that support change data capture (CDC). Not every database offers it, but cloud databases generally come equipped with this mechanism.

Wikipedia defines the CDC as follows:

In databases, change data capture (CDC) is a set of software design patterns used to determine (and track) the data that has changed so that action can be taken using the changed data.
Also, Change data capture (CDC) is an approach to data integration that is based on the identification, capture, and delivery of the changes made to enterprise data sources.
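Going back to the “event first” flavor, here is a minimal sketch of a consumer that projects incoming events into a persistent store (SQLite is used purely for illustration; the event format is made up):

```python
# "Event first": events arrive from the stream and a consumer persists the
# resulting state into its own data store.
import json
import sqlite3

conn = sqlite3.connect("read_side.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS profiles (user_id INTEGER PRIMARY KEY, first_name TEXT)"
)

def handle_event(raw_event: bytes) -> None:
    """React to one event from the stream and persist the resulting state."""
    event = json.loads(raw_event)
    conn.execute(
        "INSERT INTO profiles (user_id, first_name) VALUES (?, ?) "
        "ON CONFLICT(user_id) DO UPDATE SET first_name = excluded.first_name",
        (event["user_id"], event["first_name"]),
    )
    conn.commit()

# In a real consumer this would be called for every message pulled from the stream.
handle_event(b'{"user_id": 42, "first_name": "Jane"}')
```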

Command Query Responsibility Segregation (CQRS)

CQRS builds on the event sourcing pattern but separates reading from writing.

CQRS separates reading (query) from writing (command).

The publisher records each change, usually as an event. Processing is then performed, often asynchronously, to generate denormalized data models.

The consumer simply queries these models in order to minimize direct querying of the database.

In order to better understand, let’s take the example of a user who, using a web browser, makes a change in the user interface.

Any change produces an event that is sent to the event stream and is consumed by a subscriber that creates a materialized view for the consumer (reading).

In other words, the event is transformed into state in the read database.

The event is also written to the write data store.

The advantage is that we can, for instance, write to a MySQL data store and read from a NoSQL data store.
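A toy, in-memory sketch of the idea (everything here is illustrative; in a real system the projection would usually run asynchronously off the event stream):

```python
# Toy CQRS: commands append events to the write side, a projector builds a
# denormalized read model, and queries only ever touch the read model.
from collections import defaultdict

event_store = []                   # write side: append-only events
read_model = defaultdict(list)     # read side: denormalized, query-friendly

def handle_command(user_id, action):
    """Command: record what happened as an event (write path)."""
    event = {"user_id": user_id, "action": action}
    event_store.append(event)
    project(event)                 # often done asynchronously in practice

def project(event):
    """Update the materialized view consumed by readers."""
    read_model[event["user_id"]].append(event["action"])

def get_user_activity(user_id):
    """Query: served entirely from the read model (read path)."""
    return read_model[user_id]

handle_command(42, "updated first name")
handle_command(42, "uploaded photo")
print(get_user_activity(42))  # ['updated first name', 'uploaded photo']
```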

Data is a Precious Thing and Will Last Longer Than the Systems Themselves.

The title is a quote said by Tim Berners-Lee, the inventor of the World Wide Web.

Indeed, IT systems and architectural evolution show us today that the data didn’t change, but we changed the way we consume and produce it.

(Orchestrated) microservices are the best example of how far software and infrastructure architecture has changed.

Microservices have the advantage of allowing developers to freely choose the database technology to use for a given microservice or a specific operation (e.g. reading vs. writing in CQRS).

This solves many problems related to data like performance, replication, scalability, and complexity.

Choosing the right data store technology is the icing on the cake. With more than 300 database technologies to choose from, this can be intimidating.

So start by picking your selection criteria, such as read/write performance and latency; this will get you closer to the right choice.



We have seen three cloud-native architectural patterns; there are more to explore.

A book that I recommend, if you want to read more about these patterns, is Cloud Native Development Patterns and Best Practices: Practical architectural patterns for building modern, distributed cloud-native systems.

These are some patterns:

  • Cache-Aside
  • Circuit Breaker
  • Claim Check
  • Ambassador
  • Anti-corruption Layer
  • Backends for Frontends
  • Bulkhead
  • Command and Query Responsibility Segregation (CQRS)
  • Compensating Transaction
  • Competing Consumers
  • Compute Resource Consolidation
  • External Configuration Store
  • Federated Identity
  • Gatekeeper
  • Gateway Offloading
  • Gateway Routing
  • Gateway Aggregation
  • Health Endpoint Monitoring
  • Index Table
  • Leader Election
  • Queue-Based Load Leveling
  • Materialized View
  • Pipes and Filters
  • Priority Queue
  • Publisher/Subscriber
  • Retry
  • Scheduler Agent Supervisor
  • Sharding
  • Valet Key
  • Sidecar
  • Static Content Hosting
  • Strangler
  • Throttling

Conway’s Law

Designing cloud-native applications and systems is not only about architecture and technical patterns but also about how processes are implemented and managed.

The challenge of microservices is also human.

In fact, Conway’s Law is among the well-known laws in leadership.

This law states that the architecture of a system will mirror the communication structure of the organization that designs it.

Besides, there are some caricatures of it that can be found on the Internet.


This law can be applied to software architecture to create microservices. The organization of teams and the interaction between them directly affect the architecture of your application.

Communication Gets Terrible as Team Size Grows

If you understand how Conway’s Law may affect your IT architecture, you can imagine how terrible the communication between your microservices will be if the communication within your development teams is also terrible.

Large teams are always a bad idea: they are less autonomous, less innovative, more dependent on other teams, and less transparent, since communication becomes difficult between team members.

However, the problem with large teams is not exactly their size but the number of links between teammates.

To calculate the number of links, or communication channels, Richard Hackman, the author of Leading Teams: Setting the Stage for Great Performances, created this formula:

The number of links = N(N-1)/2

(where N is the number of people)

You can see how the number of links increases rapidly: it reaches 91 channels of communication for a team of only 14 people.

Lines of communication diagram
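A quick check of the formula:

```python
# links = N(N-1)/2 grows quadratically with team size.
def links(n: int) -> int:
    return n * (n - 1) // 2

for n in (2, 5, 9, 14):
    print(n, "people ->", links(n), "communication channels")
# 2 people -> 1, 5 -> 10, 9 -> 36, 14 -> 91
```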

Nine Women Can’t Make a Baby in One Month

This part could also be entitled “Brooks’s Law”.

Fred Brooks in his book The Mythical Man-Month, highlighted a central idea saying that “adding manpower to a late software project makes it later”.

Among the causes of Brooks’s Law are ramp-up time, since it takes time for new people to get up to speed even if they are experts in the technology being used, and the fact that communication becomes more complex when new people are added to a team.

The Two Pizza Team — Jeff Bezos

Jeff Bezos, the founder of Amazon, also agrees with the previous laws and has his own vision of team management.

If two pizzas are not enough to feed your team, you should reduce the number of people. This philosophy is among Amazon’s success factors, particularly on the technical level.


Cross-functional / Feature Teams

A cross-functional team is a group of people with different functional expertise (marketing, operations, development, QA, account managers, etc.) working toward the same goals and projects.

A group of individuals with various backgrounds and expertise is assembled to collaborate better and solve problems faster.

As said in Wikipedia: The growth of self-directed cross-functional teams has influenced decision-making processes and organizational structures. Although management theory likes to propound that every type of organizational structure needs to make strategic, tactical, and operational decisions, new procedures have started to emerge that work best with teams.

In a DevOps context, the dev and ops teams should not live in separate silos; each team should provide support and advice to the other in order to take advantage of everyone’s skills.


According to some management studies, like Peter Drucker’s work on management by objectives in his book The Practice of Management, cross-functional teams are less goal-dominated and less unidirectional, which stimulates productivity and the ability to deal with fuzzy problems.

Cross-functional teams are the best fit for microservices development, as products are developed, tested, built, and deployed by the same team, which makes work faster, easier, and more transparent.

For more reading, I recommend my article The 15-point DevOps Check List.

The Answers Have Changed

In 1942, Albert Einstein was a professor at Oxford University and one day he just gave a physics exam to his students.

While he was walking around one day, his assistant asked him a question:

“Dr. Einstein, this exam you gave your students is exactly the same one you gave last year?!”

(source unknown)

Einstein replied:

“Yes, it’s the same exam, but the answers have changed.”

It’s a bit similar in software engineering: we are always trying to solve the same problem, how to create a stable and functional application, but the answers change quickly.

Today, we have some answers to the problems of modern software development, but that does not mean that these answers will not change.

The answers have changed and will always change.

Connect Deeper

Make sure to follow me on Medium / Twitter to receive my future articles and subscribe to one or more of my hand-curated newsletters and Slack team chat:

  • DevOpsLinks : An Online Community Of Thousands Of IT Experts & DevOps Enthusiast From All Over The World.
  • Shipped: An Independent Newsletter Focused On Serverless, Containers, FaaS & Other Interesting Stuff
  • Kaptain: A Kubernetes Community Hub, Hand Curated Newsletter, Team Chat, Training & More

You can also check my online training Painless Docker and Practical AWS.

If you like this article, let me know by buying me a coffee here and I’ll publish similar articles explaining advanced concepts in an easy way.