Stories by Aayush Kumar on Medium

DevOps In Finance Services — A Modest & Predictable Success

Aayush Kumar — Sat, 15 Jun 2019 15:29:43 GMT

We live in a fast-changing environment where sudden technological, financial and political shifts can happen unexpectedly, and where it is important to adapt to them quickly. This is particularly true in the financial sector, where there is a greater need to exceed customer expectations and provide uninterrupted service. The financial industry has always been among the most innovative and technologically advanced areas of business. In today's digitized world, customers expect their information to be available on the go through convenient mobile applications, and they expect their Personally Identifiable Information (PII), such as bank card numbers, to be kept secure.

To keep up with the new competitors that emerge daily and to stay competitive in the market, financial institutions now have to adopt current best practices so that they can offer end-to-end, uninterrupted and user-friendly service to clients by making the best use of available technology.

Earlier, Waterfall methodologies created an inflexible approach to application development and mounting technical debt. In response, many firms pursued agile methodologies for development projects, bringing digital solutions to life quickly, often in just weeks or even days. On its own, however, this can leave the existing operations stack incomplete and add to the technical debt.

DevOps addresses these needs. It is a modern method of organizing software delivery pipelines, operational workflows and growing business-process demands while ensuring reliability and optimizing expenses, which in turn delivers more value. DevOps provides a fully agile approach to application development and maintenance while keeping to strict regulatory gates. Integrating DevOps in the financial services sector can help achieve the above-mentioned goals and also brings added advantages such as:

Applying the principles of Continuous Integration (CI) and Continuous Deployment (CD), and provisioning immutable infrastructure as code (IaC), which helps ensure security and compliance.
Better collaboration between teams. “You build it, you run it” is one of the main DevOps principles.
Automation of daily routine tasks, which helps improve operational flexibility and infrastructure performance.

The entire service life cycle works together, from design through development to production operations, maintenance and support. DevOps also helps with the adoption of other technologies such as blockchain.

Today, a new challenge for the finance sector in its day-to-day operations is meeting upcoming standards. To keep pace with digitalization, many companies are modernizing their back-end systems and processes. Compliance with regulations, risk mitigation for better security and governance requirements are also factors that make software innovation a slow and complex process. Still, over the last few years the financial services sector has been leading the change by adopting the latest technologies and modern software delivery practices, including DevOps and continuous delivery (CD). According to a recent report, 80 percent of firms in the financial services sector are already implementing DevOps and a further 14 percent plan to do so in the coming 12–18 months, for an overall rate of 94 percent.

CAMS and DevOps

CAMS is a good acronym for defining what implementing DevOps in the true sense involves.

Culture: Everybody needs to support the vision that DevOps needs to be implemented and needs to succeed.

Automation: Automate the tedious or time-consuming tasks and link all the tools in use into a single automated process.

Measurement: Objective measurements are needed to understand what needs to be automated first and where to improve.

Sharing: Share ideas within the team and across teams so that they improve.

A sense of ownership develops when adopting DevOps because the team owns the entire life cycle: a single team becomes responsible for all aspects, resulting in better software quality and maintenance.

Apart from this, DevOps aims to improve and accelerate the most time-consuming processes so that teams are able to deliver every release faster and more reliably. It provides a fully agile approach to application development and maintenance while keeping to strict regulatory gates, which can be automated so that each change is continuously validated against them. A DevOps environment breaks the delivery process into smaller sections and automates them. It is a cultural change that allows organizations to break down the barrier between the development team and the operations team.

SIX — Financial Group is a striking example of a successful DevOps transformation in the finance sector both technically and organizationally.

DevOps adoption removes security issues before they become a significant risk and scales IT processes across the entire enterprise, enabling an end-to-end, dynamic pathway to production. Steadily, more organizations are switching to DevOps and not looking back. The change to a DevOps culture may take some time, but it will transform the activities of an organization, including those in financial services.

At OpsLyft, we believe in DevOps as a way to practice engineering. We work with customers to provide state-of-the-art DevOps solutions and help them save money on their cloud bills. Write to us at contact@opslyft.com. I’m sure we’ll just fit in :)

DevOps In Finance Services — A Modest & Predictable Success was originally published in FinOps Talks on Medium, where people are continuing the conversation by highlighting and responding to this story.

DevOps: A prescription leading to a cure for healthcare

Aayush Kumar — Mon, 03 Jun 2019 19:03:15 GMT

My last couple of months were spent on DevOps consulting for a fast-growing US-based healthcare tech startup. What I realized very quickly was that the healthcare industry has to comply with complex data security regulations. At the same time, this industry has goldmines of data at its disposal. Given that multiple regulations and strict security requirements have to be satisfied while analyzing and processing data, a DevOps approach to infrastructure management is very much required in healthcare.

The healthcare industry has grown and matured significantly over the decades. From shelves filled with files, it has moved on to utilizing the latest technology available, dedicated data centers and public or private cloud infrastructure. While the advantages of adopting a DevOps methodology throughout IT infrastructure are proven, here’s what it means to adopt DevOps workflows in healthcare:

Adherence to compliance and data security: The handling of customers’ personal data is much more closely monitored and regulated today. Several of the main requirements of the new regulations are limiting access to data to the minimum required, ensuring transparency of data handling, securing data storage, and honoring the customer’s right to demand deletion of their personal details whenever they ask. From a DevOps perspective, this means eliminating all kinds of waste, as well as ensuring optimal performance and security of operations. In terms of software delivery, automated unit tests help ensure thorough code testing to minimize the risk of code bugs and backdoor access to the apps. In terms of operations, constant monitoring of infrastructure performance helps determine the optimal ways for the system to function, removing bottlenecks and potential security breaches. In terms of compliance, the regulatory requirements can be codified and automatically applied while designing and implementing the cloud infrastructure, greatly reducing the operational overhead and simplifying compliance checks by the authorities.
Shorter time to market for your products: When the cloud infrastructure management is automated, the software delivery lifecycle becomes much more predictable and the resulting time to market for your product updates is significantly shorter.
Improved infrastructure efficiency: Developing applications and healthcare services is a critical process for delivering new healthcare solutions. However, this process must be managed properly so that you don’t experience large amounts of resource sprawl. DevOps offers ways to leverage resources much more efficiently. For example, an event-driven serverless architecture allows developers to use only the resources they need for the app or microservice being developed. It’s basically utility-style consumption of resources when creating applications and services. So, when you’re using a cloud provider (such as AWS or Azure) on its HIPAA-compliant architecture, you can do some pretty powerful things with DevOps and application delivery.
Data-driven approach: As patient data continues to grow at an exponential rate, healthcare providers have turned to DevOps in an effort to accelerate their complex big data project deployments. With a variety of data sources ranging from Electronic Health Records (EHRs), medical devices, pharmacies, and lab reports, to insurance claims, comes the need to structure, update and analyze that data for effective and improved healthcare delivery. The deployment delays associated with a traditional software development life cycle can significantly raise the cost and affect the overall usability of the project.

Most public cloud vendors now offer solutions that can be effectively leveraged to build healthcare management platforms at scale.

The healthcare industry is rapidly developing data-driven technology solutions to further patient-centered care all while reducing costs. With the healthcare landscape shifting to an on-demand delivery system of medical services and solutions, healthcare providers and hospitals are turning to DevOps in an effort to advance healthcare innovation.

At OpsLyft, we are passionate about everything related to the cloud. We work with customers across geographies to provide state-of-the-art DevOps services. We’d love to hear your DevOps challenges and success stories, so please do write to us at contact@opslyft.com. I’m sure we’ll just fit in well for you :)

DevOps: A prescription leading to a cure for healthcare was originally published in FinOps Talks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Digital transformation and the road to AIOps

Aayush Kumar — Mon, 25 Mar 2019 07:11:54 GMT

The idea for this post came from a quick call with one of my college juniors, in which I was trying to explain Artificial Intelligence and its possible applications in DevOps to her. According to her, it’s a system which can be run by any person irrespective of background. That’s perfect!

In my previous post on the dependency of DevOps on AI, we saw how AI can help DevOps teams develop, deliver, deploy and organize applications to improve performance and carry out business operations. If we think deeper about this, we know for sure that ‘DevOps & Cloud’ has already become a promising combination for many companies across the world. Though cloud and DevOps are different propositions, they are intertwined, and this combination provides agility and efficiency in IT operations. Automated IT infrastructure is already well established. Self-healing systems are not far off either, with the arrival of container orchestration tools (Kubernetes, Docker Swarm, etc.). Automated build and deployment of the entire cloud infrastructure using a DevOps pipeline in agile fashion is common now. AI will further expand the boundaries of IT infrastructure automation. The future will see intelligent infrastructure powered by sophisticated algorithms using technologies such as machine learning (ML) and deep learning. From an execution point of view, this can be achieved through AIOps.

AIOps is the application of artificial intelligence to IT operations. It is the future of ITOps, combining algorithmic and human intelligence to provide full visibility into the state and performance of the IT systems that businesses rely on. Successful digital transformation will rely on AIOps to enable IT to operate at the speed that modern business requires. According to Gartner, promising AIOps products should have the following characteristics:

Reduce noise (for example, in the form of false alarms or redundant events)
Provide better causality, which helps identify the probable cause of incidents
Capture anomalies that go beyond static thresholds to proactively detect abnormal conditions
Extrapolate future events to prevent potential breakdowns
Initiate action to resolve a problem (either directly or via integration)
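
The third characteristic, catching anomalies that a static threshold misses, can be illustrated with a small sketch. The function name, window size, z-score cutoff and latency samples below are all assumptions chosen for illustration and are not taken from any AIOps product. The idea is simply to flag a point when it deviates strongly from the rolling mean of the samples just before it.

```python
from statistics import mean, stdev

def rolling_anomalies(values, window=5, z_cutoff=3.0):
    """Return indices whose value deviates from the mean of the previous
    `window` samples by more than `z_cutoff` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > z_cutoff:
            anomalies.append(i)
    return anomalies

# Hypothetical latency samples in milliseconds.
latencies = [100, 102, 99, 101, 100, 103, 98, 160, 101, 100]

# A fixed alert level that the spike at index 7 never reaches.
STATIC_THRESHOLD = 200
static_alerts = [i for i, v in enumerate(latencies) if v > STATIC_THRESHOLD]

# The rolling check compares each sample with its own recent history.
dynamic_alerts = rolling_anomalies(latencies)
```

With these made-up samples the static threshold raises nothing, while the rolling check flags the spike at index 7. A real system would need a more robust baseline (seasonality, trend), but the contrast between the two approaches is the same.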

However, with so many DevOps tools in existence, we need to understand where AIOps tools fit into a modern IT environment. When looking at AIOps for the first time, it is not immediately obvious how it fits into existing categories of tools. The reason is that AIOps does not replace existing monitoring, log management, service desk, or orchestration tools. Instead, it sits at the intersection of these different domains, consuming and integrating information across all of them and providing useful output to ensure a synchronized picture is available from every tool. The whole process will include applying ML models to analyze historical data and predict the future of operations on a timeline, highlighting potential issues and suggesting possible remediation. There could be various manifestations of this troika of AI, Cloud and DevOps; however, we are still at a nascent stage of working with it. But the basic contours will consist of embedded intelligence to automate applications and infrastructure, self-learning applications, and built-in governance powered by integrated analytics.

To bring such systems into action, Gartner suggests the following steps for transforming ITOps into AIOps:

Deploy AIOps by adopting an incremental approach that starts with historical data, and progress to the use of streaming data, aligned with a continuously improving IT operations maturity.
Select platforms that enable comprehensive insight into past and present states of IT systems by identifying AIOps platforms that are capable of ingesting and providing access to text and metric data.
Deepen the IT operations team’s analytical skills by selecting tools that support the ability to incrementally deploy the four phases of IT-operations-oriented machine learning: descriptive, diagnostic, proactive capabilities and root cause analysis to help avoid high-severity outages.

The adoption of AI with DevOps and Cloud is symbiotic. Technically, the state of software engineering is such that deep learning and machine learning are progressively becoming mainstream. We no longer need to understand mathematical jargon like ‘stochastic gradient descent’ or ‘back propagation’ to apply deep learning concepts. Nor do we have to write a thousand lines of Python code to build a chatbot. Hundreds of machine learning and deep learning models are now available as managed services on the cloud, along with various AI tools provided by cloud platforms. Cloud providers are trying to make it easy to run machine learning workloads on their platforms. They offer GPU-based virtual machines (VMs) for building ML applications in the cloud, APIs for pre-built models, and natural language processing (NLP) engines that developers can integrate with their applications. Companies are making AI more accessible to the individual developer. Amazon SageMaker is one such effort to make a machine learning kit available to everyday developers for building intelligent applications. We will have products and services with built-in machine learning algorithms, such as sentiment analysis, predictive algorithms and deep learning models. Prominent tools such as the ELK stack and Splunk have already had machine learning concepts infused into their products to identify anomalous patterns and correlate events across infrastructure, application and business environments.

To conclude, the combination of AI, DevOps and Cloud is going to change the way business is conducted across sectors. DevOps and AI will keep moving up the value chain of the technology stack along with Cloud. Intelligent automation will become the new normal, driving new innovations and standards. Enterprises should start finding ways to build implicit intelligence into their IT ecosystem.

At OpsLyft we are trying to build DevOps tools and platforms powered by AI/ML. We also provide experienced DevOps consulting. We’d love to hear from you, so reach out to us at contact@opslyft.com. I’m sure we’ll fit in somewhere for you.

Digital transformation and the road to AIOps was originally published in FinOps Talks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Strong Interdependency Between AI & DevOps

Aayush Kumar — Mon, 11 Mar 2019 04:44:51 GMT

As a part of my entrepreneurial journey, meeting industry experts and gathering insights from them is a regular thing now. I met someone this weekend and we talked about what it takes to build an ideal DevOps world: a world where product owners, development, QA, IT Operations and Infosec work together, not only to help each other, but also to ensure that the overall organization succeeds. By working towards a common goal, they enable the fast flow of planned work into production (e.g. performing tens, hundreds or even thousands of code deploys per day), while achieving world-class stability, reliability, availability and security. In this world, cross-functional teams rigorously test their hypotheses about which features will most delight users and advance the organization’s goals. They care not just about implementing user features, but also actively ensure their work flows smoothly and frequently through the entire value stream without causing chaos and disruption to IT operations or any other internal or external customer. But this is hard. The process of constant improvement with human involvement is a challenge when you operate things at scale. Hence, automation is the key! It’s the lifeline. It contributes to better sync among the teams and eventually to faster and more accurate deployments and releases.

However, can we make our automations smart and self-learning? Think about automation over automations which just knows what you want for your infrastructure!

Yes, I’m talking about using AI and machine learning capabilities to enhance DevOps. But to realize any benefit from AI in DevOps, a creative mindset may be required. AI can change how DevOps teams develop, deliver, deploy and organize applications to improve performance and carry out business operations.

The future of DevOps is AI-driven, helping to manage the immense volume of data and computation in day-to-day operations. AI has the potential to become the primary tool for assessment, computation and decision-making in DevOps.

For example, in the most effective medical diagnostic process, you don’t depend on automated detection alone. You use that detection to empower a human diagnostician who can apply a broad understanding of pathologies and deep experience with the complexities of individual patients to deliver the highest quality care. In DevOps, we can do the same. We can use AI to capture insights that teach us how to continuously optimize our workflows and processes. We can also use what we learn from AI to push our work higher up the value chain.

Collaboration between DevOps and AI can have numerous use cases. Some of them are:

Smarter Development: We all learn through iterations. The same goes for machines. Many machine learning systems use neural networks, which are layered algorithms that accept multiple data streams and process that data through the layers. You train them by inputting past data with a known result. These learning systems can also be applied to data collected from other parts of the DevOps process. This includes more traditional development metrics such as velocity, burn rate and defects found.
Smarter Monitoring: If you’re beyond the beginner’s level in DevOps, you are likely using multiple tools to view and act upon data. Each monitors the application’s health and performance in different ways. What we lack, however, is the ability to find relationships between this wealth of data from different tools. Learning systems can take all of these disparate data streams as inputs and produce a more robust picture of application health than is available today.
Predicting Faults: This relates to analyzing trends. If you know that your monitoring systems produce certain readings at the time of a failure, a machine learning application can look for those patterns as a prelude to a specific type of fault. If you understand the root cause of that fault, you can take steps to prevent it from happening.
Feedback Mechanisms: One of the biggest problems with DevOps is that we don’t seem to learn from our mistakes. Even if we have an ongoing feedback strategy, we likely don’t have much more than a wiki that describes problems we’ve encountered and what we did to investigate them. All too often, the answer is that we rebooted our servers or restarted the application. Machine learning systems can dissect the data to show clearly what happened over the last day, week, month, or year. They can look at seasonal trends or daily trends, and give us a picture of our application at any given moment.

We can literally derive numerous use cases over a coffee when it comes to combining AI with DevOps. Having said that, it’s very important to know your DevOps first. As enticing as it may be to dive headfirst into AI, you’re not going to be as effective as you can be if you lose the humanity from your dev team. You don’t want to be so reliant on robots and so dysfunctional as humans that when it comes to complex problems, you are functionally unable to process or resolve them. At OpsLyft, we believe that the future of AI and DevOps is bright. There’s a future here where the rote business of work that we all deal with every day will be as archaic as accounting by hand. We’re in an exciting time.

We’d love to hear your stories about DevOps automation and possible use cases of AI/ML for it. Reach out to us at contact@opslyft.com, as we can surely help you enhance your cloud by simplifying DevOps for you :)

Strong Interdependency Between AI & DevOps was originally published in FinOps Talks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Microservices at Scale: Apache Mesos

Aayush Kumar — Mon, 25 Feb 2019 14:49:47 GMT

One of the best evolutions seen in software engineering is the paradigm shift from monolithic systems to building something called microservices or Service Oriented Architecture (SOA). Having a single codebase for a product seems clean when it’s reasonably small in scale. But you may start running into problems as your codebase grows. Some of the common problems that people face:

As new features are introduced, components become tightly coupled and isolation becomes difficult.
Continuous Integration becomes a liability rather than an asset as everything in the codebase has to be redeployed for even a small change.
It takes real effort and confidence to push a big code change, as you know for a fact that if it breaks, your entire system will go down.

With SOA, instead of building a single system which handles all your business use cases, we divide it into isolated fundamental units or sub-systems, each of which handles one piece of business logic. Every sub-system can then be exposed as an internal API (service) which can be integrated with other services as desired. Some of the key advantages of this approach:

Each service can have its own tech stack and its own codebase.
Maintenance of the overall product becomes easy due to isolation of services from one another.

It is easy to be convinced of the advantages of SOA over a monolith as a design choice for systems. But if we think a step further, we see that handling and maintaining SOA-based systems at scale is not so straightforward.

To elaborate on this, think of a company which has isolated services with backends such as Hadoop clusters for data processing jobs, Docker containers for small apps, in-house database servers and so on. An organisation with such diverse technical stacks needs to pay detailed attention to the distribution of the infrastructure resources which host these services.

The most important and common challenge is how to distribute the resources so that server capacity is under-utilised as little as possible. All these resources come at a cost, and when you’re dealing with things at scale this cost is huge! Following are some of the common practices which people adopt to cope with this challenge:

Statically isolating a fixed number of servers in the cluster for every work environment, as per requirement. For example: 25 dedicated machines for a Hadoop cluster, 15 for Kubernetes and 10 for an MPI cluster.
Having one single server with huge computing and storage capacity and running VMs on it for the different work environments.

The diagram below illustrates a standard architecture for implementing the above mentioned practices :

Classic Cluster Architecture. Source : https://www.inovex.de/blog/apache-mesos-an-introduction/

Now let’s assume a very basic use case where a data processing job requires the Hadoop cluster to run for only 6 hours a day. Going by the above approach, the Hadoop cluster is under-utilised for the remaining 18 hours of the day. This is certainly a big hit in terms of cost, so the first approach mentioned above isn’t efficient in terms of infrastructure cost. In most cases, the services running on these individual clusters are so independent in nature that it becomes difficult to share cluster resources and data efficiently between them. So this rules out the second approach as well, to an extent.
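
The cost of static partitioning can be made concrete with a little arithmetic. The sketch below uses the machine counts from the example above (25 for Hadoop, 15 for Kubernetes, 10 for MPI) and the 6-hours-a-day Hadoop job. The busy hours assumed for the other two clusters are hypothetical numbers, included only so that an overall figure can be computed.

```python
HOURS_PER_DAY = 24

# Machine counts come from the example in the text. The 6 busy hours for
# Hadoop also come from the text; the other busy_hours values are assumptions.
clusters = {
    "hadoop": {"machines": 25, "busy_hours": 6},
    "kubernetes": {"machines": 15, "busy_hours": 24},
    "mpi": {"machines": 10, "busy_hours": 12},
}

def utilisation(cluster):
    """Fraction of the day during which the cluster is doing work."""
    return cluster["busy_hours"] / HOURS_PER_DAY

def idle_machine_hours(cluster):
    """Machine-hours per day that are paid for but unused."""
    return cluster["machines"] * (HOURS_PER_DAY - cluster["busy_hours"])

hadoop_utilisation = utilisation(clusters["hadoop"])
hadoop_idle = idle_machine_hours(clusters["hadoop"])

total_idle = sum(idle_machine_hours(c) for c in clusters.values())
total_capacity = sum(c["machines"] for c in clusters.values()) * HOURS_PER_DAY
overall_utilisation = 1 - total_idle / total_capacity
```

With these inputs the Hadoop machines are busy a quarter of the time and sit idle for 450 machine-hours every day. A shared cluster could hand those hours to other workloads instead.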

At Indix HQ we use Apache Mesos, an open source cluster management solution. Mesos is a distributed systems kernel and works as a cluster scheduler. It abstracts the resources (CPU, memory, I/O, network) of a cluster for end users. In effect, Mesos abstracts the resources of the whole cluster into one big computer and gives the user OS-like functionality at cluster level. Have a look at the following diagram:

A cluster with Mesos as scheduler. Source : https://www.inovex.de/blog/apache-mesos-an-introduction/

Apache Mesos runs on POSIX-style operating systems (e.g. Linux and OS X) and allows you to share resources between multiple frameworks, which are treated much like applications. With Mesos you are able to combine all these resources into one big cluster and run different workloads on it. As a cluster manager, Mesos was architected to solve a very different set of challenges:

Colocate diverse workloads: The same infrastructure can support analytics, stateless microservices, distributed data services and traditional apps to improve utilisation and reduce costs.
Provide evergreen extensibility: Run new applications and technologies without modifying the cluster manager or any of the existing applications built on top of it.
Elastically scale: Scale applications and the underlying infrastructure from a handful, to tens, to tens of thousands of nodes.

Now let’s take a deep dive into how Mesos works internally. The diagram below shows the Mesos architecture:

Mesos Architecture. Source : http://mesos.apache.org/documentation/latest/architecture/

Typically there are four major components of Mesos:

Master: Coordinates the work and decides which framework gets how many resources
ZooKeeper: Used for coordination of the masters
Slave: A worker node which provides its resources to run tasks of a framework
Framework: Has a scheduler component which decides where a task gets launched and an executor which executes one or more tasks on the Slave.

And this is how it works: When a Slave notices that it has free resources, it reports them to the Mesos master together with its Slave ID. The allocation module inside the master decides which framework will get a resource offer. When a framework receives an offer, it can decide how many of the resources it will take. For example, the framework may take only the CPUs of an offer to start its tasks. When the framework has decided which resources it will take and how many tasks it will start, it sends a message to the master. The message contains the number of tasks and the resources that it will allocate for each task. In the last step, the master passes this information on to the Slave, which reads these task infos (task name, Slave ID, resources) and starts the tasks. When the tasks have finished and the resources are free again, all these steps are repeated.
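
The offer cycle described above can be sketched as a toy simulation. This is not the Mesos API: the classes, the round-robin allocation and the resource figures are all hypothetical, and they merely stand in for the master's allocation module, a framework scheduler and a worker node.

```python
from dataclasses import dataclass, field
from itertools import cycle

@dataclass
class Slave:
    slave_id: str
    free: dict                      # e.g. {"cpus": 4, "mem": 4096}
    running: list = field(default_factory=list)

@dataclass
class Framework:
    name: str
    task_need: dict                 # resources that one task requires
    pending: int                    # tasks still waiting to be launched

    def on_offer(self, offer):
        """Take as many tasks as fit into the offer; the rest is declined."""
        fits = min(offer[res] // need for res, need in self.task_need.items())
        count = min(fits, self.pending)
        self.pending -= count
        return [dict(self.task_need) for _ in range(count)]

class Master:
    def __init__(self, frameworks):
        # Stand-in for the allocation module: plain round-robin.
        self._next_framework = cycle(frameworks)

    def handle_free_resources(self, slave):
        """Offer the Slave's free resources to one framework and launch
        whatever tasks that framework accepts."""
        framework = next(self._next_framework)
        tasks = framework.on_offer(dict(slave.free))
        for index, task in enumerate(tasks):
            for resource, amount in task.items():
                slave.free[resource] -= amount
            slave.running.append((framework.name, index))
        return framework.name, len(tasks)

slave = Slave("S1", {"cpus": 4, "mem": 4096})
spark = Framework("spark", {"cpus": 1, "mem": 1024}, pending=3)
web = Framework("web", {"cpus": 1, "mem": 512}, pending=5)
master = Master([spark, web])

first = master.handle_free_resources(slave)    # spark receives the first offer
second = master.handle_free_resources(slave)   # web is offered what is left
```

In this run the first framework launches its three tasks, and the second one can place only a single task on the remaining CPU. The real allocation module is pluggable and considerably more sophisticated than round-robin.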

Mesos has a unique ability to individually manage a diverse set of workloads, including traditional applications (e.g., Java), stateless Docker microservices, batch jobs, real-time analytics, and stateful distributed data services. If you want to build a reliable platform that runs multiple mission-critical workloads including Docker containers, legacy applications (e.g., Java), and distributed data services (e.g., Spark, Kafka, Cassandra, Elastic), and want all of this to be portable across cloud providers and/or datacenters, then Mesos (or a Mesos distribution such as Mesosphere DC/OS) is the right fit for you.

In the next post I’ll talk about a framework named Marathon, which we use on top of Mesos as an orchestrator for deploying our internal as well as customer-facing microservices.

Thank You for reading this. I am open to any kind of feedback on this.

We at OpsLyft help organisations meet their goals for DevOps and make them win with it. Get in touch with us at contact@opslyft.com for further assistance and help.

Microservices at Scale: Apache Mesos was originally published in FinOps Talks on Medium, where people are continuing the conversation by highlighting and responding to this story.

A Tale of Configuration Management

Aayush Kumar — Mon, 25 Feb 2019 14:43:48 GMT

As in any growing product startup, our tech stack evolved with time. It was iterated over and over (sometimes from scratch) as it scaled to achieve its engineering milestones. These milestones were mostly definitive engineering processes and practices which were validated and confirmed to fit our use cases. One such evolution which I experienced was that of Configuration Management.
This article talks about my Configuration Management journey at Indix HQ and how we ended up using Ansible as a standard for infrastructure creation, provisioning and deployments.

Previous State

As of circa Aug 2015, we at Indix used a variety of tools and languages to take care of infrastructure creation, provisioning and deployment.

Infrastructure Creation: AWS command line tools, fog, knife-ec2
Provisioning: chef, chef-solo
Deployment & Orchestration: Capistrano, fabric, shell scripts, Go-CD commands, beanstalk etc.

There were some shortcomings with this approach:

Failures happened since there was no tight coupling between infra creation, provisioning and deployment. A failure at one stage led to cascading failures later.
The tooling to get things into production was pretty complex. Having to understand multiple tools and get familiar with a new language (Ruby / Python) is not optimal for most developers. This led to less participation from the larger group of developers in taking care of the systems they built. It also caused a lot of burnout for the DevOps team.
We didn’t have any single place to know the source of truth of the system. We largely used Chef in a pull-based mode, which means that agents keep polling the central Chef server to find out what changes need to be applied to the system. We experienced state discrepancies where the GitHub code, the Chef server state and the machine’s internal state didn’t match. This happened due to a lack of proper checks and balances in the way of working, developer negligence, or sometimes a failure during execution of the Chef agent, which failed silently without raising any alerts. The end result was invariably production downtime or failure of some or many of our services. Apart from this, we also saw the central Chef server become a single point of failure, leaving our systems in a bad state.

Our Experience & Learning

Too many tools to learn, each with a significant learning curve. Also, very tight coupling of provisioning across different teams / projects.
Extremely difficult to quickly create an exact replica of the production environment as an experimental one in order to try new code changes.
Difficult to tweak hardware (Amazon machine types & volume types), which is required for tuning and battle-hardening any backend system.
No consistent or simple way to configure some of the basic things, such as monitoring, logging and system sanity checks.
Neither idempotent nor immutable. This means we couldn’t simply rerun our entire infra creation, provisioning and deployment the same way multiple times and be certain that things would keep working fine.

Besides the aforementioned pain points, we also realised that we would like to have a quicker developer environment setup using virtual boxes or Docker. This would give us the capability to quickly test things locally and write integration tests.

To summarise, we wanted a system which would be:

Dead simple. This means a very low learning curve and more developer participation. The indirect consequence is that people can move fast and get things done rather than being entirely dependent on the DevOps team.
A single (or minimal) tool for infra creation, provisioning and deployment.
One way to do things. Not 5 different languages and frameworks. Something like YAML.
Building idempotent and immutable infrastructure. This means you can run the same script to create, provision and deploy infra multiple times and it should simply work. It also means that recreating infra, provisioning and deployment shouldn’t be an exercise in frustration but a part of the way we operate.
The above point ensures that it is easy to replicate complete environments with minimal overhead. It also ensures that our continuous delivery system is used only as a trigger and doesn’t hold any state or intelligence.
Ability to quickly create local dev setup, test environments and write integration tests
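
To make the idempotency requirement above concrete, here is a minimal sketch in Python of a step that can be rerun safely. It is only an illustration of the property we wanted, not how any particular tool implements it; the function name `ensure_line` and the file `app.conf` are made up for this example.

```python
from pathlib import Path
import tempfile


def ensure_line(path: Path, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file had to be changed and False if it was already
    in the desired state, so rerunning the step converges to the same file.
    """
    existing = path.read_text().splitlines() if path.exists() else []
    if line in existing:
        return False  # already converged, nothing to do
    existing.append(line)
    path.write_text("\n".join(existing) + "\n")
    return True


def demo() -> tuple:
    """Apply the same step twice; return (changed_first, changed_second, content)."""
    with tempfile.TemporaryDirectory() as tmp:
        conf = Path(tmp) / "app.conf"  # hypothetical config file
        first = ensure_line(conf, "log_level=info")
        second = ensure_line(conf, "log_level=info")
        return first, second, conf.read_text()
```

The first run reports a change and the second run reports none, while the file content stays the same. That "describe the desired state, then converge to it" behaviour is what lets a whole create, provision and deploy run be repeated without fear.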

Ansible Ecosystem

We actively followed Ansible for over a month before seriously trying it out. It seemed to solve all the problems described earlier in a really simple and elegant fashion.

The major problems that it addresses are:

Single place to create infra, provision, deploy and orchestrate.
You write only in YAML.
The primary mode of operation is push-based, which means the system is immutable post deployment. It also gives the flexibility to be used in a pull-based mode, which is useful for, say, a Hadoop cluster where nodes may come and go. In both cases, however, the GitHub source is the single source of truth.
Great documentation and solid pre-built modules. So for complex things like machine creation and ensuring that only a single machine in a group exists, you don’t have to do anything. Most modules are written to be idempotent. This is true not just for provisioning (where it is true for Chef & Puppet also) but even for infra creation and deployment.

Current State

As of now, most of Indix’s infra is under configuration management via Ansible. This includes migrating legacy systems from no configuration management to an Ansible-based setup, and a few Chef-based setups to Ansible. Any new system being built by developers is brought up using Ansible.
Owing to its simplicity, the best and perhaps the most important milestone we have been able to achieve is that developers now build their systems end to end: they not only focus on writing application code but also write the Ansible scripts for the application’s infrastructure creation, provisioning and deployments. This is a notable cultural practice which has been standardised, and it distributes the ownership and responsibility of the DevOps team among developers.

A Tale of Configuration Management was originally published in FinOps Talks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Demystifying DevOps with Operability Review

Aayush Kumar — Mon, 25 Feb 2019 14:29:56 GMT

As a part of the Software Development Lifecycle, the major focus of developers is mostly on coding, optimising and testing the business logic. However, while or even before writing code for the application, not much attention is given to what happens when the application, as a dependent/independent system, is up & running in production. This leads to a lot of complexities, inefficiencies, unwanted surprises and challenges when the code is pushed to production, most of which could have been eliminated or reduced had we thought of them a little earlier. Hence the necessity of an Operability Review. By the term Software Operability, we mean the readiness of new software to be deployed in production and start handling real traffic. The whole purpose of doing an Operability Review of a service is to ensure that it runs well once deployed, operates as smoothly as possible, and needs minimal manual intervention to keep it up & running.
During my stint at Indix HQ, I took operability seriously and engaged everyone in regular reviews of the operability items so that we didn’t miss anything important or discover it late. In this blog, we will discuss the process we follow, the areas we focus on while doing such reviews, and how useful this has proved for our product life cycle.

Elements of Operability Review

Before we begin to discuss how we perform operability reviews at Indix & what areas we cover, let’s first try to classify the various aspects of a system running in production. From a very high level, the health of any system/application can be categorized as follows:

System Architecture
Configuration management & deployment
Security
Data storage
Monitoring
Performance

These are the areas which we should further break up into specific components & review separately. Also note that Cost is a critical area which is not mentioned above, since it is not really an attribute of the system; but in everything we do, we try to optimise the cost incurred. The following diagram shows the coverage of all the aspects under Operability:

Image Credits : Arijit Bhattacharyya

System Architecture

Architecture & Deployment diagram:

The first thing is to have the deployment & architecture diagrams. Whereas the architecture diagram depicts the logical flow of requests/responses inside the system, the deployment diagram is focussed on the actual infrastructure. If the system is not too complex, it makes sense to have them in a single diagram. Having these diagrams helps people unfamiliar with the application to get a quick understanding of what is inside. It also helps in technical discussions around the other areas under operability.

Deployment platform:

Our entire infrastructure is in AWS, so we don’t really have options in terms of hosting. Though we plan to evaluate other cloud providers like Azure or GCE in future, EC2 is the choice at this point.

We have both Docker & non-Docker services running (on EC2; we are not using ECS) & we have applications running both on EC2 directly as well as on Kubernetes & Mesos clusters. We use Mesos heavily & several internal applications as well as some user-facing services are deployed on Mesos. Based on the nature of the new service, we decide whether it’s the right candidate to be deployed on Mesos or whether it should be on EC2 instances.

Load Balancing:

If the service runs on multiple instances & needs load balancing, then some of the important things to decide for the ELB are:
Is the service user-facing? If not, make sure to use an internal ELB (routing via private IPs), as network packets will otherwise traverse the public internet (which doesn’t make sense & might raise security concerns).
Make sure to enable Cross-AZ load balancing. This helps to distribute requests among the registered nodes irrespective of whether they are evenly spread across the AZs or not.
Enable Connection Draining with appropriate Timeout settings. This ensures in-flight requests are not interrupted when an instance is taken out of rotation (OOR) for maintenance or similar.
Set Idle Timeout appropriately. Idle Timeout is the duration a connection can stay idle before the ELB closes the socket. This applies to both client-side as well as EC2-side sockets.
Enable Access Logs for the ELB. This is very important, as without the access logs it becomes too difficult to troubleshoot issues in production systems when requests come through the load balancer. We have an S3 bucket designated to store the logs of every ELB under top-level folders/paths. It might also be useful to push these logs to your centralized logging system for further analytics.
Next are the Stickiness settings. Generally speaking, Stickiness goes against the very concept of load balancing, but sometimes one may need session affinity to ensure that every client request in a session is routed to a particular instance. This is generally implemented using AWSELB/application cookies.
Finally, decide what kind of load balancer is needed. If we need features like content-based routing or multiple listener ports per EC2 node, ALB is the choice. But note that ALB doesn’t support TCP listeners yet, so only HTTP/HTTPS protocols can be used. Use ELB otherwise.
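
A checklist like the one above is easy to turn into an automated review step. The sketch below is one hypothetical way to do it in Python: the `settings` dictionary and its keys are invented for this example and are not the shape of any AWS API response, so in practice you would first map your real load balancer attributes into it.

```python
def review_elb(settings: dict) -> list:
    """Return the review items that a load balancer description fails.

    `settings` is a hand-written, hypothetical description of an ELB,
    not an AWS API response.
    """
    findings = []
    # Internal services should not be reachable over the public internet.
    if not settings.get("user_facing") and settings.get("scheme") != "internal":
        findings.append("non-user-facing service should use an internal ELB")
    if not settings.get("cross_zone"):
        findings.append("enable cross-AZ load balancing")
    if not settings.get("connection_draining"):
        findings.append("enable connection draining with a suitable timeout")
    if not settings.get("access_logs"):
        findings.append("enable access logs")
    if settings.get("idle_timeout_seconds") is None:
        findings.append("set an explicit idle timeout")
    return findings
```

An empty list means the load balancer passes these particular checks. Stickiness and the ALB versus ELB choice are left out on purpose, since they depend on the application rather than on a fixed rule.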

High Availability:

This is a very broad area & we generally take decisions based on the nature of the service & its SLA committed to customers (or internally, for internal systems). There are different areas to look into for ensuring the resiliency of your application or achieving High Availability:

Deploy your instances in multiple regions & in multiple AZs in each of the regions.
Figure out any SPOF in the deployment. If it blocks the real-time response of the app, it is not acceptable. We need to ensure a replica exists & is distributed across independent network infra. If the service uses an RDBMS, we generally use RDS, which has Multi-AZ deployments. That ensures synchronous replication of the data to a standby instance in a different Availability Zone (AZ).
Last but probably most important is the choice of EC2 purchase type. We use Spot instances heavily to reduce our cost, but that brings the additional difficulty of maintaining high availability. Depending on the nature of the service, it may not be possible to use Spot instances at all, but most of the time a hybrid model works.
We imitate real production scenarios & try to estimate the numbers for MTTD (mean time to detect), MTTR (mean time to recover), RPO (recovery point objective) and RTO (recovery time objective) for our services.
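
As an illustration of how the MTTD and MTTR numbers can be estimated from simulated incidents, here is a small Python sketch. It assumes each incident is recorded as a (started, detected, recovered) tuple of timestamps and that both metrics are measured from the start of the incident; some teams measure MTTR from detection instead, so treat that as an assumption of this example.

```python
from datetime import datetime, timedelta
from statistics import mean


def mean_times(incidents):
    """Return (MTTD, MTTR) as timedeltas.

    Each incident is a (started, detected, recovered) tuple of datetimes.
    Both averages are measured from `started` (an assumption of this sketch).
    """
    detect = mean((d - s).total_seconds() for s, d, _ in incidents)
    recover = mean((r - s).total_seconds() for s, _, r in incidents)
    return timedelta(seconds=detect), timedelta(seconds=recover)


# Two made-up game-day incidents starting at 03:00.
t0 = datetime(2019, 1, 1, 3, 0)
sample = [
    (t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=20)),
    (t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=40)),
]
mttd, mttr = mean_times(sample)
```

For the two sample incidents this gives an MTTD of 5 minutes and an MTTR of 30 minutes, which can then be compared against the SLA of the service.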

Scalability:

Scalability is extremely important, as it lets you avoid over-provisioning any resource & scale out as & when necessary. That is an important requirement for optimising the cost of your infra. The only thing to ensure is that the design of the service is horizontally scalable & not bound to a single machine due to local data storage, in-memory data etc. Also, Auto Scaling is a great feature in AWS which allows you to define CloudWatch-based criteria to scale out or scale in automatically.

Configuration Management & Deployment

CM Tool & Model:

We used Chef for configuration management for a long time & even today many of the systems are deployed using Chef client/server or chef-solo. But at some point two years back, we decided to move to Ansible due to its simplicity & developer-friendly architecture. Today, any new infra is deployed using Ansible (and in a few cases we have started using Terraform as well, for the provisioning part). The important thing to ensure here is that the Ansible roles/playbooks are robust enough that we don’t need any manual intervention to bring up a system. Generally, we keep separate playbooks for provisioning & configuration management, as provisioning is generally a one-time thing whereas configuration management runs regularly on the production systems. You can find some of our Ansible roles published here.

The next important thing is to define the management model. We generally have CI/CD pipelines for deploying configuration changes in push mode, but there are cases where pull mode is a must. Consider an Auto Scaling group: instances can come up anytime due to a scale-out activity & need CM to run right after they come alive. This is taken care of by running the same set of configuration management playbooks as part of UDF. Planned changes are still done via pipeline, where a code commit/merge to the master branch triggers the appropriate pipeline(s), which then execute the changes on all the relevant nodes in the infra.

CI/CD Pipelines:

We use ThoughtWorks GoCD as the only CI/CD tool at Indix. Any app should have separate Staging & Production pipelines which automate the testing/deployment of the Staging & Production clusters respectively.

Security

There are a few standard security measures in AWS, like using a VPC (Classic is no longer an option anyway), using appropriate ACLs in the subnets & opening only the appropriate inbound/outbound ports in the Security Groups. Generally, ACLs are not touched and most of the Allow rules are applied to SecGroups. We use separate SecGroups for separate services so that modifying one doesn’t impact anything else. Port requirements are reviewed & SecGroups are created accordingly.

The next thing is to ensure the right management of secrets like database passwords & similar, which should never be stored in plaintext when used by the CM tool or deployment pipelines. For example, the encryption key in Chef (used for encrypted data bags) & the Vault key (used by Ansible Vault) should be used to encrypt those passwords, which can then be stored in version control.

The next tricky part is how to manage those encryption keys. Of course they should never be committed to version control, but they have to be transferred to an instance while bootstrapping. For this, we use IAM roles, which allow AWS resources to be accessed without using Access/Secret keys. So the strategy is to store the encryption secrets in an S3 bucket & then launch instances with an IAM role with appropriate policies, so that they can download those keys during bootstrapping.

Finally, HTTPS is becoming more & more important with HTTP/2, even for pages without any sensitive content. At present, we use certificates from both ACM (AWS Certificate Manager) and Let’s Encrypt. Since ACM is limited to ELB or CloudFront, we use LE certificates for externally hosted Indix end-points.

Data Storage

At Indix, we deal with massive amounts of data & generally every system in the infrastructure has to deal with some sort of relational or unstructured data. Depending on the requirements of the app, we decide which storage mechanism out of S3/Ephemeral/EBS/EFS fits the bill. This generally depends on the retrieval latency, the durability requirement of the data etc. While using S3, we further decide whether it should be standard S3, RRS or IA. Similarly for EBS, we have multiple options, like the old magnetic volumes or the newer generation volume types io1, gp2, st1 and sc1. If properly chosen, these decisions can have a significant impact on your cost while maintaining the latency/durability requirements of the data. Finally, in a few cases we use EFS as well, which is appropriate as shared storage with lower latency than S3 & can be mounted on more than one EC2 instance (unlike EBS).

Having some idea about the growth of your data & its retention is also important. Accordingly, one can define a proper Lifecycle policy for S3 buckets to avoid wasting dollars on storing old & unused data. Increasing the capacity of EBS volumes on the fly is possible now, but it nevertheless helps to allocate the right capacity from the beginning. EFS scales automatically, so that’s not a concern there.
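
To give an idea of what such a Lifecycle policy can look like, here is a hedged Python sketch that builds one rule: move objects under a prefix to the IA storage class after some days, then expire them later. The dictionary is meant to follow the structure accepted by the S3 bucket lifecycle configuration API as I understand it, so please verify it against the current AWS documentation before using it. The function name, the rule ID and the `logs/` prefix are made up for this example.

```python
def lifecycle_rules(prefix: str, ia_after_days: int, expire_after_days: int) -> dict:
    """Build a lifecycle configuration with a single tier-then-expire rule.

    The layout is an assumption based on the S3 lifecycle configuration
    structure; check it against the AWS docs before applying it.
    """
    if expire_after_days <= ia_after_days:
        raise ValueError("expiry must come after the transition to IA")
    return {
        "Rules": [
            {
                "ID": "tier-then-expire",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": ia_after_days, "StorageClass": "STANDARD_IA"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


# Example: move logs to IA after 30 days, delete them after a year.
config = lifecycle_rules("logs/", 30, 365)
```

Keeping the rule in code next to the rest of the infrastructure definitions means the retention decision gets reviewed like any other change.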

Backup & Recovery is naturally very crucial; we need to ensure that the right backup strategy is in place. For SQL data stores in RDS, it’s important to ensure a daily backup is in place. For data stored in filesystems, regular snapshots of the EBS volumes need to be ensured. We use cron jobs & Lambda functions to schedule such EBS snapshots. The concepts of MTTD/MTTR/RPO/RTO apply here similarly.

Monitoring

Monitoring requirements are analysed from both System & Application aspects. While every service should have standard system monitoring like load average, memory, disk space, IO etc., application-layer monitoring is unique to every app & needs to be identified accordingly. Apart from monitoring the subsystems, an end-to-end check is always a must.

For the internal monitoring platform, we have been using a combination of Sensu & CloudWatch, but recently we have been working on migrating to Riemann, which is better in terms of real-time monitoring. Also, we use Pingdom for monitoring our services from outside our infrastructure. The new monitoring platform is going to use Telegraf/Riemann/InfluxDB/Grafana, where every system sends metrics via the Telegraf agent to the Riemann server, which in turn has the alerts configured & also sends metrics to InfluxDB. We then create Grafana dashboards on top of InfluxDB to visualise and analyse trends. This also helps with capacity planning for your systems. The alert notification channels are mainly Email, Slack & PagerDuty, chosen based on the severity of the issue.
For logging, we use the hosted service from Loggly. We have a Chef cookbook & an Ansible playbook, customized for every app, which push the data to Loggly. Dashboards with an appropriate set of filters then need to be created in Loggly. Not every service is integrated with Loggly yet, but irrespective of that, it’s important to have a proper log rotation strategy so that disks are not unnecessarily filled up by old & useless logs.

Performance

Performance/load testing an application is clearly necessary in order to identify which AWS instance type is the right fit. There are various tools out there & we generally use Locust, Siege or Apache Benchmark for HTTP stress testing. Both vertical & horizontal scaling need to be tuned based on the load test results until latency meets the SLA committed to customers. Another useful tool here is gor, which helps to run performance tests with live production requests, with various kinds of filtering.

So this summarises the areas we analyse to make sure we have a well-operable app going to production. The points discussed above are not limited to new software, but can be applied in more or less the same fashion to systems currently running in production as well. This is a simple checklist we follow at Indix & we initiate the process during the very early phase of application development. There is always room for improvement and this kind of process matures over time, but based on our experience so far, this approach has proved extremely useful & helped us to find a lot of problems both proactively, before services are deployed, & reactively, once they started running in production.

Interested in having a well defined Operability Review model for your organisation ? We at OpsLyft help organisations make their DevOps a success. Shoot us a mail at contact@opslyft.com for further details

Demystifying DevOps with Operability Review was originally published in FinOps Talks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Exploring Cost Optimisation on AWS

Aayush Kumar — Mon, 25 Feb 2019 14:06:16 GMT

In the present day scenario, people really value that moment of instant gratification they get upon leveraging Infrastructure as a Service, where solving the same use cases in house would otherwise have been a significant effort. Yes, we are talking about the power of Cloud Computing, which present-day software product/service companies use extensively.

All of it works perfectly well for you and keeps your life easy until one fine day you start talking about things at SCALE. You start dealing with problems like storing and organising your data at large scale, making internal as well as customer-facing systems highly available, efficient monitoring, logging and alerting mechanisms etc. All of these things can really be a matter of a few clicks if you are using public cloud solutions for real.

So, as the legends say, With Great Power Comes Great Responsibility. One can understand and work around the engineering-level responsibilities, but the next big thing which hits individuals and organisations is that you need to pay the bills for all the resources you leverage on public cloud solutions. Like it or not, your cloud costs can be a bomb if you don’t plan well.

I work extensively with AWS Cloud. To be just a little more specific, I worked at Indix HQ, a Data as a Service company building the world’s largest cloud catalog of structured marketplace product information. The scale of data is in the hundreds of TBs, which we deliver to our customers through APIs and Bulk Feeds. As a DevOps and Infrastructure person, I was responsible for designing, developing and maintaining a stable, reliable and secure cloud infrastructure for our internal and customer-facing apps on AWS Cloud.

But AWS, like any other public cloud service provider, comes at a cost. When your infrastructure on the cloud is serving and processing data at large scale, you are sure to use cloud resources extensively, and nothing is cheap then. Your bills run into thousands of dollars.

This is where your Finance and Engineering ninjas adopt cost control measures and try to bring standardisation in the use of resources on cloud as per common best practices. Some of the standards that we follow at Indix are:

Choosing the right instance types for deploying applications, so that there is no under-utilisation of the computation power offered by the selected instance type.
Using Spot Instances for systems which are not mission critical and Reserved Instances for mission critical ones.
Proper tagging of all the resources. This reduces the risk of anything getting unmonitored.
Keeping a track of service usage through Cost Explorer

One of our biggest challenges in controlling costs on AWS, despite following all the standard best practices, was that we weren’t able to track, and hence respond to, an unknown event. To elaborate, think of the following possibilities:

An unexpected autoscaling event at 3 AM which scales up and then scales down your cluster, and which you would want to keep track of in terms of cost control.
Human error of EC2 instances being left idle.
Untracked jobs which have the potential to incur huge costs due to high Data Transfer.

Though AWS’s Cost Explorer tool helps us analyse cost very efficiently, we were not sure if we could leverage it in a programmatic manner to build our own custom solutions on top of it. One fine day we came across an article by Jeff Barr of AWS which talks about how we can use the AWS Cost and Usage Report (CUR) to analyse the cost distribution on AWS. Though the blog says it all, we wanted to build an automated system which could help us track our costs at the desired time granularity.

CUR, though just a CSV file, is a really complex thing and it was going to take ages for us to parse it and extract the required cost data for our use case. Instead, we imported our CSV data into Amazon Redshift, where we can store large-scale data in a PostgreSQL-compatible data warehouse without caring about the underlying infrastructure. This way we could store our CUR data in the form of a SQL table and fetch the required data from it using simple/complex queries. It was a fair win over the effort otherwise involved in writing parsing scripts for CUR.

Added to that, AWS helps us do this entire import process in just a few clicks. It allows you to store your billing reports in an S3 bucket and import them into Redshift.

Configuring CUR for granularity and S3 bucket location. Image Source : AWS Blog

After configuring the reports, you’ll be able to get CURs in your S3 bucket. We further configured our reports to be made available in a Redshift cluster by giving a table schema for the report. (You have options to do this in the same billing console.) This meant we just needed a new Redshift cluster and we were done with the cost data loading part.

Once we had the data in our Redshift cluster, we started running queries on it right away.

Redshift console. Image Source : AWS Blog

Everything we expected after getting inspired by the blog article was working fine for us. The journey was only half complete though. What we needed was an end-to-end automated system which could do the above jobs for us and also alert our engineers or engineering managers whenever a certain cost threshold is crossed for monitored resources on AWS.

We then decided to leverage the power of AWS Lambda, which is a serverless compute service provided by AWS. We had the following design for the Lambda functions:

Lambda Function #1 : Automates the process of fetching CUR from S3 and uploading it to Redshift.
Lambda Function #2 and #3 : Reads a set of specified configurations from a JSON file, frames queries from that information, compares the obtained costs from queries with some threshold data and then sends Slack alerts if the thresholds are crossed.

Let’s take a deep dive into Lambda functions #2 and #3. These are nothing but cron-based functions which are responsible for alerting people on Slack (we use it for our internal office communications) whenever a resource on AWS incurs more than a certain specified cost.

Configuration read by Lambda functions for creating queries

Above image shows the kind of configurations our Lambda functions #2 and #3 read for framing queries. For the above specified configs, the query formed goes something like this :

select sum(cast(lineitem_unblendedcost as float)) from #TABLE_NAME where #TAG_NAME='#TAG_VALUE';

The result of this query gives us a sum which is compared against the specified Threshold value; if the threshold is crossed, an alert is sent to the specified Channel on Slack, tagging the concerned Engineering Manager (EM_Name). The following is a sample alert:

So this is it :) Over the day, if the cost shows an unexpected behaviour over our monitored resources, we get to know for sure :)
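
For readers who want to try the query-and-threshold step described above, here is a minimal Python sketch of it. It is not the actual code of our Lambda functions: the config keys `Table`, `Tag_Name` and `Tag_Value` and the wording of the alert text are assumptions made for this example, while `Threshold`, `Channel` and `EM_Name` are the fields mentioned above. Running the query on Redshift and posting to Slack are left out.

```python
import re

# Table and column names are interpolated into the SQL text, so only allow
# plain identifiers to avoid building an unsafe query.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_query(table: str, tag_name: str, tag_value: str) -> str:
    """Fill the query template with one monitored tag."""
    for name in (table, tag_name):
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"unsafe identifier: {name!r}")
    escaped = tag_value.replace("'", "''")  # escape quotes in the literal
    return (
        "select sum(cast(lineitem_unblendedcost as float)) "
        f"from {table} where {tag_name}='{escaped}';"
    )


def alert_message(rule: dict, cost: float):
    """Return the alert text if `cost` crosses the threshold, else None."""
    if cost <= rule["Threshold"]:
        return None
    return (
        f"@{rule['EM_Name']} cost for {rule['Tag_Name']}={rule['Tag_Value']} "
        f"is {cost:.2f}, above the threshold of {rule['Threshold']:.2f}"
    )


# A made-up rule, shaped like one entry of the JSON config.
rule = {
    "Table": "cur_report",
    "Tag_Name": "resourcetags_user_team",
    "Tag_Value": "search",
    "Threshold": 100.0,
    "Channel": "#cost-alerts",
    "EM_Name": "alice",
}
query = build_query(rule["Table"], rule["Tag_Name"], rule["Tag_Value"])
```

The cron-based function would run `query` against Redshift, pass the returned sum to `alert_message`, and post any non-empty result to the rule's `Channel`.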

Overall graphical representation of the architecture is something like this :

Architecture Diagram for Cost Alerting System

I named it PLUTUS :) Soon enough, at OpsLyft we are going to release this as a full-fledged product which you can use within your AWS environment.

Our vision at OpsLyft is to achieve Digital Transformation with DevOps. We are a set of experienced engineers who help organisations achieve their DevOps goals. Please reach out to us at contact@opslyft.com for any help you need in solving problems and enhancing your cloud.

Exploring Cost Optimisation on AWS was originally published in FinOps Talks on Medium, where people are continuing the conversation by highlighting and responding to this story.

Tale of Configuration Management : Our Ansible Story

Aayush Kumar — Thu, 18 Jan 2018 23:23:56 GMT

As in any other fast-growing product startup, Indix’s tech stack also evolved with time. We iterated over and over (sometimes from scratch) as we scaled to achieve our engineering milestones. These milestones were mostly definitive engineering processes and practices which were validated and confirmed to fit our use cases. One such evolution which we experienced was that of Configuration Management.
This article talks about Indix HQ’s Configuration Management journey so far and how we ended up using Ansible as a standard for infrastructure creation, provisioning and deployments.

Previous State

As of circa Aug 2015, we at Indix used a variety of tools and languages to take care of infrastructure creation, provisioning and deployment.

Infrastructure Creation: AWS command line tools, fog, knife-ec2
Provisioning: chef, chef-solo
Deployment & Orchestration: Capistrano, fabric, shell scripts, Go-CD commands, beanstalk etc.

There were some shortcomings with this approach:

Failures happened since there was no tight coupling between infra creation, provisioning and deployment. A failure at one stage led to cascading failures later.
The tooling to get things into production was pretty complex. Having to understand multiple tools and get familiar with a new language (Ruby / Python) is not optimal for most developers. This led to less participation from the larger group of developers in taking care of the systems they built. It also caused a lot of burnout for the DevOps team.
We didn’t have any single place to know the source of truth of the system. We largely used Chef in a pull-based mode, which means that agents keep polling the central Chef server to find out what changes need to be applied to the system. We experienced state discrepancies where the GitHub code, the Chef server state and the machine’s internal state didn’t match. This happened due to a lack of proper checks and balances in the way of working, developer negligence, or sometimes a failure during execution of the Chef agent, which failed silently without raising any alerts. The end result was invariably production downtime or failure of some or many of our services. Apart from this, we also saw the central Chef server become a single point of failure, leaving our systems in a bad state.

Our Experience & Learning

Too many tools to learn, each with a significant learning curve. Also, very tight coupling of provisioning across different teams / projects.
Extremely difficult to quickly create an exact replica of the production environment as an experimental one in order to try new code changes.
Difficult to tweak hardware (Amazon machine types & volume types), which is required for tuning and battle-hardening any backend system.
No consistent or simple way to configure some of the basic things, such as monitoring, logging and system sanity checks.
Neither idempotent nor immutable. This means we couldn’t simply rerun our entire infra creation, provisioning and deployment the same way multiple times and be certain that things would keep working fine.

To summarise, we wanted a system which would be:

Dead simple. This means a very low learning curve and more developer participation. The indirect consequence is that people can move fast and get things done rather than depending entirely on the DevOps team.
A single (or minimal) tool for infra creation, provisioning and deployment.
One way to do things. Not 5 different languages and frameworks. Something like YAML.
Building idempotent and immutable infrastructure. This means you can run the same script to create, provision and deploy infra multiple times and it should simply work. It also means that recreating infra, provisioning and deployment shouldn’t be an exercise in frustration but should become part of the way we operate.
The above point ensures that it is easy to replicate complete environments with minimal overhead. It also ensures that our continuous delivery system is used only as a trigger and doesn’t hold any state or intelligence.
Ability to quickly create a local dev setup and test environments, and to write integration tests.

Ansible Ecosystem

We actively followed Ansible for over a month before seriously trying it out. It seemed to solve all the problems described earlier in a really simple and elegant fashion.

The major problems that it addresses are:

Single place to create infra, provision, deploy and orchestrate.
You write only in YAML.
The primary mode of operation is push-based, which means the system is immutable after deployment. It can also be used in pull-based mode, which is useful for, say, a Hadoop cluster where nodes may come and go. In both cases, however, the GitHub source is the single source of truth.
Great documentation and solid pre-built modules. For complex things like machine creation and ensuring that only a single machine in a group exists, you don’t have to do anything. Most of the modules are idempotent. This is true not just for provisioning (where it is also true for Chef and Puppet) but also for infra creation and deployment.

Current State:

As of now, most of our infra is under configuration management via Ansible. This includes legacy systems migrated from no configuration management to an Ansible-based setup and a few Chef-based setups migrated to Ansible. Any new system built by developers is brought up using Ansible.
Owing to its simplicity, the best and perhaps most important milestone we have been able to achieve is that developers now build their systems end to end: they not only write application code but also write the Ansible scripts for the application’s infrastructure creation, provisioning and deployment. This is a notable cultural practice that has been standardised, and it shares the ownership and responsibility of the DevOps team with developers.

I would like to thank Varun Jain, ex-Indixer (now CEO, SendX), who played a key role in originating the idea of adopting Ansible as an organisation-wide configuration management tool. Most of the points covered in this article are inspired by the analysis he presented to us to motivate Ansible adoption.

Demystifying DevOps with Operability Review

Aayush Kumar — Sun, 29 Oct 2017 13:53:13 GMT

As part of the software development lifecycle, developers focus mostly on coding, optimising and testing the business logic. However, while writing the application code, or even before, little attention is given to what happens once the application is up and running in production as a dependent or independent system. This leads to a lot of complexity, inefficiency, unwanted surprises and challenges when the code is pushed to production, most of which could have been eliminated or reduced had we thought of them a little earlier. Hence the need for an Operability Review. By software operability, we mean the readiness of new software to be deployed in production and start handling real traffic. The purpose of an Operability Review of a service is to ensure that it runs well once deployed, operates as smoothly as possible, and needs minimal manual intervention to get it up and running.
At Indix HQ, we take operability seriously and engage everyone in regular reviews of the operability items so that we don’t miss anything important or discover it late. In this blog, we discuss the process we follow, the areas we focus on during such reviews and how well this has served our product life cycle.

Elements of Operability Review

System Architecture
Configuration management & deployment
Security
Data storage
Monitoring
Performance

Image Credits : Arijit Bhattacharyya

System Architecture

Architecture & Deployment diagram:

Deployment platform:

Load Balancing:

If the service runs on multiple instances and needs load balancing, then some of the important things to decide for the ELB are:
Is the service user-facing? If not, make sure to use an internal ELB (routing via private IPs); otherwise network packets will traverse the public internet, which is unnecessary and may raise security concerns.
Make sure to enable cross-AZ load balancing. This helps to distribute requests among the registered nodes irrespective of whether they are evenly spread across the AZs.
Enable Connection Draining with an appropriate timeout setting. This ensures that in-flight requests are not interrupted when an instance is taken out of rotation (OOR) for maintenance or a similar reason.
Set the Idle Timeout appropriately. The Idle Timeout is how long a connection can stay idle before the ELB closes the socket. This applies to both the client-side and the EC2-side sockets.
Enable access logs for the ELB. This is very important: without access logs, it becomes too difficult to troubleshoot issues in production systems when requests come through the load balancer. We have an S3 bucket designated to store the logs of every ELB under top-level folders / paths. It might also be useful to push these logs to your centralized logging system for further analytics.
Next are the stickiness settings. Generally speaking, stickiness goes against the very concept of load balancing, but sometimes one may need session affinity to ensure that every client request in a session is routed to a particular instance. This is generally implemented using AWSELB or application cookies.
Finally, decide what kind of load balancer is needed. If we need features like content-based routing or multiple listener ports per EC2 node, ALB is the choice. But note that ALB doesn’t support TCP listeners yet, so only HTTP/HTTPS can be used. Use ELB otherwise.
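The checklist above can also be expressed as a small review function. This is only a sketch under stated assumptions: the load balancer is described by a plain dictionary, and the keys used here (`user_facing`, `scheme`, `cross_zone`, `connection_draining`, `access_logs`, `kind`, `listeners`) are hypothetical names for this example, not AWS API fields.

```python
def review_load_balancer(lb):
    """Return a list of findings for a load balancer described by a plain dict.

    All dictionary keys are hypothetical and exist only for this sketch.
    """
    findings = []
    if not lb.get("user_facing") and lb.get("scheme") != "internal":
        findings.append("service is not user-facing: use an internal load balancer")
    if not lb.get("cross_zone"):
        findings.append("cross-AZ load balancing is not enabled")
    if not lb.get("connection_draining"):
        findings.append("connection draining is not enabled")
    if not lb.get("access_logs"):
        findings.append("access logs are not enabled")
    if lb.get("kind") == "alb" and "tcp" in lb.get("listeners", []):
        findings.append("TCP listener requested on an ALB: use ELB instead")
    return findings


if __name__ == "__main__":
    candidate = {
        "user_facing": False,
        "scheme": "internet-facing",
        "cross_zone": True,
        "connection_draining": True,
        "access_logs": False,
        "kind": "elb",
        "listeners": ["http"],
    }
    for finding in review_load_balancer(candidate):
        print("-", finding)
```

Idle timeout and stickiness are left out on purpose, since the right values depend on the service and cannot be checked with a simple yes/no rule.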

High Availability:

Deploy your instances in multiple regions & in multiple AZs in each of the regions.
Figure out any SPOF in the deployment. If it blocks the real-time response of the app, it is not acceptable. We need to ensure that a replica exists and is distributed across independent network infra. If the service uses an RDBMS, we generally use RDS with Multi-AZ deployments, which ensures synchronous replication of the data to a standby instance in a different Availability Zone (AZ).
Last but probably most important is the choice of EC2 purchase type. We use Spot instances heavily to reduce our cost, but that brings the additional difficulty of maintaining high availability. Depending on the nature of the service, it may not be possible to use Spot instances at all, but most of the time a hybrid model works.
We imitate real production scenarios & try to estimate the numbers for MTTD (Mean-time-to-detect)/ MTTR (Mean-time-to-recover)/ RPO (recovery point objective)/ RTO (recovery time objective) for our services.
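As a rough illustration of how such numbers can be estimated from drill results, the sketch below averages detection and recovery times over a list of simulated incidents. The `Incident` record is hypothetical, and the sketch assumes both MTTD and MTTR are measured from the moment the failure started; teams that measure MTTR from detection would subtract `detected` instead.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Timestamps recorded for one simulated failure (hypothetical record)."""
    started: datetime    # when the failure was injected
    detected: datetime   # when monitoring raised the alert
    recovered: datetime  # when the service was healthy again


def mean_delta(deltas):
    """Average a non-empty iterable of timedeltas."""
    deltas = list(deltas)
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents):
    """Mean time to detect, measured from the start of each failure."""
    return mean_delta(i.detected - i.started for i in incidents)


def mttr(incidents):
    """Mean time to recover, measured from the start of each failure (an assumption)."""
    return mean_delta(i.recovered - i.started for i in incidents)


if __name__ == "__main__":
    t0 = datetime(2017, 10, 1, 12, 0)
    drills = [
        Incident(t0, t0 + timedelta(minutes=4), t0 + timedelta(minutes=30)),
        Incident(t0, t0 + timedelta(minutes=6), t0 + timedelta(minutes=50)),
    ]
    print("MTTD:", mttd(drills))
    print("MTTR:", mttr(drills))
```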

Scalability:

Configuration Management & Deployment

CM Tool & Model:

CI/CD Pipelines:

We use ThoughtWorks GoCD as the only CI/CD tool at Indix. Every app should have separate Staging and Production pipelines that automate the testing and deployment of the Staging and Production clusters respectively.

Security

Data Storage

Monitoring

For the internal monitoring platform, we have been using a combination of Sensu and CloudWatch, but we are now migrating to Riemann, which is better for real-time monitoring. We also use Pingdom to monitor our services from outside our infrastructure. The new monitoring platform uses Telegraf, Riemann, InfluxDB and Grafana: every system sends metrics through a Telegraf agent to the Riemann server, which has the alerts configured and also forwards the metrics to InfluxDB. Grafana dashboards on top of InfluxDB then visualize the trends, which also helps with capacity planning. Alert notifications go out mainly through email, Slack and PagerDuty, with the channel decided by the severity of the issue.
For logging, we use the hosted service from Loggly. We have a Chef cookbook and an Ansible playbook, customized for every app, that push the data to Loggly. Dashboards with an appropriate set of filters then need to be created in Loggly. Not every service is integrated with Loggly yet, but regardless of that, it’s important to have a proper log rotation strategy so that disks are not unnecessarily filled up by old and useless logs.
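For services written in Python, one simple way to get log rotation is the standard library's `RotatingFileHandler`. The sketch below is only an example of the idea, not our actual setup; the helper name `build_rotating_logger`, the file name and the size limits are assumptions chosen for illustration.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_rotating_logger(name, log_path, max_bytes=10 * 1024 * 1024, backup_count=5):
    """Create a logger that caps disk usage at roughly max_bytes * (backup_count + 1)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = RotatingFileHandler(log_path, maxBytes=max_bytes, backupCount=backup_count)
    handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    # Small limits so the rotation is easy to observe; "app.log" is a hypothetical path.
    log = build_rotating_logger("demo-service", "app.log", max_bytes=1024, backup_count=2)
    for i in range(200):
        log.info("processed request %d", i)
```

With these settings the oldest file is deleted once the backup count is exceeded, so the logs cannot grow without bound. Tools such as logrotate achieve the same effect at the system level for services that only write to plain files.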

Performance

Thanks to Arijit Bhattacharyya, our Director of Engineering, DevOps, for laying the foundation of the Operability Review process at Indix. All the practices mentioned in this blog are implemented as a standard checklist for any system (internal or customer facing) as part of his review process :)