Learning.com’s State of DevOps

Steve Gale
Learning.com Tech Blog
6 min read · Feb 15, 2018

Like many software companies today, Learning.com is transitioning to DevOps. The ability to deliver bug fixes, features, and infrastructure changes more quickly is among the end goals. The path to get there has involved new methodologies, procedures, and software tools. In this blog post I will share the story of our DevOps journey: where we were, what happened, where we are now, and what the future holds.

What is DevOps?

According to Amazon Web Services:

DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity: evolving and improving products at a faster pace than organizations using traditional software development and infrastructure management processes. This speed enables organizations to better serve their customers and compete more effectively in the market.

In other words, we wanted to make the transition from a slower, more reactive department into a faster, dynamic, and proactive team.

Where we were — Before DevOps — Fall 2015

At this point in time, Learning.com’s infrastructure was composed of individual snowflake servers, each housing one or more parts of our platform. The architecture could be dubbed “monolithic” or “legacy.” The deployment process used custom PowerShell scripts written for specific environments and servers. There was no configuration management, and DevOps wasn’t even on the radar. The situation showed a lot of potential for improvement.

As a department, operations was called WebOps, and the team was newly formed in 2015. Prior to this, the development team had assumed most of the roles and responsibilities of website and server administration. Learning.com was hosted entirely by a managed service provider (MSP) in a single data center not far from Portland, Oregon. This had some perks:

  • They owned and managed the computers/network, power, and cooling: This saved the company money by not having to purchase and maintain hardware
  • Multiple WAN options: For fault-tolerance and load-balancing
  • Power resiliency: The same fault-tolerance benefit, applied to power
  • Secured facility: Knowing that the hardware was physically safe from outsiders
  • Patching: The MSP took care of patching the guests

It also had some drawbacks:

  • Cost: The monthly operational expenditure was substantial
  • Lack of control below the software level: Since our access began at the OS level, any hypervisor- or VM-level issue was out of our hands
  • Lack of monitoring: With no visibility at the hypervisor level, it was impossible for us to monitor and troubleshoot VM host issues, let alone network issues. This led to support tickets questioning the reliability of the resources we had been promised
  • Wait time for requests to be fulfilled: Any new server provisioning, network change, or storage change was out of our control, so it required opening a support ticket and waiting

The infrastructure we were running on was holding us back. Waiting half a day for a DNS change, or a day for a new VM, was unacceptable. We needed something that allowed us to move at greater speed.

What happened — Amazon Web Services — Summer 2016

By now, AWS, the largest public cloud with the most tools and resources on offer, was the elephant in the room. It was only natural for Learning.com to begin setting up a proof of concept in AWS. This is where we began trying to drink from the firehose. So much of AWS is unknown territory to a traditional sysadmin or operations engineer who has never worked in the cloud. The ability to automate everything was mind-boggling at first. We needed to understand how concepts like shared tenancy, object storage, and virtualized networking applied to this brand-new environment.

We ran into a few snags, the largest being the lack of shared storage, which ended up changing the way we cluster Microsoft SQL Server entirely. A consulting group came in and gave us pointers, recommendations, and some great guidance on storage performance for our database servers. Most importantly, though, our software architect engineered the way EC2 instances would be provisioned, bootstrapped, and then have their software configuration laid down, all through code. We began using CloudFormation, writing .json files to describe the exact AWS resources we wanted. On top of that, we used a Python library called troposphere that could generate the .json for us from simpler, cleaner Python. This took care of the VPC, customer/VPN/internet gateways, route tables, subnets, and security groups. Provisioned EC2 instances used the cfn helper scripts to call PowerShell or Bash scripts that renamed instances, joined them to the domain (if Windows), installed open-source Puppet, and then ran a puppet apply. We started using a Git repository to keep track of all the code, and we used Atlassian Bamboo for CI/CD. It did the job of grabbing the latest code (from both Learning.com’s platform and infrastructure codebases) and making sure it was deployed onto new instances upon initial server creation.
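To make that concrete, here is a minimal sketch of the troposphere-to-CloudFormation flow. It is not our actual template: the resource names, CIDR blocks, AMI ID, and instance type are placeholders, and the AWS::CloudFormation::Init metadata that tells cfn-init which PowerShell to run (rename, domain join, Puppet install, puppet apply) is omitted for brevity.

from troposphere import Base64, Join, Ref, Tags, Template
from troposphere.ec2 import (
    VPC, Subnet, SecurityGroup, SecurityGroupRule, Instance,
)

t = Template()

# Networking: a VPC, one subnet, and a security group allowing HTTPS in.
vpc = t.add_resource(VPC(
    "AppVpc",
    CidrBlock="10.0.0.0/16",
    Tags=Tags(Name="app-vpc"),
))

subnet = t.add_resource(Subnet(
    "WebSubnet",
    VpcId=Ref(vpc),
    CidrBlock="10.0.1.0/24",
))

web_sg = t.add_resource(SecurityGroup(
    "WebSecurityGroup",
    GroupDescription="Allow HTTPS in",
    VpcId=Ref(vpc),
    SecurityGroupIngress=[SecurityGroupRule(
        IpProtocol="tcp", FromPort="443", ToPort="443", CidrIp="0.0.0.0/0",
    )],
))

# A Windows web server whose UserData kicks off cfn-init; in a real
# template, cfn-init metadata on this resource would run the bootstrap
# scripts described above.
t.add_resource(Instance(
    "WebServer",
    ImageId="ami-12345678",          # placeholder AMI id
    InstanceType="t2.large",         # placeholder instance type
    SubnetId=Ref(subnet),
    SecurityGroupIds=[Ref(web_sg)],
    UserData=Base64(Join("", [
        "<script>\n",
        "cfn-init.exe -v --stack ", Ref("AWS::StackName"),
        " --resource WebServer --region ", Ref("AWS::Region"), "\n",
        "</script>",
    ])),
))

# The generated JSON is what actually gets handed to CloudFormation.
print(t.to_json())

The payoff of generating the .json this way is that the Python stays short and readable while CloudFormation still receives the exact, fully expanded resource definitions.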

Before we really knew what was going on, our entire tool set and method for infrastructure and server deployment had changed. Shortly after this, our department adopted Scrum, and two-week sprints became the norm. This was the first big step into DevOps for us. We adopted the tools; that was the easy part. We had started to adopt the practices too: our infrastructure was now all code, and Agile was now our framework for project management. We were on the right track, but we still had a lot of learning to do.

Where we are now

Fast forward to 2018. We have a much greater grasp on the tools and practices after using them for a year and a half. VS Code is my go-to editor, and reviewing pull requests is a normal, everyday thing. We have developed better processes for working in AWS and for the troubleshooting steps that are necessary when things go awry. Now it’s easy to debug where something goes wrong in the deployment process.

Since the move we have set up a Docker Swarm environment, and all new development is being done in containers, as microservices. This further eases testing and deployments while moving us away from the monolithic architecture. We have also moved our test and stage environments out to AWS to mimic the production environment as closely as possible, and we have spun up an entire disaster recovery environment in another AWS region. When we have a large increase in traffic, or when we just need more web servers in our load balancer due to unknown effects from Meltdown/Spectre, we are able to deliver quickly.

In doing all of these projects we have gained a better, hands-on understanding of the software life cycle. Eventually the WebOps department changed its name to DevOps. The knowledge we gained from working together has also increased trust and the levels of responsibility between operations and development. Communication between the departments has increased too, as DevOps is better able to handle questions, issues, and requests.
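As an illustration of what “delivering quickly” can look like in a Swarm environment, here is a small sketch using the Docker SDK for Python. It is not our production tooling: the service name “web” and the replica counts are made up, and it assumes a replicated Swarm service already exists.

# Hypothetical example: add capacity to a Swarm web service during a
# traffic spike. Assumes the "docker" Python package and an existing
# replicated service named "web"; both are placeholders.
import docker

client = docker.from_env()            # talks to the local Swarm manager
service = client.services.get("web")

current = service.attrs["Spec"]["Mode"]["Replicated"]["Replicas"]
print(f"web is currently running {current} replicas")

service.scale(current + 4)            # spin up four more web containers

Because the containers are stateless microservices, scaling is a one-line change rather than a ticket to a provider and a day of waiting.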

What the future holds

Our biggest project on the horizon is Puppet Enterprise. We will move from masterless, open-source Puppet 3 to a full-fledged Puppet Enterprise 5 master. This will change the deployment process somewhat, but it will let us use some much-wanted features (like faster debugging of Puppet code). A project of this size and scope would not even have been possible for us a year ago. With the confidence we have today, I am optimistic about the future of DevOps at Learning.com. The most important changes going forward will be those of process and philosophy. The development and DevOps departments must become even more intertwined in their work. It’s one thing to label yourself “DevOps”; it is a much larger thing to actually be DevOps.
