Exploring a Tech Stack, Part 1A: Production Environment

My Journey Into a World-class Website’s Tech Stack

Jul 18, 2020

Introduction

Lately, I’ve started looking at the different tech stacks that organizations and companies work with. I’m not sure how recently the term tech stack was coined, but it’s a convenient term to encapsulate the technologies being used. It makes it easy to say a lot about an organization in very little time.

I can get a feel for how recently they started building and whether they are technologically conservative or riding the cutting edge. Whether their stack is deep, with a well-thought-out set of DevOps tools and utilities, or barebones can also say something about the evolutionary stage they are in.

A website like stackshare.io is a boon to curious developers. Entrepreneurial developers looking to build their own applications or companies can learn from the experiences of other organizations and avoid less-informed decisions that could stunt future growth. Job-seeking developers can drill down, hyperfocusing on the skills and technologies used at XYZ-Dream Company. Meanwhile, architects, SMEs and drew-the-short-straw, treading-water system designers can find best practices and make better-informed decisions about the next additions to their stack. Sarcasm aside, let’s take a look.

The Plan

Off the top of my head, when someone says tech stack, I think of something made up of:

programming language(s)
web framework(s)
database management system(s)
web server, proxy and content delivery networks
logging, testing and DevOps tools

As a generality, that is true. But what I’ll be doing over the course of this week is going through Medium’s tech stack, which Dan Pupius wrote about back in 2015 in the article “The Stack That Helped Medium Drive 2.6 Millennia of Reading Time”.

Along the way, I’ll explore and share my thoughts and mental wanderings as I seek to learn more about the purpose and power of the tools used to run Medium.

I believe it will be an invigorating adventure of the mind and welcome you to join me! 🧗🏻‍♂️

The Production Environment

The first thing that Dan talks about is the production environment, which consists of:

Amazon Virtual Private Cloud
Ansible
Node
Go
CloudFlare
Fastly
CloudFront
Nginx
HAProxy
Datadog
PagerDuty
Elasticsearch
Logstash
Kibana

Before diving into the stack, three questions come to mind.

1. What is the production environment?

An environment is a computer system where an application is being run. A production environment is the environment where end-users are able to interact with your application. Microsoft’s BizTalk docs define it as the “final endpoint in the release management process and should only host […] applications that have previously undergone development, unit testing, load testing, and staging in other environments. Thorough unit testing, load testing, and staging beforehand will help ensure maximum performance and uptime for the […] application in the production environment.”[1]

2. What other environments exist?

Generally, there are 4 types of environments. They are:

Development
Test
Staging
Production

3. What happens in each of these environments?

Each environment has a specific purpose in the process of application development. Regardless of whether the development methodology is Waterfall or Agile, software development goes through these seven steps [2]:

1. Conception
2. Initiation
3. Analysis
4. Design
5. Construction
6. Testing
7. Deployment

Steps 1 through 4 take place outside the environments.

Development: Step 5 is the development (coding) stage and takes place in the development environment. This environment can be further divided into each individual developer’s own machine (laptop/desktop) and a shared environment where changes are aggregated across a working unit, multiple units or an entire organization. This is where a Version Control System (VCS) like Git manages changes to the code base. It is also where unit tests are written and run. A unit test checks an individual part of an application, such as a single method, procedure or function [3].
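To make the idea of a unit test concrete, here is a minimal sketch in Python using the standard library’s `unittest` module. The function under test, `reading_time_minutes`, and its 265 words-per-minute rate are hypothetical, made up for illustration; the point is that each test checks one small behavior of one function in isolation.

```python
import unittest


# Hypothetical function under test: estimate reading time from a word count,
# assuming a reading speed of 265 words per minute.
def reading_time_minutes(word_count: int, words_per_minute: int = 265) -> int:
    """Return the estimated reading time in whole minutes (at least one)."""
    if word_count < 0:
        raise ValueError("word_count must be non-negative")
    minutes = -(-word_count // words_per_minute)  # ceiling division
    return max(1, minutes)


class ReadingTimeTest(unittest.TestCase):
    """Each test exercises exactly one behavior of the function above."""

    def test_rounds_up_partial_minutes(self):
        self.assertEqual(reading_time_minutes(300), 2)

    def test_short_text_is_at_least_one_minute(self):
        self.assertEqual(reading_time_minutes(10), 1)

    def test_rejects_negative_counts(self):
        with self.assertRaises(ValueError):
            reading_time_minutes(-1)


if __name__ == "__main__":
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReadingTimeTest)
    unittest.TextTestRunner(verbosity=2).run(suite)
```

Because the tests only touch one function, a failure points straight at that unit rather than at the whole application.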

Testing: Step 6 is the testing stage, which can encompass many different types of testing depending on the size and needs of the organization. Some examples are:

Integration Testing: units are combined and tested as a group [3].
System Testing: the entire software is put together and tested [3].
Performance Testing: load testing ensures normal operation within the expected workload limits, while other tests examine performance under sustained usage and unexpected levels of workload [5].
User Acceptance Testing: knowledgeable end users use the application to certify that software changes work as expected [4].

Staging: The staging environment is a clone of the production environment and acts as a bridge between the testing and production environments. It is used as a “test sandbox that is isolated from the production environment. It can be used to try out new features or functions with real data without impacting the production database” [6]. It is also where a release candidate (the next version) of an application resides for final testing and approval before being launched to production.

Production: The final destination for applications is the production environment, which faces your end users. At this stage, the application is “producing” the value it was developed for.

Now, back to the tools used in the production environment. To begin, I’ll break the list into subcategories for readability, and then we’ll look at each one.

Servers and Hosting: Amazon Virtual Private Cloud (with Ansible), Cloudflare, Fastly, CloudFront, Nginx and HAProxy

Programming Languages: Node, Go

Logging and Monitoring: Datadog, PagerDuty, ELK (Elasticsearch, Logstash, Kibana)

What is Amazon Virtual Private Cloud?

“Amazon VPC lets you provision a logically isolated section of the Amazon Web Services (AWS) cloud where you can launch AWS resources in a virtual network that you define.” [7] Logically isolated means that your cloud, or virtual network, is kept separate from other networks that may be running on the same physical machines, even though it does not have a standalone server (also known as a bare-metal server) to itself.

Logical Isolation

The logical isolation between your cloud and another is provided by a hypervisor, a software layer that manages the separation of computer resources. The term hypervisor was coined by IBM in the 1960s, during the mainframe era [8]. A good introductory guide to hypervisors is available at IBM’s Cloud Learn Hub.

Isolation is provided on Amazon VPC through public and private subnets (ranges of IP addresses), security groups and route tables. Individual resources can be isolated from one another or from the Internet, giving control over which services can be reached directly from outside the virtual network [9].
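To picture what a subnet is, here is a small sketch using Python’s standard `ipaddress` module. The VPC range (10.0.0.0/16) and the two subnets are hypothetical values I picked for illustration. Note that in AWS, a subnet is “public” because of its route table (a route to an internet gateway), not because of its name; the names here are just labels.

```python
import ipaddress
from typing import Optional

# Hypothetical address plan: a VPC with a 10.0.0.0/16 range, carved into
# one "public" and one "private" subnet. The CIDR values are illustrative.
vpc = ipaddress.ip_network("10.0.0.0/16")
subnets = {
    "public": ipaddress.ip_network("10.0.1.0/24"),
    "private": ipaddress.ip_network("10.0.2.0/24"),
}

# Every subnet must be a range of addresses inside the VPC's own range.
for name, net in subnets.items():
    if not net.subnet_of(vpc):
        raise ValueError(f"{name} subnet {net} is outside the VPC {vpc}")


def subnet_for(ip: str) -> Optional[str]:
    """Return the name of the subnet an address falls into, or None."""
    addr = ipaddress.ip_address(ip)
    for name, net in subnets.items():
        if addr in net:
            return name
    return None


print(subnet_for("10.0.1.25"))    # public
print(subnet_for("10.0.2.200"))   # private
print(subnet_for("192.168.0.1"))  # None: not part of this VPC
print(vpc.num_addresses, subnets["public"].num_addresses)  # 65536 256
```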

There are many ways to connect separate VPCs, as well as a user’s own networks, to Amazon VPC. They are covered in detail in Amazon’s document “Amazon Virtual Private Cloud Connectivity Options” (Jan. 2018), but are beyond the scope of my personal research.

According to the Amazon AWS Security whitepaper [10], a standard EC2 instance is assigned a random public IP address. However, a VPC lets administrators launch EC2 instances at specific IP addresses within the VPC, allowing for routing and security between instances, as well as to the Internet.

In Medium’s case, Dan seems to suggest that they have “about a dozen production services” (instances or servers) running in their VPC, some larger and others playing smaller, more specific roles. It also appears they use Node.js for their main servers and the Go programming language for auxiliary services to gain performance boosts.

What is Ansible?

Dan mentions the use of “Ansible for system management” and configuration source control. I imagine this means that, similar to how Git tracks changes to source code, Ansible lets server configurations be described and tracked in a systematic way.

Red Hat lists four major needs of IT management that Ansible addresses:

Cloud Provisioning

Provisioning is the process of setting up an instance of a server or service. This includes all the steps to have a basic level of functionality and accessibility. Red Hat defines four types of provisioning [13]:

Server provisioning: get the server up and running
User provisioning: add users, roles and permissions
Network provisioning: set up network access
Service provisioning: set up auxiliary services and data

Configuration Management

After the systems are up and running, Ansible provides a reliable way to manage the configurations of each node following a particular template, or individually.

It does this by connecting to each machine (node) that the administrator specifies in an inventory file. After connecting over the SSH (Secure Shell) protocol, using keys or passwords, Ansible runs a series of commands to align that machine’s state with the desired configuration [11].
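The core idea of configuration management, comparing each node’s current state with a desired state and changing only what differs, can be sketched in a few lines of Python. This is a toy model of the concept under my own assumptions (node state as a dictionary, hypothetical host names and settings), not how Ansible is actually implemented.

```python
from typing import Dict

State = Dict[str, str]


def plan_changes(current: State, desired: State) -> State:
    """Return only the settings whose current value differs from the desired one."""
    return {key: value for key, value in desired.items() if current.get(key) != value}


def apply(current: State, desired: State) -> State:
    """Apply the planned changes to a node and report what was changed."""
    changes = plan_changes(current, desired)
    current.update(changes)
    return changes


# Hypothetical inventory: two nodes that have drifted apart.
inventory: Dict[str, State] = {
    "web1": {"nginx": "installed", "node_version": "12"},
    "web2": {"nginx": "missing", "node_version": "10"},
}
desired_state: State = {"nginx": "installed", "node_version": "12"}

for host, state in inventory.items():
    print(host, "changed:", apply(state, desired_state))

# A second pass changes nothing, because every node already matches.
for host, state in inventory.items():
    print(host, "changed:", apply(state, desired_state))
```

Running it twice shows why this approach is safe to repeat: the first pass fixes only `web2`, and the second pass finds nothing left to change.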

Application Deployment

Ansible can also automate the deployment of applications by following a Playbook that you, as an administrator, provide. “Playbooks are Ansible’s configuration, deployment and orchestration language” [12] and are written in YAML.

Red Hat provides an example that succinctly describes this functionality:

Step 1. Log into an application server.

Step 2. Turn off logging.

Step 3. Remove from load balancing.

Step 4. Deploy application.

Step 5. Add to load balancing.

Step 6. Turn on logging.

Step 7. Move on to the next 10 servers, repeating the preceding steps for each.
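Red Hat’s seven steps amount to a rolling update. Here is a minimal Python sketch of that loop, where `set_logging`, `set_in_load_balancer` and `deploy` are hypothetical stand-ins that only record what would happen; Step 1 (logging in) is left implicit. As far as I can tell, a real playbook controls the batch size with the play-level `serial` keyword rather than hand-written code like this.

```python
from typing import List

# Hypothetical stand-ins for real operations: each one only records what
# would happen, so the order of steps can be inspected afterwards.
log: List[str] = []


def set_logging(server: str, enabled: bool) -> None:
    log.append(f"{server}: logging {'on' if enabled else 'off'}")


def set_in_load_balancer(server: str, enabled: bool) -> None:
    action = "added to" if enabled else "removed from"
    log.append(f"{server}: {action} load balancer")


def deploy(server: str, version: str) -> None:
    log.append(f"{server}: deployed {version}")


def rolling_deploy(servers: List[str], version: str, batch_size: int = 10) -> int:
    """Update servers one batch at a time so the rest keep serving traffic.

    Within a batch, each step runs on every server before the next step
    starts. Returns the number of batches processed.
    """
    batches = 0
    for start in range(0, len(servers), batch_size):
        batch = servers[start:start + batch_size]
        for server in batch:
            set_logging(server, False)           # Step 2: turn off logging
            set_in_load_balancer(server, False)  # Step 3: remove from load balancing
        for server in batch:
            deploy(server, version)              # Step 4: deploy application
        for server in batch:
            set_in_load_balancer(server, True)   # Step 5: add to load balancing
            set_logging(server, True)            # Step 6: turn on logging
        batches += 1                             # Step 7: move on to the next batch
    return batches


servers = [f"app{i:02d}" for i in range(25)]
print(rolling_deploy(servers, "v2"))  # 3 batches: 10 + 10 + 5 servers
```

Only one batch is ever out of the load balancer at a time, so the remaining servers keep the site up during the release.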

Intraservice Orchestration

In the case of a website like Medium.com, Dan mentions at least a dozen production services, possibly on multiple machines, running in different environments. A single launch of a new version of the production (live) site might involve a hundred tasks that need to go off in order, in the frontend and backend, on different networks, and everything needs to be “orchestrated” to prevent a fustercluck and a call from the CEO.

Red Hat acknowledges that other orchestration tools exist, but argues that Ansible’s usefulness lies in its ability to “orchestrate different conductors in different environments.” [14]
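One way to picture orchestration is as a set of tasks with dependencies that must run in a valid order. The sketch below uses Python’s standard `graphlib` module (Python 3.9+) to group a hypothetical release into stages; the task names are invented for illustration and are not Medium’s actual process.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical release: each task maps to the set of tasks that must finish first.
release_tasks = {
    "run_db_migrations": set(),
    "deploy_go_services": {"run_db_migrations"},
    "deploy_node_servers": {"run_db_migrations"},
    "purge_cdn_cache": {"deploy_node_servers"},
    "update_load_balancer": {"deploy_go_services", "deploy_node_servers"},
    "announce_release": {"purge_cdn_cache", "update_load_balancer"},
}


def plan_stages(tasks: dict[str, set[str]]) -> list[list[str]]:
    """Group tasks into stages; every task's prerequisites sit in earlier stages."""
    sorter = TopologicalSorter(tasks)
    sorter.prepare()  # raises graphlib.CycleError if the dependencies form a loop
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose prerequisites are all done
        stages.append(ready)
        sorter.done(*ready)
    return stages


for number, stage in enumerate(plan_stages(release_tasks), start=1):
    print(f"stage {number}: {', '.join(stage)}")
```

Tasks in the same stage do not depend on each other, so an orchestrator could run them in parallel; a circular dependency is caught before anything runs.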

Ansible Extras

The Ansible ecosystem also includes Ansible Tower and Ansible Analytics, which are visual dashboards. Tower provides a GUI for controlling access to Ansible commands, scheduling and running jobs, and managing inventory graphically. Ansible Analytics is a dashboard GUI for logs and statistics tracking Ansible’s performance.

On Cloudflare, Fastly, CloudFront

Dan estimated 90% of Medium’s static assets were sent through Cloudflare (in 2015), with 5% going to Fastly and CloudFront. I didn’t have too much interest in the topic of Content Delivery Networks (CDNs). You upload, it distributes. Yada, yada, yada. Cloudflare’s Argo Smart Routing seems interesting for the speed-obsessed and I skimmed this DDoS attack tutorial and this blog post “Comparing HTTP/3 vs. HTTP/2 Performance” by Sreeni Tellakula.

Nginx and HAProxy

Dan doesn’t go into too much detail about the exact server setup, but mentions Nginx and HAProxy for “reverse proxies and load balancers”. I find these tools much more interesting and I immediately have three questions:

What is a proxy?
What is a load balancer?
What’s the difference between Nginx and HAProxy?

What is a Proxy?

Let’s take a step back and ask the question: what is a proxy? I think a good way to describe it is to relate it to things we are more familiar with. Perhaps you’ve designated a health care proxy, somebody who can make medical decisions for you in case you cannot make them yourself. There is also power of attorney, another form of proxy: the authority to act in place of somebody else.

Forward Proxy

Cloudflare summarizes it this way, “A forward proxy, often called a proxy, proxy server, or web proxy, is a server that sits in front of a group of client machines. When those computers make requests to sites and services on the Internet, the proxy server intercepts those requests and then communicates with web servers on behalf of those clients, like a middleman.” [15]

A common use case for a forward proxy is in academic and corporate computer networks. These networks may block outgoing Internet traffic to prohibited sites or limit access to social media, gambling or pornography sites.

The proxy sits on the network between the Internet and the user’s computer. When a user requests a website from their computer, the request enters the local network, where it is directed to the proxy server. The proxy then determines whether the traffic is allowed to continue to the Internet. If so, it passes the request on to the requested web address; if not, the request is blocked.
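The allow-or-block decision described above can be sketched as a small Python function. The blocked domains are placeholders, and a real forward proxy would also open its own connection to the destination and relay the response; this only shows the filtering step.

```python
from urllib.parse import urlparse

# Hypothetical policy for a school or office network; the domains are placeholders.
BLOCKED_DOMAINS = {"gambling.example", "social.example"}


def is_blocked(host: str) -> bool:
    """Block a listed domain and any of its subdomains."""
    host = host.lower().rstrip(".")
    return any(host == d or host.endswith("." + d) for d in BLOCKED_DOMAINS)


def handle_outgoing(url: str) -> str:
    """Decide what the proxy does with a client's outbound request."""
    host = urlparse(url).hostname or ""
    if is_blocked(host):
        return f"403 blocked: {host}"
    # A real proxy would now contact the site itself and relay the response,
    # so the site sees the proxy's address instead of the client's.
    return f"forwarded: {url}"


print(handle_outgoing("https://www.social.example/feed"))
print(handle_outgoing("https://medium.com/"))
```

Matching on the whole domain (or a dot-separated suffix) avoids accidentally blocking an unrelated name that merely ends with the same letters.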

Other reasons that proxy servers are used are:

Speed improvements by caching data at the network edge
Security benefits by encrypting requests and blocking unknown websites
Strengthened privacy and location obfuscation, which are related to security

The destination server only interacts with the proxy, so the user’s machine behind the proxy remains hidden from view. Its IP address and other identifying information are not accessible to the destination server.

Reverse Proxy

The reverse proxy’s role is different: it sits behind the firewall, in front of the destination server and/or its network, and keeps users from interacting with the destination server directly. Once a user’s request reaches the reverse proxy, the proxy is responsible for directing (or rejecting) that request within the network.

Because the reverse proxy acts as a doorman of sorts, monitoring all incoming traffic and directing it to an appropriate destination, it offers benefits that are not available when users interact with the destination server directly.

The benefits that Cloudflare lists for using a reverse proxy are [15]:

Caching: the proxy can cache and compress commonly or recently accessed data.
SSL Encryption/Decryption: the destination server can offload the encryption and decryption of SSL/TLS traffic to the proxy.
Protection from attacks (DDoS): because bad actors cannot see the actual IP address of the destination server, they cannot target it directly with a flood of requests to bog it down.
Load balancing: lastly, because the public only ever sees the proxy server, we can run multiple servers behind the scenes. As requests come in, the proxy directs traffic to different servers according to server load, geographic location, round-robin, etc. The user never needs to request a different web address because traffic only goes in and out through the proxy.
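Here is a toy Python model of the caching benefit: clients only call the proxy’s `get` method, and repeated requests for the same path are answered from the proxy’s cache instead of reaching the origin. The `ReverseProxy` class, the `origin_server` function and the 30-second time-to-live are assumptions for illustration, not how Cloudflare or Nginx implement caching.

```python
import time
from typing import Callable, Dict, Tuple


class ReverseProxy:
    """Clients talk only to this object; the origin stays hidden behind it."""

    def __init__(self, origin: Callable[[str], str], ttl: float = 60.0) -> None:
        self._origin = origin  # never exposed to clients
        self._ttl = ttl        # seconds a cached response stays fresh (assumed policy)
        self._cache: Dict[str, Tuple[float, str]] = {}
        self.origin_hits = 0   # how many requests actually reached the origin

    def get(self, path: str) -> str:
        now = time.monotonic()
        cached = self._cache.get(path)
        if cached and now - cached[0] < self._ttl:
            return cached[1]   # served from cache; the origin does no work
        self.origin_hits += 1
        body = self._origin(path)
        self._cache[path] = (now, body)
        return body


def origin_server(path: str) -> str:
    """Hypothetical origin that renders a page for a path."""
    return f"<html>content for {path}</html>"


proxy = ReverseProxy(origin_server, ttl=30)
for _ in range(3):
    proxy.get("/popular-story")
print(proxy.origin_hits)  # 1: only the first request reached the origin
```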

What is a load balancer?

Load balancing solutions seek to distribute data requests between clusters of servers. In the olden days a single server could get bogged down by having too many requests in a short time frame [17]. The load balancer was developed to prevent this by providing different strategies to distribute load.

KeyCDN describes these four strategies [17]:

Round-robin: traverse a list of servers, sending requests to each in turn
IP Hash: hash the client (user) IP address to pick a server, so the same client keeps reaching the same server
Least connections: send the request to the server with the fewest active connections
Least response time: send the request to the server with the fastest response time
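The four strategies can be sketched in Python as four small functions over a hypothetical pool of three servers. For comparison, Nginx offers some of these for its `upstream` groups (round-robin by default, plus `ip_hash` and `least_conn`), but the code below is my own illustration, not either tool’s implementation.

```python
import hashlib
import itertools
from typing import Dict, List

SERVERS: List[str] = ["app1", "app2", "app3"]  # hypothetical backend pool

# Round-robin: hand requests to each server in turn.
_rotation = itertools.cycle(SERVERS)


def round_robin() -> str:
    return next(_rotation)


# IP hash: the same client IP always maps to the same server. A stable hash is
# used because Python's built-in hash() of strings changes between processes.
def ip_hash(client_ip: str) -> str:
    digest = hashlib.sha256(client_ip.encode()).digest()
    return SERVERS[int.from_bytes(digest[:4], "big") % len(SERVERS)]


# Least connections: pick the server with the fewest open connections.
def least_connections(open_connections: Dict[str, int]) -> str:
    return min(SERVERS, key=lambda s: open_connections[s])


# Least response time: pick the server with the lowest recent response time.
def least_response_time(avg_response_ms: Dict[str, float]) -> str:
    return min(SERVERS, key=lambda s: avg_response_ms[s])


print([round_robin() for _ in range(4)])                   # ['app1', 'app2', 'app3', 'app1']
print(ip_hash("203.0.113.7") == ip_hash("203.0.113.7"))    # True: sticky per client
print(least_connections({"app1": 12, "app2": 3, "app3": 8}))            # app2
print(least_response_time({"app1": 40.0, "app2": 95.5, "app3": 21.3}))  # app3
```

IP hash trades even distribution for “stickiness”, which helps when a server keeps per-user session data in memory.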

What’s the difference between Nginx and HAProxy?

The key difference between these two tools is that Nginx (pronounced Engine X) is, first and foremost, a web server, though it can also be used as a reverse proxy and load balancer. HAProxy, as its name suggests, is a proxy and load balancer, but it cannot act as a web server.

According to KeyCDN’s analysis of the two tools [17], Nginx can do more than load balancing, but its logging abilities are weak, while HAProxy can’t act as a web server but has strong logging abilities.

Datadog and PagerDuty

PagerDuty

I had never heard of PagerDuty before, but its list of clients is quite long, and no matter how cynical you are about a company’s marketing claims regarding the businesses actually using it at the moment, you have to admit PagerDuty’s list of integrations is impressive.

From what I can tell, PagerDuty looks something like central station monitoring for alarm systems. Basically, it hooks into your services, your logs, and the tools that monitor and aggregate that data, and triggers alerts that dispatch designated staff to respond. There are also integrations with tools like Slack and Jira to send notifications there. It looks like they specialize in the very niche market of bringing together many different tools.
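The dispatching idea, paging an on-call responder and escalating if nobody acknowledges, can be sketched in Python. This is a hypothetical model of an escalation policy with made-up responder names; it does not use PagerDuty’s API, and it replaces real acknowledgement timeouts with a simple set of people who answer.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Alert:
    source: str
    message: str
    acknowledged_by: Optional[str] = None
    notified: List[str] = field(default_factory=list)


def dispatch(alert: Alert, escalation_chain: List[str], answers: Set[str]) -> Alert:
    """Notify responders in order until one of them acknowledges.

    `answers` stands in for the responders who actually acknowledge the page;
    a real system would wait a set time before escalating to the next person.
    """
    for responder in escalation_chain:
        alert.notified.append(responder)  # e.g. a phone call, SMS or Slack message
        if responder in answers:
            alert.acknowledged_by = responder
            break
    return alert


alert = Alert(source="datadog", message="p95 latency above 2s on api servers")
result = dispatch(alert, ["alice (primary)", "bob (secondary)", "eng-manager"], {"bob (secondary)"})
print(result.notified)         # ['alice (primary)', 'bob (secondary)']
print(result.acknowledged_by)  # bob (secondary)
```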

Datadog

One of the leading services for log analysis and monitoring is Datadog, which takes raw log data and gives graphical form to that immense, dense and overwhelmingly complex data and the relationships within it.

Datadog boasts 400+ integrations for metrics and events from tools and services in this arena. By aggregating the data and providing the ability to drill down and visually analyze performance by server, service and user, it allows for automated alerts that go off under very specific circumstances.
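To illustrate what an alert that goes off “under very specific circumstances” might look like, here is a toy threshold monitor in Python: it alerts when the average of the last few samples crosses a limit and reports recovery when it drops back. The window size, threshold and CPU readings are hypothetical, and this is not Datadog’s monitor engine or API.

```python
from collections import deque
from statistics import mean
from typing import Deque


class ThresholdMonitor:
    """Alert when the average over a sliding window exceeds a threshold."""

    def __init__(self, threshold: float, window: int) -> None:
        self.threshold = threshold
        self.window = window
        self.samples: Deque[float] = deque(maxlen=window)  # keeps only recent values
        self.alerting = False

    def observe(self, value: float) -> str:
        """Record a sample and return 'ALERT', 'RECOVERED' or 'OK'."""
        self.samples.append(value)
        if len(self.samples) < self.window:
            return "OK"  # not enough data yet to judge a full window
        breached = mean(self.samples) > self.threshold
        if breached and not self.alerting:
            self.alerting = True
            return "ALERT"
        if not breached and self.alerting:
            self.alerting = False
            return "RECOVERED"
        return "OK"


# Hypothetical CPU readings (percent), one per minute.
cpu = [55, 60, 70, 92, 95, 97, 60, 40, 35]
monitor = ThresholdMonitor(threshold=85.0, window=3)
events = [monitor.observe(v) for v in cpu]
print(events)
```

Averaging over a window means a single spike does not page anyone, while a sustained rise does, and the monitor only fires once per incident instead of on every sample.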

A commonality I see between Datadog and PagerDuty is the focus on integrating with as many SaaS and cloud providers, automation and monitoring tools, VCS and bug tracking systems, and DBMSes as possible. PagerDuty’s Datadog Integration Guide lists the benefits of combining the tools as:

Based on alerts triggered by Datadog, PagerDuty alerts can be triggered to contact specified “on-call responders”.
These PagerDuty alerts can include Datadog’s rich set of visualizations to give more clarity on the issue.
Two-way interaction between the two services keeps the status updated on both systems and indicates the severity level of the alert.

Digging deeper into Datadog’s capabilities, I can see it is a much broader tool than PagerDuty: it not only provides visualizations of the raw server and logging data from its integrations, but also provides ways to analyze it. These include ways to hook into your applications and “debug” them in real time.

Datadog allows organizations to set performance goals, track network and server performance and raise alerts if those goals are not being met. It enables monitoring and analysis of the “end-to-end user experience”, combining backend performance metrics with frontend customer experience metrics like those from Google Analytics, which provides a way to examine how backend performance affects business metrics.

If you think of a car’s odometer as a log telling you how many miles it has been driven, that is roughly how you might look at a server log: data without relationships, and without any view of what’s going on in other parts of the car. If Datadog were available for automobiles, I’d imagine it would hook into every sensor, relaying information about the effects of using gasoline of different octanes and performance under different weather conditions, and letting you see the data visually.

I’ve not used Datadog personally, so I can’t attest to the veracity of their marketing claims. I’ve seen enough tech companies tout products that aren’t quite up to snuff, and propose ML/AI solutions that are really just manual solutions dressed up and advertised well. BUT in this case, the sheer number of large-scale (as well as mid- and small-scale) companies and organizations using Datadog attests that it’s certainly doing something right.

ELK (Elasticsearch, Logstash, Kibana)

My plan was to cover the whole production environment stack from Dan’s article this week, but this article has already gotten quite long. This last part, on Elasticsearch, Logstash and Kibana, is a deep subject, and it looks quite interesting. Next week, I’ll explore it in more depth.

Have a nice weekend! 🌞

[update: This article, part 1B, is now available here: https://medium.com/@dannylee8/exploring-a-tech-stack-part-1b-production-environment-1c29d2185b9f]

🦶🎶:

[1] “Planning the Development, Testing, Staging, and Production Environments.” Contributors. Microsoft. June 8, 2017. https://docs.microsoft.com/en-us/biztalk/technical-guides/planning-the-development-testing-staging-and-production-environments#production-environment. Accessed 7/15/20.
[2] “Waterfall vs. Agile: Which is the Right Development Methodology for Your Project?” Mary Lotz. Segue Technologies. July 5, 2018. https://www.seguetech.com/waterfall-vs-agile-methodology/. Accessed 7/15/20.
[3] “Software Testing Fundamentals.” http://softwaretestingfundamentals.com/. Accessed 7/15/20.
[4] “Guidance for business testing (UAT) during migration.” Contributors. Microsoft. April 4, 2019. https://docs.microsoft.com/en-us/azure/cloud-adoption-framework/migrate/migration-considerations/optimize/business-test. Accessed 7/15/20.
[5] “Start Performance Testing Early.” Ruchi Kansal. IBM. https://www.ibm.com/garage/method/practices/code/practice_performance_testing/. Accessed 7/15/20.
[6] “Setting up a test staging environment with production data.” IBM Knowledge Center. IBM. https://www.ibm.com/support/knowledgecenter/SSYMRC_7.0.1/com.ibm.jazz.install.doc/topics/t_prepare_sandbox_server_rename.html. Accessed 7/15/20.
[7] “Amazon VPC: General Questions.” Amazon. https://aws.amazon.com/vpc/faqs/. Accessed 7/15/20.
[8] “Hypervisors.” IBM Cloud Education. IBM. https://www.ibm.com/cloud/learn/hypervisors. Accessed 7/15/20.
[9] “Administration Guide: Network Isolation.” AWS Documentation. Amazon. https://docs.aws.amazon.com/appstream2/latest/developerguide/network-isolation.html. Accessed 7/15/20.
[10] “Amazon Web Services: Overview of Security Processes.” AWS. Amazon. March 2020. https://d1.awsstatic.com/whitepapers/aws-security-whitepaper.pdf. Accessed 7/15/20.
[11] “Overview: How Ansible Works.” Red Hat Ansible. https://www.ansible.com/overview/how-ansible-works. Accessed 7/15/20.
[12] “Working with Playbooks.” Red Hat Ansible. https://docs.ansible.com/ansible/latest/user_guide/playbooks.html. Accessed 7/15/20.
[13] “Automation: What is Provisioning.” Red Hat. https://www.redhat.com/en/topics/automation/what-is-provisioning. Accessed 7/15/20.
[14] “Use Case: Orchestration.” Red Hat Ansible. https://www.ansible.com/use-cases/orchestration. Accessed 7/15/20.
[15] “What Is A Reverse Proxy? | Proxy Servers Explained.” Cloudflare. https://www.cloudflare.com/learning/cdn/glossary/reverse-proxy/. Accessed 7/16/20.
[16] “What is a Reverse Proxy Service?” NGINX. https://www.nginx.com/resources/glossary/reverse-proxy-server/. Accessed 7/16/20.
[17] “HAProxy vs Nginx: The Case for Both.” KeyCDN. 10/4/18. https://www.keycdn.com/support/haproxy-vs-nginx.