Exploring a Tech Stack : Part 1B — Production Environment

My Journey Into a World-class Website’s Tech Stack

Danny Lee
15 min read · Jul 27, 2020

Last week, I explored the production environment of Medium.com, which D.P. documented in “The Stack That Helped Medium Drive 2.6 Millennia of Reading Time” (2015). The one part I left out was the ELK (Elasticsearch, Logstash and Kibana) stack. The subject was much broader than expected, and my post was already quite long. This week, I’ll finish off my learnings on the Production Environment with:

The Elastic Stack (Elasticsearch, Kibana, Beats and Logstash)

Dan’s article is a bit dated in terms of technology, and 5 years can seem like ages. One recent change to the ELK stack is the addition of Beats. Rather than renaming ELK to ELKb or bELK, Elastic decided to rename the bundle to the elegant-to-the-ear phrase — Elastic Stack.

Dan had mentioned in passing that they had been using Datadog and PagerDuty, but “now heavily use ELK (Elasticsearch, Logstash, Kibana) for debugging production issues.” [1] The more I learn about ELK, the more I imagine this shift has only accelerated.

Elastic has become a powerhouse of analytics and log management. The website StackShare has the ELK stack categorized under Log Management, which feels limiting given the stack’s capabilities. Elastic’s Dave Erickson said “Elastic is a search company” in his (informative) presentation about the Elastic Stack. [2] And Elastic’s YouTube video “What is Elasticsearch?” shares developers’ difficulty in describing what exactly it does. A search engine at heart, but capable of so much more.

Beyond log management, analysis and visualization, this combination of products can also be used for business need analysis and prediction. I’d extend that statement not just to business needs, but any field where large amounts of data need to be captured, filtered, searched through and setup for predictive, outlier and ML analysis which can be visualized through dashboards.

A Look at the Components of the Elastic Stack

Let’s take a look at each of the parts of the Elastic stack. These are:

  1. Elasticsearch — an open-source, distributed, horizontally scaling, resilient search engine using structured, schema-free (JSON) document-style storage accessible via a RESTful API built on Apache Lucene. [3]
  2. Kibana — an open-source front-end providing two sets of functionality. The first, as a visualization tool for data collected and indexed by Elasticsearch. The second, an administrative tool for managing (an) Elasticsearch cluster(s) of node(s) and monitoring health and performance. [4]
  3. Beats — the newest addition to the stack. Made up of a collection of “data shippers”. Light on resources and small in file size, they simplify the importation of data into Elasticsearch. [5] There are eight that come prepackaged with Beats — Filebeat, Metricbeat, Packetbeat, Winlogbeat, Auditbeat, Heartbeat, Journalbeat and Functionbeat. There is also a Community Beats archive which has another 96 user-created Beats.
  4. Logstash — PagerDuty’s integration page for Logstash describes it as: “a powerful pipeline for storing, querying, and analyzing your logs. When using Elasticsearch as a backend data store and Kibana as a front-end reporting tool, Logstash acts as the workhorse. It includes an arsenal of built-in inputs, filters, codecs, and outputs, enabling you to harness some powerful functionality with a small amount of effort.” [6]

As much as I tried to understand and summarize the tools simply and clearly, I failed over and over. There’s a lot of technical terms that aren’t very clear to me and each component of the stack is a complicated, multi-faceted subject.

Let’s take a look at each piece individually and break down some terms.

Elasticsearch

The terms that stumped me here are:

  • Open-source
  • Distributed
  • Horizontally scaling
  • Resilient
  • Search engine
  • Structured, schema-free (JSON) document-style storage
  • RESTful API accessible
  • Built on Apache Lucene

Open-source

There are some implications of being open-source. Not only is it free to try and use, but there’s likely a community that supports and builds on top of it. Being open-source likely contributed to Elastic stack’s widespread adoption and hopefully will continue to fuel growth and evolution of the software.

Earlier, I mentioned Elastic’s Dave Erickson when talking about what Elasticsearch is, and I’ll bring his talk up again, because he clarifies that Elastic is “built for speed, scale and relevance”. This is important because it explains the reasons for making decisions like having Elasticsearch be a distributed and horizontally scaling search engine. The way it stores and indexes its data is made to “provide answers to questions involving petabytes of data in milliseconds to seconds”.

Distributed

Being a distributed search engine appears to mean that instead of a client’s query being served by a single machine, and enduring the bottleneck that will occur while parsing through huge amounts of data, the query goes through many machines, sometimes at the same time (in parallel).

It’s a much more complicated concept, in practice, than I could ever do justice in explaining. However, I did find an example of a parallel SQL query [6] done in Solr that helped me understand the concept a little better. FYI, Apache Solr is an open-source search server (like Elasticsearch) that is built on top of the same search engine as Elasticsearch, Apache Lucene. The example is part of the Solr Reference Guide, specifically the Data Table Tier section (link).

I refreshed the original diagram (figure 1) with my own version of how the SQL handler breaks down the requests, so that each worker is responsible for a portion of the data that is replicated across five different shards. The original image felt a bit cluttered; I hope my version translated well enough and makes as much sense as the original:

Derived image from Figure 1.

What is a shard?

Sharding data is a strategy of horizontally partitioning (dividing) a database’s data into separate units and placing the pieces in different server instances (or nodes) in order to make accessing the data faster and more efficient. Horizontally partitioning the database means dividing the data by rows (vs. columns).

For example, in a database table storing customer information, you might have rows of customers’ names and addresses, sorted by an index of the last names, in alphabetical order. You could shard the data into groups by last name by dividing names starting with A-F in shard 1, G-L in shard 2, M-R in shard 3, and S-Z in shard 4. This decreases the size of each shard’s index (reducing search time) and shrinks each subset of data to a quarter of the original database (each shard is 1/4 the original size).

Another example would be old-school reference encyclopedias, whose subject entries are arranged in alphabetical order. These entries are collected into individual volumes. So, if you wanted to find the subject “Kenya”, you wouldn’t have to look through every volume in order. You skip the volumes that cover entries A to J and go straight to the volume marked “K”. Sharding provides a way to index and more quickly access the data you are looking for.
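To make the sharding idea concrete, here is a minimal Python sketch of the last-name example above. It is purely illustrative (real systems like Elasticsearch typically route by hashing a document ID rather than by value ranges):

```python
# A minimal sketch of range-based sharding by last-name initial,
# mirroring the A-F / G-L / M-R / S-Z example above.

SHARD_RANGES = [
    ("A", "F", 1),
    ("G", "L", 2),
    ("M", "R", 3),
    ("S", "Z", 4),
]

def shard_for(last_name):
    """Return the shard responsible for this last name."""
    initial = last_name[0].upper()
    for low, high, shard in SHARD_RANGES:
        if low <= initial <= high:
            return shard
    raise ValueError(f"no shard covers initial {initial!r}")

print(shard_for("Erickson"))  # 1
print(shard_for("Lee"))       # 2
```

Because each shard’s index only covers a quarter of the names, a lookup only has to search the one shard its range points to.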

Going back to the Solr example [6], the database is not just sharded, but there are also replicas of each shard. In this case, three copies of each shard of data.

When a query is sent to the sqlhandler, it formulates a strategy (a parallel query plan) for using each of its five worker nodes to access 1/5 of the data, with each worker reading from a different replica. The ordered data (tuples) is processed and returned, in parallel with the other nodes, to the sqlhandler, which merges it and returns it to the client with vastly increased speed and efficiency over accessing a single database with a single worker.
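The fan-out-and-merge idea can be sketched in Python. This is my own toy illustration, not Solr’s actual implementation: each worker reads one replica of one shard in parallel, and the handler merges the already-sorted partial results:

```python
# A toy fan-out-and-merge: one worker per shard, each reading a
# replica, with the handler merging the sorted partial results.
from concurrent.futures import ThreadPoolExecutor
import heapq

# Hypothetical data set: each shard holds a sorted slice of the whole,
# and each shard is replicated three times (identical copies).
SHARDS = {
    1: [[1, 4], [1, 4], [1, 4]],
    2: [[2, 5], [2, 5], [2, 5]],
    3: [[3, 6], [3, 6], [3, 6]],
}

def query_shard(shard_id, replica):
    """Each worker reads one replica of one shard."""
    return SHARDS[shard_id][replica]

def parallel_query():
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        # Spread the workers across different replicas of each shard.
        parts = pool.map(lambda s: query_shard(s, s % 3), SHARDS)
        # Merge the already-sorted partial results into one ordered stream.
        return list(heapq.merge(*parts))

print(parallel_query())  # [1, 2, 3, 4, 5, 6]
```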

Horizontally Scaling

I found that sharding and distributed systems are often associated with horizontal scaling. I’m not yet sure if this is always the case or if it’s just a coincidence.

When we talk about ‘scaling’, it could be about a physical server, a virtual server instance, or a node or cluster, which can be a combination of the two.

In addition to horizontal scaling, there is vertical scaling (scaling up), and very generally I’d summarize vertical scaling as being the upgrading of each of your server instances (or nodes). For physical servers (bare metal servers) this means adding additional memory, upgrading the motherboard and/or cpu, increasing storage size or upgrading to faster storage devices. For virtual instances, you can allot more resources (cpu, memory and storage) to each server instance.

The problem with vertical scaling is that you hit a limit in how much you can improve a single system. Costs start to rise dramatically as you begin to require very specialized, difficult-to-find hardware. For example, exceptionally high capacity memory chips, huge storage devices and powerful CPUs. You can only increase virtual server instances so much before you need to upgrade the physical machine hosting those virtual servers.

With horizontal scaling (scaling out), the systems used are much more mainstream in their setup, using commonly available CPU, memory and storage components that are easily sourced. When you need additional capacity you add more servers to balance out the load. This can include creating more virtual instances of servers.

As Bill Wilder explains in his excellent overview of scalability in his book “Cloud Architecture Patterns” [7], the complexity of setting up a horizontally scaling system is somewhat greater than with a vertically scaling one. However, the benefit of investing some upfront implementation cost is that once the system is set up, new servers (or nodes) can be added relatively easily.

Cloud systems like Amazon’s EC2 allow “dynamic scaling” that triggers the instantiation and initialization of new server instances according to a set of scaling policies, which include Target Tracking scaling, Step scaling and Simple scaling [8].

  1. Target Tracking scaling — scales according to a specific metric, such as CPU usage average across all instances. If you have 10 CPUs and the alarm trigger is ‘CPU usage average across all CPUs above 60%’, and this event is triggered, AWS will add additional server(s) so that distributed load falls under the threshold (or whatever response was defined by the administrator). When conditions return to normal, it will remove the server(s).
  2. Step scaling and Simple scaling — Step scaling can be used in addition to Target Tracking to provide additional customization of alarm responses, but in general, Step and Simple scaling react to monitored metrics exceeding or dropping below defined thresholds. A key difference between Step and Simple scaling is that Step can have a graded response depending on how severely a threshold was exceeded, and it also responds if the condition continues to worsen even as a response is being initiated. Simple scaling does not, and has a “cooling off” period before responding to new changes after its initial response to a threshold alarm.

For example, imagine a viral event (YouTube video, Medium post, Facebook Live, etc.): traffic starts to flow to the server, and the system has a scaling policy that monitors the request count per server. When log data shows that requests have moderately exceeded their allotted limits, under a Simple scaling policy this would trigger the addition of (let’s say) another 8 server instances. While those servers are being loaded and turned on, the request counts quintuple; after the 8 additional instances are brought online, the system remains (heavily) overloaded, but must wait for the cooldown to expire before responding again. With a Step scaling policy, the initial response might be the same, but as the requests continue to grow, the policy would recognize this and bring up enough additional servers to pull the monitored metric back within the defined threshold parameters. That is my general understanding from the docs available [8] 😀.
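My mental model of the two policies can be put into a toy Python sketch. The threshold, step sizes and the fixed “+8 instances” response are invented for illustration and are not AWS’s actual algorithm:

```python
# A toy model of Simple vs Step scaling. The numbers are made up.

THRESHOLD = 60.0  # e.g. average request count or CPU % per server

def simple_scale(metric, in_cooldown):
    """Fixed response, suppressed during the cooldown period."""
    if in_cooldown or metric <= THRESHOLD:
        return 0
    return 8  # always the same adjustment

def step_scale(metric):
    """Graded response: the bigger the breach, the bigger the adjustment."""
    excess = metric - THRESHOLD
    if excess <= 0:
        return 0
    if excess <= 10:
        return 4
    if excess <= 30:
        return 8
    return 16  # a severe breach gets a much larger response

print(simple_scale(95.0, in_cooldown=True))  # 0: stuck waiting out cooldown
print(step_scale(95.0))                      # 16: scales with severity
```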

Resilient

The sharding and replication of data and the horizontal scaling of servers increase the overall resiliency of the system. By implementing a system of multiple horizontally-scaling (identically configured) server instances and then sharding and replicating the data strategically, we can greatly increase the ability of a server to maintain 100% uptime and performance levels that meet our expectations.

If we shard our data and place each shard on a different server, along with replicas of shards from other servers, we can ensure data redundancy. In the case of virtual cloud-based servers, if a server goes down, we can instantiate a new instance and clone the data from the copies we have on other nodes:
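The placement idea can be sketched in a few lines of Python. This is purely my own illustration (Elasticsearch’s real shard allocator weighs many more factors); `place` is a hypothetical helper that simply puts each shard’s replica on a different node than its primary:

```python
# Place each shard's replica on a different node than its primary,
# so losing any single node never loses a shard entirely.

def place(shards, nodes):
    layout = {node: [] for node in nodes}
    for s in range(shards):
        primary = nodes[s % len(nodes)]
        replica = nodes[(s + 1) % len(nodes)]  # always a different node
        layout[primary].append(f"shard{s}-primary")
        layout[replica].append(f"shard{s}-replica")
    return layout

layout = place(3, ["node-a", "node-b", "node-c"])
# If node-a dies, shard0's replica still lives on node-b,
# so a fresh instance can be cloned from it.
print(layout)
```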

Thought break

So, 2000 words into this article and I’ve barely even examined Elasticsearch, let alone the other parts of the stack. I’ve practically forgotten their names, but I scrolled up and took a look and they are Beats, Logstash and Kibana.

Learning for me is rarely linear, moving from point A to point B with X amount of time translating into Y amount of learning. Instead it is what I call organic, extending out into the darkness of the soil of knowledge, branching out occasionally to follow an idea or aspect of understanding until it feels beyond scope, before backtracking to a main branch and continuing forward, with an ever growing number of side roads, marked and to be wandered at a future time.

Let’s try and speed this up, shall we? 😊

The last few terms I wanted to look at were:

  • Search engine
  • Structured, schema-free (JSON) document-style storage
  • RESTful API accessible
  • Built on Apache Lucene

Elasticsearch being a search engine (vs a database server) and being built on Apache Lucene goes back to Dave Erickson’s quote that Elastic is “built for speed, scale and relevance”. It’s good at indexing data and retrieving full-text searches back from that data, but it does not do “joins”, as Dave warns us.

The fact that it’s schema-free means we are not bound to pre-defining a particular structure for our incoming data; instead, Elasticsearch will make “some educated guessing to infer its type” when it’s presented with JSON documents [9]. Add on RESTful API accessibility and we have a combination that makes Elasticsearch easy both to implement and to interact with.
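As a rough sketch of that “educated guessing”, here is a small Python illustration in the spirit of Elasticsearch’s dynamic mapping. The rules below are simplified assumptions on my part, not Elastic’s exact algorithm:

```python
# Infer an (approximate) Elasticsearch field type for each value
# in an incoming JSON document, with no schema defined up front.
import json
from datetime import datetime

def infer_type(value):
    if isinstance(value, bool):  # check bool before int: bool is an int subtype
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, str):
        try:
            datetime.fromisoformat(value)  # looks like a date?
            return "date"
        except ValueError:
            return "text"
    return "object"

doc = json.loads('{"title": "Hello", "views": 42, "published": "2020-07-27", "draft": false}')
mapping = {field: infer_type(value) for field, value in doc.items()}
print(mapping)
# {'title': 'text', 'views': 'long', 'published': 'date', 'draft': 'boolean'}
```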

What is Elasticsearch used for?

Elastic’s own list of use cases include:

  • Logs — Easily import and index a ton of log file formats out-of-the-box. After indexing, they can be visualized using Kibana.
  • Metrics — Track metrics from servers, virtual containers (Docker, Kubernetes, etc) and applications, drill down into data and follow user journeys…
  • APM — Application performance monitoring keeps track of all the different services that need to work together to make your application work. Elasticsearch facilitates ‘distributed tracing’, allowing you to follow a transaction through the different parts of your infrastructure and see where the bottlenecks are. By integrating machine learning, anomalies within the service structure can be pinpointed and improved, outlier conditions can be identified and explored.
  • Uptime — Track and monitor uptime performance for hosts, services, websites, and APIs.
  • Site Search — Build internal search for websites.
  • App Search — Search any kind of data and integrate with apps. A commonly cited example of this use case is Uber using Elasticsearch to power its app’s drivers-to-user matching using geo search.
  • Workplace Search — Searching internal company documents (Drive, Salesforce, wikis, etc).
  • Maps — Visually analyzing geospatial data, adding the “where” to the data provided through your logs. Where are your servers failing, where is website traffic coming from, what are the outlier cases and anomalies?
  • SIEM — Security Information and Event Management, by funneling log files of interactions with your systems, and utilizing machine learning to identify unusual activity, monitoring for attack patterns, and gaining, as Elastic calls it, a “holistic” overview of your environments.
  • Endpoint — Related to SIEM, this area focuses on the prevention of specific attacks either to end-users or attacks to systems using malware. Utilizing machine learning to understand what an allowed “action” is and prevent “non-allowed” actions and reporting them.

Kibana

Earlier I described Kibana as being:

  1. A visualization tool for the data imported through Beats (and/or Logstash) and indexed by Elasticsearch.
  2. An administrative tool for the stack.

And this is true, but the one thing I have come to understand about everything I’ve learned about the Elastic stack so far is just how incredibly deep and complex the tools are.

Just take a look at Elastic’s list of Kibana’s features.

In addition to preconfigured dashboards for visualizing common data like web server log files, “infrastructure modules” like Docker or Kubernetes, and DBMSes, dashboards can be made to visualize any of the data (and the inter-relationships between data) indexed by Elasticsearch.

Permissions can be assigned to limit access to particular dashboards to certain users using RBAC (role-based access control). Alerting can be triggered when incoming data falls outside defined parameters, with notifications sent via email, Slack, PagerDuty or webhooks. Data can be presented using the Canvas tool, which allows PowerPoint-like presentations built on continually updated (live) data.

In terms of system management, Kibana collects these settings together under “Stack Management”. This section of the UI gathers the controls for how data is ingested (through Beats and/or Logstash) and processed; how search indexes are created, refreshed, flushed, aged, snapshotted and restored, if necessary; the number and types of alerts and their corresponding actions; and the security of the stack, including user management, role management and API keys for RESTful API access.

Beats

Beats is the newest addition to the Elastic stack and is a collection of preprocessors, of sorts, for Logstash. According to Daniel Berman of Logz.io [10], the impetus for building Beats was the desire to offload some of the processing tasks that Logstash once handled on its own. Complicated pipelines of data would create performance issues (bottlenecks) in the aggregation and importation of data.

There are eight prepackaged in the stack:

  • Filebeat — handles log files, with a system to handle “backpressure”: if Logstash becomes overloaded, Filebeat lowers the rate at which it sends events until Logstash catches up.
  • Packetbeat — handles network packets (e.g. HTTP), capturing traffic, latency and error data.
  • Metricbeat — collects metrics like server CPU usage, memory, disk and network statistics for your nodes.
  • Auditbeat — provides data from auditd, which has kernel level access to what’s going on in your Linux system. It watches system calls, file access and a number of preconfigured audit fields. This is incredibly useful in analyzing “incidents” and security monitoring.
  • Heartbeat — monitors uptime and response time data. An easy way to quickly implement a solution to make sure your sites are always up and running.
  • Winlogbeat — provides data from Windows event logs and can forward application, hardware, security and system events to Elasticsearch and Kibana. Again, this is useful for security and activity analysis and monitoring. A comprehensive list of exported fields is available in the Elastic Winlogbeat reference guide.
  • Journalbeat — forwards data from the systemd journals. “systemd is a suite of basic building blocks for a Linux system. It provides a system and service manager that runs as PID 1 and starts the rest of the system.” [11] Its logging capabilities make it a great entry point to see what’s going on in a Linux system at all levels.
  • Functionbeat — is a data streaming pipeline for logs coming from serverless web architectures using FaaS (Function-as-a-Service) platforms.
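The Filebeat “backpressure” behavior mentioned above can be illustrated with a toy Python sketch. This is my own simplification, not Beats’ actual protocol:

```python
# A shipper that halves its batch size when the downstream signals
# overload, and ramps back up as the downstream recovers.

class Shipper:
    def __init__(self, batch_size=64):
        self.batch_size = batch_size

    def send(self, downstream_overloaded):
        if downstream_overloaded:
            # Back off until Logstash catches up.
            self.batch_size = max(1, self.batch_size // 2)
        else:
            # Recover gradually toward the original rate.
            self.batch_size = min(64, self.batch_size * 2)
        return self.batch_size

shipper = Shipper()
print(shipper.send(downstream_overloaded=True))   # 32
print(shipper.send(downstream_overloaded=True))   # 16
print(shipper.send(downstream_overloaded=False))  # 32
```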

As I mentioned in the brief summary earlier, there are user-created Beats available through the Community Beats archive which cover specific use cases: for example, amazonbeat (Amazon products), cloudflarebeat (Cloudflare logs), githubbeat (GitHub repo activity), twitterbeat (specific user tweet tracking) and many more.

Logstash

The last piece of the puzzle, and the second part of the “Ingestion” portion of the Elastic Stack’s workflow, is Logstash. Whereas Beats take in data, process each line of a log file and pass it along (which works just fine for Elasticsearch), Logstash can take that raw data and further transform it to the needs of the user.

Unstructured data can be turned into structured data, geographic locations can be calculated from IP addresses, and fields can be manipulated, removed or recalculated as needed by Logstash.
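To get a feel for that kind of transformation, here is a small Python sketch in the spirit of Logstash’s grok filter: it turns an unstructured access-log line into a structured event and recalculates fields. The log line and pattern are made up for illustration:

```python
# Parse an unstructured access-log line into a structured event,
# then recalculate fields: the kind of work a grok filter does.
import re

LINE = '93.184.216.34 - - [27/Jul/2020:10:15:32 +0000] "GET /index.html HTTP/1.1" 200 1024'

PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]+" (?P<status>\d+) (?P<bytes>\d+)'
)

def parse(line):
    match = PATTERN.match(line)
    if match is None:
        raise ValueError("line did not match pattern")
    event = match.groupdict()
    # Field manipulation, as a Logstash filter stage would do:
    event["status"] = int(event["status"])
    event["bytes"] = int(event["bytes"])
    return event

print(parse(LINE)["path"])  # /index.html
```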

Logstash comes with a healthy set of plugins to process many types of events directly, such as beats events, cloudwatch events, files, google cloud storage, kafka, redis and network data like tcp, snmp, udp, irc and websocket events.

The Logstash pipelines themselves have logging and provide monitoring and analysis of incoming streams of data allowing analysis of choke points and performance over time.

Conclusion

I can’t say my coverage of this topic is by any means comprehensive and there’s a lot of areas where my research barely scratched the surface. However, I feel I’ve gained some understanding of the functionality and use cases for the parts of the Elastic stack and see the role that it could play in the building, maintenance, stability, security and analysis for improvement of a world-class website. The capability of the stack to delve deeply into the deepest recesses of log data and the integration of machine learning to spot anomalies and bottlenecks would be highly useful in continually refining the functionality of a production server.

🦶🎶:

  1. “The Stack That Helped Medium Drive 2.6 Millennia of Reading Time”. Medium.com. 10/23/15. https://medium.engineering/the-stack-that-helped-medium-drive-2-6-millennia-of-reading-time-e56801f7c492. Accessed 7/24/20.
  2. “Keeping Up with the ELK Stack: Elasticsearch, Kibana, Beats, and Logstash”. Amazon AWS Public Sector Summit. June 21, 2019. Video, 5:00. https://youtu.be/DRQJHOOstY0?t=300.
  3. “Elasticsearch”. Knowledge Focus. https://kfocus.co.za/elasticsearch/. Accessed 7/24/20.
  4. “What is Kibana?”. Elastic. https://www.elastic.co/what-is/kibana. Accessed 7/24/20.
  5. “A Beats Tutorial: Getting Started”. Daniel Berman. Logz.io. 2/25/20. https://logz.io/blog/beats-tutorial/. Accessed 7/24/20.
  6. “How Parallel SQL Queries are Distributed”. Solr Reference Guide 6.6. Apache Software Foundation. 06/09/17. https://lucene.apache.org/solr/guide/6_6/parallel-sql-interface.html#data-table-tier. Accessed 7/26/20.
  7. “Chapter 1: Scalability Primer”. Bill Wilder. O’Reilly Books. https://www.oreilly.com/library/view/cloud-architecture-patterns/9781449357979/ch01.html. Accessed 7/26/20.
  8. “Dynamic Scaling for Amazon EC2 Auto Scaling”. Amazon AWS. https://docs.aws.amazon.com/autoscaling/ec2/userguide/as-scale-based-on-demand.html. Accessed 7/26/20.
  9. “Elasticsearch as a NoSQL Database”. Alex Brasetvik. Elastic. 09/15/13. https://www.elastic.co/blog/found-elasticsearch-as-nosql. Accessed 7/26/20.
  10. “A Beats Tutorial: Getting Started”. Daniel Berman. Logz.io. 02/25/20. https://logz.io/blog/beats-tutorial/. Accessed 7/26/20.
  11. “systemd”. 07/20/20. Archlinux Wiki. https://wiki.archlinux.org/index.php/systemd. Accessed 7/26/20.


Danny Lee

a nyc based coding enthusiast, handyman, farmer at-heart, artist and future surfer