Stories by Frederik Banke on Medium

Switching from GlusterFS to Amazon EFS

Frederik Banke — Fri, 19 Apr 2019 06:44:53 GMT

This article was originally published on www.datadriven-investment.com

My current hosting setup uses GlusterFS to create a shared filesystem that all my Docker hosts can use. The shared filesystem is used to host the upload folder in my WordPress installations.

The cool thing about GlusterFS is that it is a true cluster file share. It is installed on all nodes and makes all files available everywhere. So if any node crashes it does not affect the others.

But this level of redundancy comes with a price. It is more difficult to maintain, and it increases the needed storage space as the number of nodes increases.

So I opted for the easy solution and switched to EFS.

You can read more about the hosting setup here. One problem with GlusterFS is that it needs to be installed on the docker host machine. It makes it much more difficult to scale the setup because we must take extra care when booting up a new host.

I would like my docker hosts to be 100% generic because that allows me to switch my setup to run inside ECS.

If we can replace GlusterFS with a generic NFS share, then we can remove this dependency. Lucky for me, Amazon provides their EFS system that allows us to create an NFS file share that we can mount directly into our Docker containers, removing all special configuration on the host machine.

EFS does have some drawbacks and should not be used for anything that is performance critical. The files in my NFS share consist of two types.

Media files, uploaded to WordPress
PHP files for the installed plugins

The media files will be cached and served by a CDN, so they should only be hit a few times on the disk.

The PHP files are hit every time we need to generate a page view. Running application code files from any network attached storage is a bad idea. A few notes here and here. The problem is that NFS is not built to be a low latency file system, making it unsuited to host application files.

Since the files do not change that often, we benefit from a read cache. If we install cachefilesd it should give us quite a boost. In the future, it would be better to bake the PHP files into the Docker image, but that is an exercise for later.

Setting up EFS

EFS is made to be maintenance free. There are almost no options to choose from. It creates one filesystem with “infinite” space, and it can be mounted using standard NFS tools.

A wizard takes you through the whole process as shown below. It is surprisingly simple.

When it is created you can mount it on your Linux machine with the standard mount command.

sudo mount -t nfs4 -o nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport fs-xxx.efs.eu-west-1.amazonaws.com:/ efs

When it is mounted we can copy all files we need access to into the new mount point.

Docker and NFS

The standard local volume driver in Docker supports NFS. That makes it easy to configure the volumes. The snippet below is from the Docker swarm configuration file. At the bottom, we define the volumes with a connection to NFS; setting them up on the services is standard.

version: "3.2"

services:
  php:
    image: 637345297332.dkr.ecr.eu-west-1.amazonaws.com/patch-php-fpm:latest
    build: php-fpm

    deploy:
      mode: global

    volumes:
      - wp_core:/var/www/datadriven-investment.com/:ro
      - datadriven_investment:/var/www/datadriven-investment.com/wp-content
  
volumes:
  wp_core:
    driver: local
    driver_opts:
      type: nfs
      o: addr=fs-xxx.efs.eu-west-1.amazonaws.com,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport
      device: fs-xxx.efs.eu-west-1.amazonaws.com:/patch_wp-core/_data/

  datadriven_investment:
    driver: local
    driver_opts:
      type: nfs
      o: addr=fs-xxx.efs.eu-west-1.amazonaws.com,nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noresvport
      device: fs-xxx.efs.eu-west-1.amazonaws.com:/patch_datadriven-investment-data/_data/

Pricing

With GlusterFS an Amazon EBS partition is needed for each host. With EBS the price is per provisioned GB of storage per month. In my case, 2 x 8 GB priced at $1.76 per month.

With EFS the pricing is based on the actual space used. My current space usage is 1.5 GB, which costs $0.495 per month. A small saving, but it will be more expensive as my storage requirements grow.
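To see where the saving turns into an extra cost, the two figures above can be turned into per-GB rates. The sketch below is only arithmetic on the numbers quoted in this article; the rates are derived from them and are not official AWS pricing, and the function names are made up for illustration.

```python
# Per-GB monthly rates implied by the figures quoted above
# (assumption: derived from this article, not taken from an AWS price list).
EBS_RATE = 1.76 / 16     # 2 x 8 GB provisioned for $1.76
EFS_RATE = 0.495 / 1.5   # 1.5 GB used for $0.495


def ebs_cost(hosts: int, provisioned_gb_per_host: float) -> float:
    """EBS bills for the provisioned space on every host."""
    return hosts * provisioned_gb_per_host * EBS_RATE


def efs_cost(used_gb: float) -> float:
    """EFS bills once, and only for the space actually used."""
    return used_gb * EFS_RATE


def break_even_gb(hosts: int, provisioned_gb_per_host: float) -> float:
    """Used space at which EFS costs the same as the EBS setup."""
    return ebs_cost(hosts, provisioned_gb_per_host) / EFS_RATE


print(f"EBS: ${ebs_cost(2, 8):.2f} per month")
print(f"EFS: ${efs_cost(1.5):.3f} per month")
print(f"Break-even at {break_even_gb(2, 8):.2f} GB used")
```

With these rates EFS stays cheaper than the two 8 GB EBS partitions until a little over 5 GB is stored.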

Final remarks

It was surprisingly easy to set up EFS and connect it to Docker. Thumbs up for that!

Some additional tips for EFS

Improving scalability in C# using Async and Await

Frederik Banke — Sat, 02 Mar 2019 09:35:31 GMT

Learn how to improve performance through better scalability with the async and await keywords. When is asynchronous code able to help us, and when will it make our performance worse?

We want good performance in our software, and we can get good performance in two ways. 1. Refactor the software to do its tasks faster. 2. Improve the software's scalability so it can handle more concurrent tasks.

Asynchronous code targets scalability making us use the resources of our servers better.

The inspiration for this article comes from the talk given by Maarten Balliauw.

What is asynchronous code good for?

Most programs consist of single-threaded synchronous code. That causes the program to pause when it waits for disk or network. Only when the disk or network has responded does the program continue processing.

It works great if we have a single task to complete. But if we have a server that executes many tasks in parallel, we need to consider scalability. This is where asynchronous code can help.

Asynchronous programming can in some cases help with performance by parallelizing a task. But, that is not its main benefit in day to day development.

Instead, the main benefit comes from making our code more scalable. The scalability of a system relates to how it handles a growing amount of work and its potential to scale to accommodate that growth.

We can scale in two ways: horizontally and vertically.

Horizontal scaling

If our system supports this scaling type, we can add more servers if we need to scale our service. The difficulty with scaling this way is that it has a few key constraints. Each server lives without knowing about the others. Because of that, we can only scale if the service is stateless.

When session data is part of the system, it makes scaling horizontally much harder. It increases the complexity of the solution, and we have a bottleneck with a shared cache or database.

A service built from RESTful principles is easy to scale horizontally.

We can use horizontal scaling no matter if we have synchronous or asynchronous code. There is no difference.

Vertical scaling

Another way of scaling is to improve how many requests our server can handle. Every single request will most likely not perform any better. But, we can increase the number of concurrent requests the server can handle.

When we scale vertically, we are not limited by the properties of RESTful because we are on a single server. Often vertical and horizontal scaling are used together, which makes REST a sound set of principles to follow.

How does async work?

We go through a few examples and explain how both synchronous and asynchronous code work. We follow a request from the moment it arrives at the server until the reply is sent back.

Synchronous requests

Inside our C# application, no matter if it is a .NET Core or .NET application, is a thread pool. When a request hits the service, a thread is drawn from the thread pool.

This thread handles all processing needed until the response is sent. If we need to make a slow database query as part of the request, the thread pauses. Nothing happens before the database query is complete.

Let us assume this scenario: we need to make a database query that takes a long time to complete. As long as we have available threads, we can allocate a new thread for every incoming request. But as soon as we hit the maximum number of threads, we reject incoming requests.

Most of the threads are waiting for the database query to complete. We end up with a server that is doing little except waiting, yet it is rejecting requests. Not a good situation, and a bad underutilization of resources.

Asynchronous requests

With asynchronous code, we can improve the situation quite a bit.

A request starts in the same way by fetching a thread from the thread pool. But now we have marked our database query method as async. It instructs the .NET runtime that we expect the method call to take a while to complete. Now the thread is freed and put back in the thread pool.

When the database query completes, a thread is allocated from the thread pool. This thread allows us to continue the request.

Now we only have threads allocated that have work to do. When we need to wait for IO, we allow the threads to continue with other tasks. We are not guaranteed to get the same thread back. But that is not a problem since the .NET runtime handles the details.

Using asynchronous code does not give an increased performance on a development machine. The reason is that there is not enough load to see the overall benefit. But for a production environment, it can process more requests.

Is asynchronous code worth it?

Later we go through how easy it is to implement. But first, let us look at some numbers and see if it can make sense to use asynchronous code in the first place.

I took this example from this article. Let us assume we have a 16 core server and a thread pool with 1 thread per core, 16 threads. If we create 1000 requests to the service, the 16 threads are able to handle all 1000 requests.

Let us assume that some lousy programmer put a Sleep(200) as the first code on each request. We would assume that each request would be 200 ms slower, but actually, it is much worse!

The first 16 threads block for 200 ms while processing the request. When they complete their requests, the next 16 requests are handled and block for another 200 ms. Now the average response time goes up to about 6 seconds for 1000 requests.
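The arithmetic behind that number can be checked with a few lines of code. This is a simplified model, assuming that requests are served in waves of 16 and that the blocking call is the only cost of a request; the function name is made up for illustration.

```python
import math


def average_response_ms(requests: int, threads: int, block_ms: int) -> float:
    """Average completion time when every request blocks its thread for block_ms.

    Requests are served in waves of `threads`; wave k finishes at k * block_ms.
    Any processing time besides the blocking call is ignored.
    """
    waves = math.ceil(requests / threads)
    total = 0
    for wave in range(1, waves + 1):
        # The last wave may be only partly filled.
        served = min(threads, requests - (wave - 1) * threads)
        total += served * wave * block_ms
    return total / requests


print(average_response_ms(1000, 16, 200))  # 6350.4
```

The last request finishes after 63 waves, roughly 12.6 seconds, even though every single request only sleeps for 200 ms.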

But the resource allocation on the server is low, not indicating any problem. A small amount of blocking code can cause huge problems with throughput.

Since a thread needs an actual CPU to run, we cannot increase the number of threads to thousands or millions. It causes too much congestion on the CPU to be useful.

Generally, you want less than 1000 of them(threads), and preferably < 100

Diagnosing .NET Core ThreadPool Starvation

Asynchronous code is not just worth it; it is essential for high performing services!

Asynchronous coding

Now to the exciting part, how do we implement asynchronous code in our service?

C# comes with two built-in keywords: “async” and “await.” They go hand in hand to allow us to easily implement asynchronous code.

When we mark a method with async, it does two things: we can use await inside it, and the compiler transforms the method into a generated state machine.

Notice the async keyword does not make anything async. That is the purpose of the await keyword. Await tells the compiler that the async method cannot continue before the awaited operation has finished.

We can only await awaitable operations, typically methods that return a Task. If an async method does not have any await inside, it runs as a normal synchronous method.

In this example, the call to the filesystem goes through the method “GetDataAsync”. The async and await keywords allow the thread to return to the thread pool. When the call returns, a new thread is allocated to continue processing.

In the code example, the return type is Task<string> instead of string. Async methods can return either Task or Task<T>. A Task represents the execution of the method and is a state machine.

The task can be in a few different states, exposed through the properties IsCanceled, IsCompleted, and IsFaulted, and in more detail through the Status property.

It is the state machine that allows the .NET runtime to make it easy for us to implement asynchronous code. For more details, I recommend looking at C# In a Nutshell, chapter 14.

We can use the return types Task and Task<T> without the async keyword, but then the method runs synchronously. It is bad practice because it reveals implementation details of our method.

Asynchronous code in the bigger perspective

The example above shows how to apply async to a single method, but that does not show us how to use it in a larger system.

We must apply async through the layers. Most applications pass a request through different layers to build the response. We need to have async where the IO happens, so we should start bottom up.

Let us set up an example. We have an ASP.NET controller with an action to get a user based on its user id. The controller passes the request down through a service layer. In turn, it requests the user from a repository that fetches it from a file on disk.

It is the last part, fetching from disk, that causes IO that we want to have async. Let us see some code.

The IO is in the GetData method. It is where we want to add async code. We create the GetDataAsync method, add the async keyword, and change the return type to Task<string>. We also need to change it to read the file using a StreamReader so we can await it.

In the repository code, we also need to make changes to consume the new async method.

In the repository, we can get away with making the method async and adding the await keyword on the GetDataAsync method call.

We continue up the chain to the service layer and add the async code.

Finally, we add async to the controller.

Now we have added async to the whole stack down to the actual IO. It allows the .NET runtime to release the thread to the thread pool while we wait for the disk to fetch data.
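The same layering can be sketched in a few lines. The article's examples are in C#; the sketch below uses Python's asyncio purely as an analogy, with hypothetical function names and a simulated disk read. One difference to keep in mind: asyncio multiplexes requests on a single event-loop thread rather than handing threads back to a thread pool, but the shape of the code, async and await from the controller down to the IO, is the same.

```python
import asyncio
import time

FAKE_DISK = {1: "Alice", 2: "Bob"}  # hypothetical stand-in for the file on disk


async def get_data_async(user_id: int) -> str:
    """Data layer: the only place that actually waits for IO."""
    await asyncio.sleep(0.05)  # simulated disk latency
    return FAKE_DISK[user_id]


async def repository_get_user(user_id: int) -> dict:
    name = await get_data_async(user_id)
    return {"id": user_id, "name": name}


async def service_get_user(user_id: int) -> dict:
    return await repository_get_user(user_id)


async def controller_get_user(user_id: int) -> dict:
    return await service_get_user(user_id)


async def handle_many(count: int) -> float:
    """Serve `count` requests concurrently and return the elapsed seconds."""
    start = time.perf_counter()
    await asyncio.gather(*(controller_get_user(1 + i % 2) for i in range(count)))
    return time.perf_counter() - start


elapsed = asyncio.run(handle_many(20))
print(f"20 requests took {elapsed:.2f} s")
```

Because no layer blocks while waiting, the 20 requests overlap and finish in roughly the time of one simulated disk read instead of twenty.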

We should not mix synchronous and asynchronous code without careful considerations. We should strive for async all the way.

We could use “.Result” or “.Wait()” on the task object to call asynchronous code from synchronous code. It is a bad idea since it causes the thread to block until the asynchronous code returns, defeating the whole purpose of adding async in the first place.

Which work should we use async for?

Async is not the solution to all thread blocking code. Threads block if we call a method that performs some heavy computation or if it needs to wait for IO. It is only in the case of IO that we get any gain from using async.

IO bound

Any task that accesses the file system or any external service is a good candidate for async.

Computational bound

Any algorithm that takes a long time and is CPU or memory intensive is a bad candidate for async.

Examples

Adding an entity using Entity Framework: IO happens when we call save, not when we initially add the entity. It is not a good candidate for using async. Entity Framework supplies us with an async add method, but we gain nothing from using it.

Entity Framework's SaveChangesAsync is a good candidate for async. It does IO to the database.

Integrating async with legacy code

What if we need to integrate our code with code that is not async aware and that we cannot change? The old code is not directly awaitable. But we can cheat and make it awaitable by creating a task. Some extra discussion from Stack Overflow.

As discussed before, it is not optimal to create a task in synchronous code. It causes the main thread to block until the task returns. But the code is still correct, just not as scalable. It can be done as shown in this example:
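As an analogy to wrapping legacy code in a task, here is a minimal Python asyncio sketch. It assumes Python 3.9 or newer for asyncio.to_thread, and legacy_lookup is a hypothetical blocking function standing in for the code we cannot change. A worker thread is still blocked while the legacy call runs, so this buys compatibility, not scalability.

```python
import asyncio
import time


def legacy_lookup(user_id: int) -> str:
    """Hypothetical blocking legacy call that we cannot change."""
    time.sleep(0.05)  # blocks the thread it runs on
    return f"user-{user_id}"


async def lookup_async(user_id: int) -> str:
    # Run the blocking call on a worker thread so callers can await it.
    return await asyncio.to_thread(legacy_lookup, user_id)


async def main() -> list:
    # Both lookups run at the same time, each on its own worker thread.
    return await asyncio.gather(lookup_async(1), lookup_async(2))


print(asyncio.run(main()))
```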

Closing remarks

I hope this overview of asynchronous code gives you some ideas to try out in your own code.

Happy coding!

Book Review: Domain Driven Design

Frederik Banke — Fri, 01 Feb 2019 08:29:16 GMT

510 pages, published in 2004. The book is quite old for a tech-book but it has aged exceptionally well. It only shows age in a few unimportant areas in my opinion.

The concepts presented in the book have revolutionized me as a developer! Read it multiple times and with great attention to internalize all the knowledge it contains; it will be worth it. To really understand the concepts and integrate them, you should also build a project as you read the book where you implement the examples.

I cannot recommend this book enough! Below are my notes from reading the book. They are partly a summary of each chapter and partly my own thoughts about the content.

Chapters 1–3

Ubiquitous language is the most important concept when running a project as domain driven. It is essential to make it easy for developers to talk to the domain experts. A way to do this is to build prototypes for the domain experts to evaluate.

The core learning is to create a forum where the developers and domain experts can learn from each other while collaborating to create the ubiquitous language.

Developers are good at creating software, but the software can only be as good as the understanding of the problem area. If the domain is complex, the domain experts in most cases have implicit knowledge that they cannot easily communicate. By creating simple prototypes and discussing the domain model, it should be possible to bridge this gap.

It is essential to keep the ubiquitous language in sync with the domain model. If one changes, the other should also change. A simple example of this: if we agree on a name for a concept, we naturally start using this name in the code for class names and variables. If we later find a name that describes the concept better, the code must change to reflect this. If we do not change the code, any developer looking at it later might not connect the erroneous name with the domain concept.

We cannot ask the domain experts to tell us what they want the software to do. It will often lead to either too detailed technical descriptions or too abstract concepts.

Starting up the discussions can be difficult because progress in the start will be slow until the participants get tuned in to each other.

The initial domain model, which might take several meetings to flesh out, should result in a prototype. We prototype because it allows us to see if we as developers understand the domain correctly, that is, if we are able to connect the details in the domain in a program that outputs something useful. This process is iterative.

Chapter 4, Isolating the Domain

A system consists of many responsibilities: UI, database, business logic and so on. It is important to isolate the domain-related code from other concerns. If the business logic is diffused throughout the system then it eventually becomes harder and harder to change anything.

One way to do this is to apply layers. This is standard practice and keeps the domain objects free of clutter. It is important not to get caught in the trap of depending on a framework unless it is needed. It can become too heavyweight and make the domain difficult to understand.

Not all projects benefit from a domain driven approach. In some cases what you need is a “Smart UI.” If the business rules are all simple and there is no future need for supporting complex business rules, then the overhead of the domain-driven approach might be too high. In this case, you could put all the logic into the UI without much worry. Remember that the approach should fit the task.

Chapter 5, A Model Expressed In Software

A domain model consists of the basic elements ENTITIES, VALUE OBJECTS, SERVICES, and their associations.

In real life, there are lots of many-to-many relations. However, they are not always helpful or easy to work with inside software. Often there are ways to communicate the intent of the association better by adding constraints.

For example, the relationship between country and president is bidirectional and one-to-many. But one of the directions is more significant because we most often start with the country, and not with the president.

We can qualify this even more by adding a constraint like period on the association so we can ask, “who was the president of the USA in 1998”.

We can make associations easier to work with by adding different constraints, such as imposing a traversal direction, adding a qualifier to reduce multiplicity, or eliminating associations that are not strictly needed.
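The country and president example can be sketched as a qualified, one-directional association. This is a minimal illustration, not code from the book; the class and method names are made up, and the year-based qualifier treats the start year as inclusive and the end year as exclusive to keep handover years unambiguous.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Term:
    president: str
    start_year: int  # inclusive
    end_year: int    # exclusive (simplification for handover years)


@dataclass
class Country:
    name: str
    terms: list = field(default_factory=list)

    def president_in(self, year: int) -> Optional[str]:
        """Qualified association: country + year gives at most one president."""
        for term in self.terms:
            if term.start_year <= year < term.end_year:
                return term.president
        return None


usa = Country("USA", [
    Term("Bill Clinton", 1993, 2001),
    Term("George W. Bush", 2001, 2009),
])
print(usa.president_in(1998))  # Bill Clinton
```

Traversal only goes from country to president, and the qualifier turns the one-to-many relation into a lookup that returns a single answer.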

Entities: An entity is defined not by its attributes but by a thread of continuity and identity. Even if you have the same name and age as another person you still have unique identities.

An identity is the same across multiple systems. For example, with medical records, you still need to be identifiable using some unique identity when changing hospital. Even if the same identifier is not used across systems, the entity is still the same. For example, a transaction using cash is still the same transaction when you hand the cashier the money and when it is debited to the store’s account.

Value Objects: These objects describe some characteristic of a thing. A value object is much simpler than an entity. If only the attributes of an element are important and no continuity is needed, it is a value object.

In some cases, a concept is an entity, and in others, it is a value object. It depends on context. Take an address as an example: if you order something to be delivered, it would not matter if your roommate also ordered something at the same time. The delivery service could treat the address as a value object.

But if you ordered electrical service, then the service company needs to be aware if your roommate also ordered to the same address. In this case, the address is an entity.
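The difference shows up in how equality is defined. The sketch below is one possible way to express it, with hypothetical class names: the value object compares by attributes, the entity compares by identity only.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Address:
    """Value object: immutable, equal when all attributes are equal."""
    street: str
    city: str


@dataclass(eq=False)
class Person:
    """Entity: equal only when the identity matches, whatever the attributes."""
    name: str
    age: int
    id: uuid.UUID = field(default_factory=uuid.uuid4)

    def __eq__(self, other: object) -> bool:
        return isinstance(other, Person) and self.id == other.id

    def __hash__(self) -> int:
        return hash(self.id)


same_address = Address("Main Street 1", "Aarhus") == Address("Main Street 1", "Aarhus")
same_person = Person("Ann", 30) == Person("Ann", 30)
print(same_address, same_person)  # True False
```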

Services: Sometimes, it just isn’t a thing. Some important operations are not natural to place in either entities or value objects. They are activities but we must fit them into objects anyway.

If we fit the action into a wrong object the object loses clarity and becomes hard to understand. Often actions also depend on many different objects so the dependency graph becomes difficult to understand.

A service should be added as a standalone interface declared as a service. Also, make the service stateless. A stateless service does not depend on any state, which makes its behavior simpler.

Services exist in different layers, not only in the domain layer. There are also services that exist in the infrastructure and application layer. The distinction is not always clear, and it takes care to factor the responsibilities correctly.

If we take the example from the chapter: A bank has an application that sends an email to a customer if the account balance falls below a threshold.

Infrastructure layer services: This type is the easiest to explain. They are purely technical, like email or SMS services. They are used by the domain and application services to implement the actual sending of messages.

Application layer services: It is the responsibility of the application service to order the notification. The domain layer is responsible for knowing when the threshold was met.

To quote the book: Many domain or application services are built on top of the populations of entities and values, behaving like scripts that organize the potential of the domain to actually get something done.
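The split of responsibilities in the bank example could look roughly like this. It is a sketch under my own assumptions, not the book's code: all class names are hypothetical, and the infrastructure service is replaced by a recording test double instead of a real email gateway.

```python
from dataclasses import dataclass
from typing import Protocol


class MessageSender(Protocol):
    """Infrastructure service: purely technical, knows nothing about accounts."""

    def send(self, to: str, body: str) -> None: ...


@dataclass
class Account:
    owner_email: str
    balance: float
    threshold: float

    def is_below_threshold(self) -> bool:
        """Domain layer: knows when the threshold has been crossed."""
        return self.balance < self.threshold


class BalanceNotifier:
    """Application service: orders the notification, keeps no domain state."""

    def __init__(self, sender: MessageSender) -> None:
        self._sender = sender

    def check(self, account: Account) -> bool:
        if account.is_below_threshold():
            self._sender.send(account.owner_email, "Your balance is low.")
            return True
        return False


class RecordingSender:
    """Test double standing in for a real email or SMS gateway."""

    def __init__(self) -> None:
        self.sent = []

    def send(self, to: str, body: str) -> None:
        self.sent.append((to, body))


sender = RecordingSender()
notifier = BalanceNotifier(sender)
notifier.check(Account("ann@example.com", balance=50.0, threshold=100.0))
print(sender.sent)
```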

Chapter 6, The Life Cycle of a Domain Object

All objects in a system have a life cycle: they are created, change state, and get removed. Some objects are simple, others are complex. We need to take special care of the complex objects. The challenges are:

Maintain integrity throughout the life cycle
Prevent the model from getting swamped by the complexity of managing the life cycle

These challenges are managed by applying three patterns: aggregates, factories, and repositories.

Aggregates: Most business models contain many complex relations between objects. It creates fuzzy boundaries where we can end up with a huge interconnected object hierarchy. It can lead to problems with maintaining consistency. To avoid this, we must promote some of the objects to aggregates for a group of objects. Only inside the aggregate, we need to maintain consistency.

Outside the aggregate, every reference only points to the aggregate root, and all access to the other objects goes through the aggregate root.

Those rules make it possible to maintain consistency inside the aggregate because access is controlled. More thoughts on aggregates.

Factories: Creating an object can be complex. We do not want to have this complexity encapsulated inside the constructor of an object. The example in the book is: we need a car factory to build a car. A car does not need to know how to build itself.

The construction logic is moved into a separate factory. Additionally, we want to create a factory interface that the domain depends on, so the concrete implementation is abstracted away.

A factory is a good choice to create an aggregate. But what if we need to create an object of a class inside an aggregate?

One option is to add a factory method on the aggregate root, for example, if the aggregate root is a Purchase Order, it could have the method newItem() to create a Purchase Item object and add it to the aggregate. This will work fine if the new object is part of the aggregate.

If the object is not part of the aggregate, it needs to be an aggregate root, which should have its own factory for constructing it.
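A factory method on the aggregate root could be sketched as follows. The approved limit is an invariant I assume for illustration, and the names follow the Purchase Order example only loosely; this is not the book's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PurchaseItem:
    product: str
    price: int     # in cents
    quantity: int


class PurchaseOrder:
    """Aggregate root: the only entry point to the items it owns."""

    def __init__(self, approved_limit: int) -> None:
        self.approved_limit = approved_limit
        self._items = []

    @property
    def items(self) -> tuple:
        return tuple(self._items)  # read-only view for outside callers

    def total(self) -> int:
        return sum(item.price * item.quantity for item in self._items)

    def new_item(self, product: str, price: int, quantity: int) -> PurchaseItem:
        """Factory method: creates the item and adds it in one atomic step."""
        item = PurchaseItem(product, price, quantity)
        if self.total() + item.price * item.quantity > self.approved_limit:
            raise ValueError("order would exceed its approved limit")
        self._items.append(item)
        return item


order = PurchaseOrder(approved_limit=1000)
order.new_item("guitar", 400, 2)
print(order.total())  # 800
```

Because items can only be created through the root, the invariant is checked in exactly one place.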

When designing an interface for a factory keep this in mind:

Each operation must be atomic
The factory will be coupled with its arguments

Repositories: To do anything with an object, a reference to it is needed. This reference is obtained by traversing an association in the domain model. However, we do not always want to add associations since it muddles the model. A better way is to use a repository to recreate a domain object from persistence.

To access aggregates we often need to search them based on object attributes. From the aggregates, we can traverse the associations to get to other objects. We do not want to allow free database queries since that can break the constraints imposed by the domain model.

A repository implements an illusion of an in-memory collection of all objects of that type. Only provide repositories for the aggregates that need direct client access, to keep the clarity of the model.

If complex access is needed to search for objects, a specification pattern can be implemented in the domain language.
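A repository with a simple specification could be sketched like this. It is an in-memory illustration with hypothetical names; a real implementation would translate the same interface to database queries.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Customer:
    id: int
    name: str
    city: str


# A specification is modeled here as a predicate over the aggregate.
Specification = Callable[[Customer], bool]


def lives_in(city: str) -> Specification:
    """Specification named in the domain language."""
    return lambda customer: customer.city == city


class InMemoryCustomerRepository:
    """Gives the illusion of an in-memory collection of all customers."""

    def __init__(self) -> None:
        self._by_id = {}

    def add(self, customer: Customer) -> None:
        self._by_id[customer.id] = customer

    def get(self, customer_id: int) -> Optional[Customer]:
        return self._by_id.get(customer_id)

    def matching(self, spec: Specification) -> list:
        return [c for c in self._by_id.values() if spec(c)]


repo = InMemoryCustomerRepository()
repo.add(Customer(1, "Ann", "Aarhus"))
repo.add(Customer(2, "Bob", "Odense"))
print([c.name for c in repo.matching(lives_in("Aarhus"))])  # ['Ann']
```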

Chapter 7, Using the Language: An Extended Example

In this chapter an example of a system is discussed. The system implements a small domain model for handling shipments of cargo.

The first point is that to use a system, many user-level application functions must be defined. One of the functions is a tracking query to allow us to see the handling events of a particular cargo. The application classes are coordinators. They must not work out the answers to the questions they ask. That is the responsibility of the domain layer.

It is not always necessary to have a repository for an aggregate root. If there is no application requirement for looking up an aggregate and it is only referenced from other entities, then the repository is not needed.

When our domain grows, we must take care to group the domain objects into modules. Do not fall into the trap of grouping all entities into one module and all value objects into another. Instead, the modules should reflect meaning, like “Customer”, “Shipping” and so on. Each module should contain the entities and values that have a relation to each other.

In the example an external system needs to be integrated into the existing application. To make sure the external system does not corrupt the existing model, an anti-corruption layer is added to shield the model.

In the example the external system holds information about how much may be booked of a particular cargo type. To handle the translation between our system and the external system an “Allocation Checker” service is added. This service acts as the anti-corruption layer.
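A rough sketch of such a translating service is shown below. Everything about the external system here is invented for illustration: its class name, method, category codes, and quota numbers are assumptions, not details from the book.

```python
class LegacySalesSystem:
    """Hypothetical external system with its own vocabulary and data shapes."""

    def fetch_quota(self, category_code: str) -> dict:
        quotas = {"HAZ": {"max_units": 100}, "STD": {"max_units": 1000}}
        return quotas[category_code]


class AllocationChecker:
    """Anti-corruption layer: translates between our model and the external one."""

    # Our cargo types mapped to the external system's category codes (made up).
    _CODES = {"hazardous": "HAZ", "standard": "STD"}

    def __init__(self, external: LegacySalesSystem) -> None:
        self._external = external

    def may_accept(self, cargo_type: str, already_booked: int, requested: int) -> bool:
        quota = self._external.fetch_quota(self._CODES[cargo_type])
        return already_booked + requested <= quota["max_units"]


checker = AllocationChecker(LegacySalesSystem())
print(checker.may_accept("hazardous", already_booked=90, requested=20))  # False
```

The rest of the model only talks to AllocationChecker in its own terms, so a change in the external system stays contained in this one class.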

Chapter 8, Breakthrough

When the model is continuously refined, we will only see small improvements until we reach a eureka moment where a refinement makes the model much more expressive and sends a shock through the project.

It is the small increments of improvement that pave the way for the big improvements. The model becomes a deep model that expresses a deep understanding of the domain. It should allow the communication between the technical and business staff to improve because the model gives a shared language that improves understanding across the project.

The chapter goes through a story about how a breakthrough happened in one of the projects the author worked on. A new understanding of the domain often requires a large and scary refactoring to make the required changes. It requires courage to do the needed changes.

In many cases a breakthrough will clarify the model to allow other problems to be visible. This causes a cascade of improvements that makes the model much deeper.

Chapter 9, Making Implicit Concepts Explicit

How do we make a deep model? The power in the deep model is that it allows us to express the knowledge, activities, problems, and solutions in the domain flexibly and succinctly.

In the beginning, the developers are novices in the domain so how do we extract the deep model from the business people? It is going to happen gradually, and the chapter presents ways to extract knowledge.

First the concepts must be found. It can happen in different ways, by listening to the language of the team. If the domain experts use terms that express a concept in the domain, but it is not part of the domain model, then it is a hint that the concept needs to be added to the model.

Listen to conversations: In other cases the missing concepts are not part of the conversations. Then you must dig and invent in the most awkward place in the model. In the place where every new requirement adds complexity. With help from the business partner, this area can be understood better, and the model improved.

Contradictions: In some cases domain experts see things in different ways, sometimes even contradictory ones. This often indicates an area where a deeper model can be achieved.

Read the book: If there is already literature on the subject discussing terminology and fundamental wisdom, you may be able to start with a deeply considered view.

Explicit constraints: When doing object-oriented modeling, constraints on objects often come up. For example, a class implementing a bucket has a limited capacity, which is a constraint. The constraints are often implicit, but making them explicit can improve the model. The constraint can either be factored into a separate method in the class or into its own class entirely.
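For the bucket, factoring the constraint into its own method could look like the sketch below. The method names are my own, and raising an error on overflow is one possible policy chosen for illustration; the model could just as well cap the contents at the capacity.

```python
class Bucket:
    def __init__(self, capacity: float) -> None:
        self.capacity = capacity
        self.contents = 0.0

    def is_space_available(self, added_volume: float) -> bool:
        """The constraint, named and stated in exactly one place."""
        return self.contents + added_volume <= self.capacity

    def pour_in(self, added_volume: float) -> None:
        # Assumed policy: refuse to overflow instead of silently capping.
        if not self.is_space_available(added_volume):
            raise ValueError("bucket would overflow")
        self.contents += added_volume


bucket = Bucket(capacity=10.0)
bucket.pour_in(7.0)
print(bucket.is_space_available(5.0))  # False
```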

Processes as Domain Objects: Procedures should not be a prominent part of the model. But some processes encapsulate business meaning and can be implemented as a service. The way to decide is if the process is something domain experts talk about or not.

Chapter 10, Supple Design

Software must serve users, but first, it must serve developers. Developers refactor, extend and build on the software. As time goes on and the software is in maintenance mode, developers will still change the code.

If the software lacks good design, it is increasingly difficult to change. As soon as developers are not confident in the changes they make, duplications appear. Developers will be unhappy working in the codebase. This effect is often seen in projects where, over time, the amount of effort needed to make new features or fix bugs increases until the project reaches a standstill.

The way to make sure software is changeable is to make the design supple, which complements deep modeling. Making a design supple is an iterative process where refactorings are tried out and implicit concepts are made explicit. In this chapter the author covers which experiments to do and how to gain a better understanding of the architecture we want to code towards.

A developer has two roles when programming. In one role, the developer is a client that uses the domain objects in the application code. The domain should make it easy to express the scenarios needed in the application, and the elements should fit together naturally. In the other role, the developer works to change the domain. It requires the design to be open to change, and consequences of a change must be easy to understand.

The rest of the chapter goes through a series of patterns to use to arrive at a supple design.

Intention-Revealing interfaces: It must never be a requirement for a developer to know the implementation of a component to use it. If that level of knowledge is needed, we lose the value of encapsulation. It goes a long way to name classes and operations in a way that describes their effect and purpose. It helps to use Test Driven Development to make sure the intent of the code is clear.

The example from the chapter is a simple Paint class. The class has a method named “paint”, but that does not reveal what the method does. Instead, it should be renamed to “mixIn” to show that the method mixes paints together.

Side-Effect-Free functions: Operations can be divided into two types, commands and queries. Commands modify the state of the system, and queries only read information.

A side effect happens when a method is called and some state changes. As we combine multiple method calls at arbitrary depths, it gets tough to reason about which side effects are triggered. It limits the level of richness the developer can express.

A function is a method that we know does not produce a side effect. In any system there should be as many functions as possible. Strictly segregate commands into very simple operations that do not return domain information.

The example in the chapter deals with two objects each representing a volume and color of paint. If we use the mixIn method from above what will happen with the two objects? Should the state of paint 1 change to the new combined color and volume of the two paints? What about the volume of paint 2, should it become 0 after the mix?

To avoid this problem the paint objects can be immutable value objects such that a mixIn returns a third paint object causing the two other paints to have no change. It is a case that is much easier to reason about.
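A minimal sketch of the immutable variant could look like this in Python. The Paint fields and the volume-weighted mixing rule are assumptions for illustration, not the book's implementation. What matters is that mix_in returns a new object and leaves both inputs untouched.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Paint:
    """Immutable value object: a volume of paint with an RGB color."""
    volume: float
    red: int
    green: int
    blue: int

    def mix_in(self, other: "Paint") -> "Paint":
        # Side-effect-free: neither self nor other is modified.
        total = self.volume + other.volume
        if total == 0:
            return Paint(0.0, 0, 0, 0)

        def blend(a: int, b: int) -> int:
            # Volume-weighted average (an assumed mixing rule).
            return round((a * self.volume + b * other.volume) / total)

        return Paint(
            total,
            blend(self.red, other.red),
            blend(self.green, other.green),
            blend(self.blue, other.blue),
        )


first = Paint(1.0, 200, 100, 0)
second = Paint(1.0, 100, 100, 200)
mixed = first.mix_in(second)  # a third Paint; first and second are unchanged
```

Because the dataclass is frozen, any attempt to assign to first.volume raises an error, so the question of what happens to paint 1 and paint 2 never comes up.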

Assertions: There will still be a collection of methods on the entities that produce side effects. Assertions make side effects explicit and easier to deal with.

When calling a method that delegates the work to other methods, the only way to know which side effects are being triggered is to trace the branches through the program. It breaks encapsulation. It also breaks abstraction, because interfaces do not enforce anything about side effects, so the answer depends on the concrete implementations.

The concept is built into some programming languages in the “design by contract” school. For C# there exists Spec Sharp, but there is no standard language support for the concept.

The example from the chapter refactors the paint concept further to only have a single method with a side effect. They defer the responsibility for the assertions to the test code. In my opinion, this forces a developer using the component to read the tests to get the correct understanding of the component. It is a bit inelegant, I think. However, to make it better, language support is needed.

Conceptual Contours: This pattern deals with effective decomposition. There are two extremes when dealing with decomposition. In one case the concepts are grouped in a large monolith. This forces duplication because it is impossible to reuse parts of the monolith and it is difficult to understand. In the other case, if the decomposition is too fine-grained, it forces the client to understand how the tiny pieces fit together.

We should try to find the deep consistency in the domain and group design elements into cohesive units. This grouping is based on intuition about the domain. The resulting interfaces should logically make sense in the ubiquitous language.

The contours often only show up after a lot of refactoring towards deeper insights.

Standalone Classes: Every dependency in a class makes it more difficult to reason about. The more dependencies there are, the more difficult testing will be. The difficulty will increase exponentially.

Low coupling is the fundamental design goal to strive for. If all dependencies are removed, the class can be understood by itself. This eases the cognitive load when understanding a module.

Closure of Operation: In the pattern above we risk dumbing down our model because we remove dependencies to a point where it gets less expressive. This pattern deals with how we manage the dependencies.

When it fits we should define return types as the same type as the arguments. I think this is what is also known as a fluent interface.

It can benefit us to create a more declarative style of design. As with everything, this can be taken to extremes like model-driven design, where code generation tools generate the actual code. That approach is not always flexible enough. The chapter elaborates on how to change the specification pattern to be more supple and declarative. It shows how to extend the specification to use and/or/not operators to combine specifications.
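To make the and/or/not idea concrete, here is a small Python sketch of a composable specification. It is not the book's code: the Specification class, the Invoice object and the two example rules are hypothetical. Each operator returns a new Specification, which is also an instance of closure of operations.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

T = TypeVar("T")


class Specification(Generic[T]):
    """A predicate over candidates that can be combined with &, | and ~."""

    def __init__(self, predicate: Callable[[T], bool]) -> None:
        self._predicate = predicate

    def is_satisfied_by(self, candidate: T) -> bool:
        return self._predicate(candidate)

    # Every operator returns another Specification, so the result can be
    # combined again.
    def __and__(self, other: "Specification[T]") -> "Specification[T]":
        return Specification(
            lambda c: self.is_satisfied_by(c) and other.is_satisfied_by(c))

    def __or__(self, other: "Specification[T]") -> "Specification[T]":
        return Specification(
            lambda c: self.is_satisfied_by(c) or other.is_satisfied_by(c))

    def __invert__(self) -> "Specification[T]":
        return Specification(lambda c: not self.is_satisfied_by(c))


@dataclass(frozen=True)
class Invoice:
    """Hypothetical domain object used only for this example."""
    amount: float
    days_overdue: int


is_large: Specification[Invoice] = Specification(lambda i: i.amount > 1000)
is_overdue: Specification[Invoice] = Specification(lambda i: i.days_overdue > 0)

# Declarative combinations read close to how a domain expert would say it.
needs_follow_up = is_large & is_overdue
can_wait = ~is_overdue | ~is_large
```

The client code states what it wants (large and overdue) and never spells out how the check is evaluated.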

Chapter 11, Applying Analysis Patterns

As a software developer, you know about design patterns. Their purpose is to solve a low-level technical problem in a proven way.

Analysis patterns are in a way the same, except that they focus on the domain instead. That makes the patterns more domain-specific, but at the same time, it allows us to capture more high-level concepts. If an analysis pattern already exists for what we are trying to achieve, it can help produce a good solution faster because we can avoid some of the refactoring steps.

The only resource the chapter points to is Analysis Patterns by Martin Fowler, which is quite an old book. As far as I have been able to find, not much else has been written on this topic.

Chapter 12, Relating Design Patterns to the Model

Design patterns are well known in the software development literature, but their focus is mostly technical. However, some of the patterns can be used in a domain model as well. Our thinking needs to be a bit different. The author shows two examples using the Strategy Pattern and the Composite Pattern.

Strategy (policy): A domain model contains processes that are meaningful for the domain and not technically motivated. In many business domains, there is a need to have different processes to solve the same problem. Even though the strategy pattern is technically motivated, it fits exactly this purpose: to separate the varying parts of a process into different policies.

The chapter example is about finding routes for package delivery. We could have the policy to save money for each leg of the journey or look for the fastest route. In this case, the strategy is not just implemented for a technical reason but just as much for domain value.
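As a rough sketch of the routing policies, assuming made-up names (Leg, RoutingService, cheapest, fastest) that are not taken from the book, the strategy can be as small as a function passed to the service:

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Leg:
    """One candidate leg of a journey (hypothetical fields)."""
    origin: str
    destination: str
    cost: float
    hours: float


# A policy picks one leg among the candidates for the same hop.
LegPolicy = Callable[[Iterable[Leg]], Leg]


def cheapest(candidates: Iterable[Leg]) -> Leg:
    return min(candidates, key=lambda leg: leg.cost)


def fastest(candidates: Iterable[Leg]) -> Leg:
    return min(candidates, key=lambda leg: leg.hours)


class RoutingService:
    """Chooses legs; the business policy is injected, not hard-coded."""

    def __init__(self, policy: LegPolicy) -> None:
        self._policy = policy

    def choose(self, candidates: Iterable[Leg]) -> Leg:
        return self._policy(candidates)
```

The policy names belong to the ubiquitous language, so a domain expert can follow a statement like RoutingService(cheapest) without reading the implementation.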

Composite: In many domain models we end up modeling concepts that consist of parts that we can arrange arbitrarily into a tree structure. If we do not see this and start implementing each part as a unique concept instead of seeing it as composable, we will end up with duplication, and it will limit the flexibility of the model.

The example in the chapter is routes made of routes. A route is a complex concept that consists of routes with legs. Since each “sub-route” can be planned and managed by different persons, it needs to be a concept on its own. It is where the composite shines.
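A minimal Python sketch of routes made of routes could look like this. The class names, place names and distances are invented for the example and are not from the book. A leg and a composite route share one interface, so a client does not need to know which one it holds.

```python
from abc import ABC, abstractmethod
from typing import List


class Route(ABC):
    @abstractmethod
    def distance_km(self) -> float:
        """Total distance covered by this route."""


class Leg(Route):
    """Leaf: a single stretch between two locations."""

    def __init__(self, origin: str, destination: str, km: float) -> None:
        self.origin = origin
        self.destination = destination
        self.km = km

    def distance_km(self) -> float:
        return self.km


class CompositeRoute(Route):
    """A route built from other routes, which may be legs or composites."""

    def __init__(self, parts: List[Route]) -> None:
        self.parts = list(parts)

    def distance_km(self) -> float:
        return sum(part.distance_km() for part in self.parts)


# A sub-route planned by one person, reused inside a larger route.
sea_route = CompositeRoute([
    Leg("Port A", "Port B", 330.0),
    Leg("Port B", "Port C", 480.0),
])
door_to_door = CompositeRoute([Leg("Warehouse", "Port A", 12.0), sea_route])
```

Calling door_to_door.distance_km() walks the whole tree, while sea_route can still be managed as a concept on its own.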

Chapter 13, Refactoring Toward Deeper Insight

There are three major points to consider to gain deep insight:

Live in the domain
Keep looking at things in a different way
Maintain an unbroken dialog with domain experts

In this chapter the author walks through different aspects of how to improve the domain model.

Initiation: The first step to getting a better model is to see the problem. It might show up in awkward parts of the code because of a missing concept. It might be that the language in the model is disconnected from the language the domain experts use. However, when the problem is located the model can be refactored.

Exploration Teams: If we already have an idea for a refinement of the model, we can refactor the code directly. But in some cases the search for a new model is more involved and requires more time and involvement from the team.

A team of a few developers and a domain expert gets together in a conference room for half an hour to an hour and a half to sketch a new model. It should result in a rough idea for a model. The team might need to sleep on it and get together again to reach useful conclusions. The key points to get it to work are:

Self-determination: a small team can be assembled on the fly to explore design problems. There should only be a need for a short-lived team.
Scope and sleep: two or three short meetings spread out across a few days should give a model that is worth trying out.
Exercising the ubiquitous language: use the language and refine its use in brainstorm sessions with the rest of the team and a domain expert.

Prior Art: Use knowledge from books about the domain. If there are analysis patterns, use them. Design patterns can often also be used to model concepts in the domain.

A Design for Developers: We develop software for users, but it must also be developed for the developers. The code is going to change again and again when it is refactored towards deeper insights. When a design is supple, it is easy to see the intent and easy to anticipate what will happen if it is changed. That is what makes the design work for the developers.

Timing: If you wait until you can make a complete justification for a change, you have waited too long. The longer changes are postponed, the more costly and difficult they are to make. Most teams are too cautious about refactoring. It is “easy” to see that refactoring is going to be expensive to implement. However, the cost of working around a bad design is often higher than the refactoring cost. It is just more difficult to see.

Refactor when:

The design is mismatched with the team’s understanding of the domain
Important concepts are implicit in the design
There is an opportunity to make important parts more supple

Crisis as Opportunity: When reading about refactoring, it seems like a slow and incremental process. Often it is not like this. Refactorings lay the groundwork for a sudden insight that reveals something in the domain, which leads to a burst of refactorings towards deep insight.

Chapter 14, Maintaining Model Integrity

This chapter and the remainder of the book go into strategic design, so they are more high level. I think the advice is more suited for larger projects with more people. However, many of the points make good sense to implement when starting a new project.

A model must be logically consistent to make sense. When having a system that must span a large domain, it would be ideal to have a single model. However, that becomes difficult or impossible to do as the model grows. The example from the chapter is two teams working on the same system. One team had implemented an object named Charge. When the other team started implementing a billing module, they needed a Charge object as well, and to reuse code they reused the implementation already in the model. However, their changes created bugs in the original implementation.

They ended up splitting into two models. It is often too costly or not feasible to create a unified model. Some of the risks are:

Too many legacy replacements may be attempted at once
Large projects may bog down because the coordination overhead exceeds their abilities
Applications with specialized requirements may have to use models that don’t fully satisfy their needs, forcing them to put behavior elsewhere.
Conversely, attempting to satisfy everyone with a single model may lead to complex options that make the model difficult to use.

The remainder of the chapter provides patterns for how we can create boundaries and communicate relationships between different models. The list of patterns is a bit long.

Bounded context: When building a large software system, even on a single team, different models emerge if we are not careful. It is easy to see when integrating with an external system that the two systems have different conceptual models. However, different models might also arise in the same codebase. It could be that some parts of the code reflect an older understanding of the code. When the models are combined, errors happen, making the system less reliable and more difficult to understand.

A model applies in a context, so we should start by explicitly defining the context in which the model applies. Teams working in the same context need to communicate a lot to create a shared understanding. If the teams only communicate once in a while, they are not working in the same context, and the model will fragment.

A defined bounded context gives two advantages. The team(s) working inside the context know that they must keep the model consistent. Also, any team working outside will use a translation layer to communicate with the model, giving them much more freedom.

When fractures in the model happen the remainder of the patterns give advice on what to do.

Continuous Integration: In a bounded context it is not always possible to break it down into smaller contexts because it loses valuable information and options. Multiple problems can challenge the model. If a developer is not aware of a concept in the model and builds a similar concept, we end up with duplication and the problem with two concepts that diverge. A similar problem happens if a developer is overcautious and knows the concept in the model but because of the risk of change duplicates instead.

Advice for having better control and more confidence in the code is:

Reproducible builds
Automated tests
Rules for integration of code changes
Use the ubiquitous language in the team to facilitate communication

Continuous integration is only used inside a bounded context; there is no need for it across multiple contexts.

Context Map: With multiple bounded contexts, the need to interconnect them will occur. When that happens, it is important to have a model of those connections, so the teams do not start blurring the boundaries between the contexts. A context map sits in the overlap between project management and software design.

Team members that sit together will naturally start sharing a bounded context. However, be aware of team members that sit in other locations, since it will require extra integration effort to share the same context. In many cases, a small diagram with the names of the different bounded contexts is enough to make the distinction visible to the developers.

Contact points between bounded contexts are important to test because tests can alert about errors before they become a problem.

Shared kernel: In some cases there is significant value in reusing the same parts of the model across multiple bounded contexts, but full continuous integration is too expensive. In this case, a shared kernel can be defined, containing a subset of the domain model. The subset is shared, and coordination must happen for any change in this part of the code.

Customer/Supplier Development teams: In many systems there are subsystems which receive data from our system. The subsystem might also be built using another language, making code sharing impossible, and it might also serve another user group. Naturally, the systems are in different bounded contexts. There is a delicate balance between the systems. If the subsystem developers do not want to implement the changes we request, our system might be impossible to develop. In the same way, if we have veto power to stop changes in the subsystem, it will make the subsystem difficult to develop.

The differences can be accommodated by making all teams part of planning to make sure priorities are aligned. Developing automated acceptance tests also goes a long way to make sure the interactions between the systems keep working.

It is crucial to make sure that the teams can cooperate. If not, the relationship can break down.

Conformist: If the customer/supplier relationship cannot be established, the team might need to shift to being a conformist. We have two options if integration with the other system is needed. Either we set up an anticorruption layer that can absorb any changes the other team makes, or, if the models are largely compatible, the other model can be used directly, and our system conforms to the model of the other system.

Anticorruption Layer: When we are tasked to integrate with a legacy system or other external system, they often have their own model. It is important that our domain model is as expressive as possible, so we do not want the model of the external systems to leak through and “pollute” our domain model.

To make sure leaks do not happen an isolation layer is created. The purpose of the layer is to contain any translations needed to communicate with the external systems and do any semantic translations to our domain model. If the layer is strict, it should allow us to develop our domain model without worrying about the semantics and model of external systems.

The public interface of the anticorruption layer usually is a list of services with the occasional entity. Such a new layer allows us to re-abstract the behavior and model of the other system in a consistent model. The layer itself is often built from some facades, adaptors, and translators.
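The facade and translator roles can be sketched in Python as follows. Everything here is a hypothetical illustration: the legacy record layout (CUST_NO, NM, STAT), the LegacyCustomerApi stub and the class names are assumptions, not a real system or the book's code.

```python
from dataclasses import dataclass


class LegacyCustomerApi:
    """Stand-in for the external system, returning records in its own model."""

    def fetch(self, cust_no: str) -> dict:
        # Assumed legacy record layout, hard-coded for the example.
        return {"CUST_NO": cust_no, "NM": "Ada Lovelace", "STAT": "A"}


@dataclass(frozen=True)
class Customer:
    """Our domain model, free of legacy field names and status codes."""
    customer_id: str
    name: str
    active: bool


class CustomerTranslator:
    """Does the semantic translation from the legacy record to our model."""

    def to_domain(self, record: dict) -> Customer:
        return Customer(
            customer_id=record["CUST_NO"],
            name=record["NM"],
            active=record["STAT"] == "A",  # assumed meaning of the status code
        )


class CustomerService:
    """Public interface of the layer: a service expressed in our own terms."""

    def __init__(self, api: LegacyCustomerApi, translator: CustomerTranslator) -> None:
        self._api = api
        self._translator = translator

    def find_customer(self, customer_id: str) -> Customer:
        return self._translator.to_domain(self._api.fetch(customer_id))
```

The rest of the domain code only sees Customer and CustomerService, so a change in the legacy record layout is absorbed inside the translator.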

Separate Ways: Integrating systems is expensive, and integrations are not always needed. If we can manage with a hyperlink in a UI to an external system, that is a much cheaper solution.

Open Host Service: Each bounded context needs a translation layer for each component it needs to communicate with outside the context. But if the component needs to be used by many others, it might be cumbersome to create custom translators.

In this case it might make sense to create a protocol that exposes a list of services, like a REST API or similar.

Published Language: Translating between two bounded contexts requires a common language. This language can become complex and hard to document. If businesses need to exchange data, they probably do not want to conform to the language of the other party.

In some domains, a shared language has been developed for this purpose. Examples are BIAN, CML, and many XML schemas. If a language already exists, it is worth researching whether it can be used.

The rest of the chapter is dedicated to points on how to move the project between the different patterns, from Separate ways to shared kernel to continuous integration and so on.

Chapter 15, Distillation

If the domain gets large, it becomes difficult to manage. By distilling the domain, core concepts can be communicated more clearly. As we refactor towards deeper insight the model gets clearer, but how do we manage it when the domain is large? We want all team members to see the overall design and how it fits together. The model should facilitate communication by having a core model of a manageable size to allow new team members to use the ubiquitous language. Distillation should guide refactoring and focus work on the areas of the model that give the most value.

The chapter contains a list of patterns to help us reach those goals.

Core domain: In a large system there are going to be many contributing components. However, many components will obscure the essence of the domain model. If the system is hard to understand it is hard to change. Not all parts of the design are refined equally. The critical core should be sleek and fully leveraged to create functionality.

The core should be easily distinguishable from the rest of the domain model. Also, it should be small. Make the best developers work on the core to make it pristine. It must be refactored to be a deep model and be supple at the same time. We should focus on investments in other parts of the system based on how the other parts support the distilled core.

The remaining patterns help us make the core easier to see, use and change.

Selecting which part of the domain model to include in the core is not an easy task. Even if a concept is central to the model, like generic money, it might not be important enough to include in the core unless it is a money trading application we are building.

Generic subdomains: Some parts of the model might not capture or communicate any specialized knowledge. Those parts should not be part of the core domain. General concepts that everyone knows and that play a supporting role will pollute the core model.

Identify cohesive subdomains that are not the primary motivation of the software. Separate them into modules that do not reference any of the specialties. After that, they can be developed independently of the core domain and with a lower priority. This separation also allows us to consider different options for developing the subdomain.

Off-the-shelf-component
Published design or model
Outsourced implementation
In-house implementation

Each option has different advantages and disadvantages explained in the chapter.

Domain vision statement: When the project is started there is no model but we still need a way to focus effort. Later in the project we need to communicate the value of the system without an in-depth study.

To facilitate those needs, a vision statement can be created. This is a description about one page long describing the core domain and its value proposition. It should be written at the beginning of the project and revised as new insights are found.

Highlighted core: The vision statement creates an overview of the core domain. But it is still up to individual interpretation. If the team does not have exceptional communication skills, it will not have much impact.

To make the core stand out, structural changes need to be made in the code. However, that is not always practical to do, and it often requires the overview that is lacking. So a lighter approach is to highlight the core instead. One way is to create a distillation document that contains a list of essential objects and maybe a few diagrams. It is not supposed to be a complete design document and should be three to seven pages at most.

Another way to highlight the core is to create a flagged core. In the documents that show the design the essential elements are flagged, this can be as primitive as post-it notes in a printed design document. It allows developers to navigate the core more easily.

Cohesive mechanisms: In OOP we want to separate the “how” from the “what” in our algorithms to hide complexity. However, we sometimes hit limits in this approach. When our code starts to be bloated and difficult to read because we have too much “how” in the code, we need a different approach.

Create a lightweight framework which uses intention-revealing interfaces to hide the “how”. The domain code can then be made clearer.

Segregated Core: When there are elements in the model that serve both the core domain and supporting roles the core might be coupled to generic concepts. This causes clutter that makes the model less clear.

The code should be refactored to separate the core concepts from the supporting concepts. This should strengthen cohesion in the code while reducing coupling to other code.

The steps are usually:

Identify the core domain
Move related classes to a new module, named for the concepts that relate them
Refactor code to remove the connections to data and concepts that are not directly related. Scrub the core domain to make it self-explanatory
Refactor the newly segregated core module to make it simpler and more communicative.
Repeat with another core subdomain until the segregated core is complete

Chapter 16, Large-Scale Structure

When systems grow very large, even breaking them up into modules may not be enough. In some cases, the system grows to be so complex and contains so many modules that the number of modules alone makes the system hard to understand. In this case, the developers are not able to see the forest for the trees.

A pattern of rules or roles and relationships that span the entire system must be created. It should allow understanding each part’s place in the whole, without detailed knowledge of the parts themselves.

Evolving order: If there are no constraints on the design of a system it will evolve into a system that nobody understands and is difficult to maintain. However, if we impose strict design constraints and up-front assumptions, it can hinder the development of the system because it limits modeling power. It causes the developers to dumb down the system to fit the structure, or not have a structure at all.

A large scale structure should be applied when we find a structure that greatly clarifies the model in the system. An ill-fitting structure is worse than no structure at all, so aim for a minimal solution.

Responsibility layers: In a large structure, if each individual object has handcrafted responsibilities, there is not enough structure or guidance to handle whole parts of the domain together. Impose some structure on the responsibilities to make them easier to handle.

We should refactor the model so that each domain object, aggregate, and module fits into the responsibility of one layer. Layers should communicate the realities or priorities of the domain. How to structure the layers is primarily a business modeling decision.

“Upper” layers should make sense on the backdrop of the lower levels, and the “lower” levels should stand alone.

Different examples of layers that fit many different types of domains are suggested in the chapter. The layers are business oriented, so according to the author the layers presented will fit almost all domains.

Knowledge level: The more generic a domain needs to be, for example if we need objects to interact based on rules that are changeable by users, the more complex the system gets.

It could be a CRM system where each installation is customized for specific customer needs by configuration. In this case, we need a distinct set of objects that describe and constrain the structure and behavior of the basic model. A “meta-level” of sorts.

Pluggable component framework: With very mature models that are deep and distilled, opportunities arise. When multiple systems need to interoperate and are based on the same abstractions but designed independently, we need many translations between the systems, and a shared kernel is not feasible because the teams do not work closely together.

In this case we could distill an abstract core that contains interfaces and create a framework that allows multiple implementations that can be substituted easily.

The example in the chapter is the SEMATECH CIM framework, which is a framework for industrial machines for semiconductor manufacturing. The software of each machine must adhere to the interfaces designed in the CIM framework, but when it does that, it is freely interchangeable.
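The idea of an abstract core with interchangeable implementations can be sketched in Python with a Protocol. The interface and machine names below are invented for illustration and have nothing to do with the real CIM interfaces.

```python
from typing import Protocol


class MachineController(Protocol):
    """Interface defined in the abstract core (hypothetical)."""

    def start_job(self, job_id: str) -> str:
        ...


class EtcherController:
    """One independently developed implementation."""

    def start_job(self, job_id: str) -> str:
        return f"etcher started {job_id}"


class FurnaceController:
    """Another implementation, built without knowledge of the first."""

    def start_job(self, job_id: str) -> str:
        return f"furnace started {job_id}"


def run(controller: MachineController, job_id: str) -> str:
    # The framework only depends on the interface, so any conforming
    # component can be plugged in.
    return controller.start_job(job_id)
```

Neither controller imports the other. They only agree on the interface, which is what makes them substitutable.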

Chapter 17, Bringing the Strategy Together

The three driving principles, context, distillation, and large-scale structure, are complementary. I see many good points in this chapter, but since I am not working on any projects that large, I have not invested much time in it.

Conclusion

It is a great book, worth reading multiple times to internalize all the knowledge. Huge recommendation from me.

Originally published at Datadriven-investment.com.

DockerCon EU 18 — Monitoring Docker Containers in Swarm mode

Frederik Banke — Mon, 03 Dec 2018 20:00:26 GMT

At DockerCon EU 18 I held a tech talk about monitoring Docker containers in swarm mode. Since the talk was only 20 minutes, it was not possible to cover all the interesting details. This article provides some additional information and a tutorial for setting up a simple monitoring infrastructure for swarm mode. You can jump directly to the tutorial here.

The slides from the talk are available here.

Having organized logging will help you monitor and troubleshoot your systems faster and with more confidence. Good logging will help you no matter whether you run containerized or non-containerized applications.

The default logging setup in Docker swarm mode does provide a good starting point. But we can improve it a lot by setting up additional infrastructure.

Monitoring

When a system is running the load will fluctuate, the services will become slower or faster as the system evolves. But unless we monitor the numbers, we will not be able to see trends and proactively solve problems.

Some of the crucial aspects to monitor are

Response time on critical operations
Amount of requests to services
Server utilization (memory, disk, CPU)

It is vital to be alerted when anything unexpected happens, like a disk running low on space or response time from a service peaking.

Monitoring allows us to scale servers before the load gets too high, notify developers that their changes impacted performance, and add more disk space before the service breaks down. It allows us to sleep better at night knowing that we are alerted before a problem impacts production.
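The disk space case shows the principle in its simplest form: measure a number, compare it to a threshold, and alert before the limit is reached. The Python sketch below is only an illustration. The function names and the 90% threshold are arbitrary assumptions, and a real setup would send the alert to a monitoring system instead of printing it.

```python
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Return how full the disk holding `path` is, in percent."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def needs_alert(percent_used: float, threshold: float = 90.0) -> bool:
    """True when usage has reached the (assumed) alert threshold."""
    return percent_used >= threshold


if __name__ == "__main__":
    percent = disk_usage_percent(".")
    if needs_alert(percent):
        print(f"ALERT: disk is {percent:.1f}% full")
```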

Troubleshooting

Something is bound to go wrong in any application, at some point! When errors happen, logs are needed to see the error message and figure out what caused the problem.

In many cases, the error message does not tell a complete story of the error by itself. Instead, we need to correlate the error messages from multiple servers to piece together the interaction the user had with our application leading up to the error.

Problems with logging

If no special attention is given to logging, usually what happens is this: our systems start small, and we do not think about logging. We can easily understand the whole system, and the few times we need access to log data, we just log in to the server and view the information. As our system grows with more replicas for each service and more servers, it gets increasingly painful to do that.

Also, the log data from different services comes in various formats, making it mentally challenging to correlate when trying to fix an error in a stressful environment.

But we can do better to make it easier and gain more value from our log data.

Logging with Docker

Docker containers should adhere to the single responsibility principle. It essentially means that a container should only run a single process.

The main process in the container writes all the log data to console in the STDOUT and STDERR pipes. If you use the command:

docker logs -f <container>

you will be able to see how it looks.

This example is from an NGINX web server; it outputs its access log to the console.

This first step with Docker logging does not solve any of the problems. Each service still just logs to the local server it is deployed on. But Docker swarm builds on the concept to improve on it.

Logging with Docker Swarm

A Docker Swarm consists of a number of nodes, and the services are replicated and spread across the nodes. One of the cool things about swarm is that logs are centralized to the manager nodes.

All the replicas report logging back to the manager nodes, which allows us to view an aggregated log from across all the replicas.

The example here is from an NGINX web server with two replicas. If you compare the output to the example above additional information is prepended to each line, the IP of the host and the container identifier. The information is added by swarm to allow us to identify where the log data originates from.

It slightly improves the troubleshooting situation, but we still need to view a large amount of log data on the screen and correlate it across services to pinpoint problems, which makes the job less than optimal.

Setting up centralized logging

To further improve the logging, we need to collect all logs to a centralized location to make both monitoring and troubleshooting easier.

Such a setup consists of four components as shown here.

Services: The service stack in the swarm that generates the logs.

Collector: A service that gathers all the log data, transforms it into a unified format, and forwards it to a storage system.

Storage: The place where we store the log data. It can be any type of storage as long as it allows us to access it easily. Everything from highly advanced NoSQL indexing systems to flat files can be used, depending on our needs. Even combinations are often used.

Analyzer: The software used to search and view the log data. Most analyzers also support alerting so we can get notified when errors happen without us actively looking at the dashboards.

I admit it, this stack of services, just for logging, is way more complicated than just writing data to flat files, directly on the server. But it will give us superpowers with our log data, so the added complexity is worth it.

Each of the components has many different options, and the combinations are almost endless, which makes choosing between them hard.

The easy choice is to use one of the popular stacks like ELK (Elasticsearch, Logstash, Kibana) or TICK (Telegraf, InfluxDB, Chronograf, Kapacitor). But other options are also available.

My logging stack

I use another logging stack, mostly because I find the other stacks too complex for my needs.

The following components have helped me gain value from my log data.

Collector: Fluentd
Storage: InfluxDB
Analyzer: Grafana

Fluentd — Collector component

The purpose of any collector is to act as a bridge between our log-generating services and our storage system. Without a centralized collector we usually end up with a huge mess of custom scripts to parse log data, where not all data can be piped to all the storage systems that need it.

That is the situation fluentd sets out to solve. It can accept data from almost any source and output to a large number of systems, so you are not limited to sending all data to a single destination. With fluentd, you can have critical errors sent to Slack or email to get instant awareness of errors.

Fluentd excels at taking logs formatted as text and parsing them into JSON format. The now structured data can then be forwarded into storage for further processing.
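To make the text-to-JSON step concrete, here is a minimal Python sketch of the idea. It is not fluentd's own parser; it only assumes that the web server writes nginx's default "combined" access log format, and the field names are labels picked for this illustration.

```python
import json
import re

# Regular expression for nginx's default "combined" access log format.
# The group names (remote, user, time, ...) are labels chosen for this sketch.
NGINX_COMBINED = re.compile(
    r'(?P<remote>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<code>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_access_line(line):
    """Turn one text log line into a dict, or return None if it does not match."""
    match = NGINX_COMBINED.match(line)
    if match is None:
        return None
    record = match.groupdict()
    # Convert the numeric fields so they can be aggregated later.
    record["code"] = int(record["code"])
    record["size"] = 0 if record["size"] == "-" else int(record["size"])
    return record


# A made-up log line used only for the demonstration.
sample = ('10.0.0.2 - - [19/Apr/2019:06:44:53 +0000] "GET /index.html HTTP/1.1" '
          '200 612 "-" "curl/7.64.0"')
record = parse_access_line(sample)
print(json.dumps(record, indent=2))
```

Once every line is a structured record like this, the storage system can index fields such as the status code instead of treating the line as opaque text.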

InfluxDB — Storage component

InfluxDB is a time series database. It is designed for handling large amounts of time-stamped data. It features an SQL-like query language that will be familiar if you have used SQL.

When dealing with log data, in most cases the newest data will be more relevant and essential than older data. InfluxDB supports retention policies to help us manage the data. We can set it up to automatically aggregate the data as it ages, and only keep the newest data in full resolution. It helps us manage the space requirement of the log data.
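InfluxDB handles this aging of data inside the database itself. The plain-Python sketch below only illustrates the idea: points newer than a cutoff are kept in full resolution, and older points are replaced by one average per time bucket. The function name and parameters are made up for this example.

```python
from collections import defaultdict


def downsample(points, now, full_resolution_seconds, bucket_seconds):
    """Keep recent points as-is and replace older points with per-bucket averages.

    points is a list of (unix_timestamp, value) tuples.
    """
    cutoff = now - full_resolution_seconds
    recent = [(ts, value) for ts, value in points if ts >= cutoff]

    # Group the older points by the start time of the bucket they fall into.
    buckets = defaultdict(list)
    for ts, value in points:
        if ts < cutoff:
            bucket_start = ts - (ts % bucket_seconds)
            buckets[bucket_start].append(value)

    aggregated = [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
    return aggregated + sorted(recent)


points = [(0, 10.0), (30, 20.0), (70, 30.0), (3600, 5.0), (3650, 7.0)]
result = downsample(points, now=3700, full_resolution_seconds=200, bucket_seconds=60)
print(result)
```

The three old points collapse into two one-minute averages, while the two recent points survive untouched, which is the trade-off between storage space and resolution described above.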

Grafana — Visualizer component

Grafana is a great tool to visualize data. It excels at plotting graphs and setting up threshold values that trigger automatic alerts. You can set up any number of dashboards to allow fine-grained monitoring of your systems.

It can read data from many different data sources, which means that you are by no means locked to InfluxDB, and they can even be combined if you need data from various sources.

Grafana is built for visualizing data, which means that it is not useful for searching in log data to troubleshoot. There is no built-in way to explore the data. But its strong point is that dashboards and graphs can be set up using the interface; no mucking around in config files is needed to begin using it.

Logging flow

A logging flow creates a pipeline from the Docker containers, through fluentd, and into InfluxDB. Every log message passes through this pipeline.

When a log message is created inside the container, it is passed to the log driver, a component inside Docker. We will configure this component to forward the message to the collector. When fluentd receives the message, it will parse it into JSON format and forward it to InfluxDB for storage.

And last, Grafana can query the data in InfluxDB to visualize it using its dashboard functionality.

Docker log driver

The logging driver is the service in Docker that allows us to get log data out of a container. One thing to be aware of is that it runs as an infrastructure component inside Docker, so it does not have direct access to the same mesh network that our service stack uses. It means that it will not be able to publish log messages to the fluentd instance running inside the service stack unless fluentd’s port is published.

Tutorial for getting the setup running

The easiest way to try out the setup is to use Play with Docker. It provides a web interface to a Linux server where you can start Docker containers and try out the setup. Sessions last up to 4 hours, enough to test things out.

First, you need a new instance to work with. When you create one, you get a console.

At the top, you can see the IP address of the instance.

First, we need to initialize our swarm

docker swarm init --advertise-addr=

Next step is to deploy the service stack; you can find an example docker-compose.yml file here

version: "3.2"
services:
    webserver:
        image: nginx
        ports:
            - "8080:80"
        logging:
            driver: fluentd
            options:
                fluentd-address: 127.0.0.1:24224 # this is the port published by the fluentd service below
                fluentd-async-connect: 1
                tag: httpd.nginx

    fluentd:
        image: papirkurvendk/fluentd-influxdb
        volumes:
            - fluentd:/fluentd/log
        ports: # needs to be exposed for the logging driver to have access
            - "24224:24224"
            - "24224:24224/udp"

    influxdb:
        image: influxdb
        volumes:
            - influx:/var/lib/influxdb

    grafana:
        image: grafana/grafana:5.3.4
        ports:
            - 0.0.0.0:3000:3000
        volumes:
            - grafana:/var/lib/grafana

############## Data persisted on host #######
volumes:
    influx:
        driver: local
    fluentd:
        driver: local
    grafana:
        driver: local

You can add this to the instance by using the command

cat > docker-compose.yml

Paste the file contents into the console, press Enter, and end the input with Ctrl + D. The next step is to start the service stack:

docker stack deploy --compose-file docker-compose.yml test

Now you should see output like this, showing that the services are started.

You should be able to access the website by clicking the port number link at the top of the interface. Sometimes they do not show up, but you can create the URL manually. You need the part marked with yellow below. It is added into the URL below. When you get it right, it should show the “Welcome to Nginx!” message from the website.

http://-8080.direct.labs.play-with-docker.com/

Setting up the logging infrastructure

If you look in detail at the docker-compose.yml file, you will see that there are logging options on the web server service. It is set up to deliver log data to fluentd on 127.0.0.1:24224.

But if you run the following command, you will see that the fluentd container keeps restarting.

docker ps

It happens because fluentd cannot find its database in InfluxDB; we have not created it yet, as you can see by viewing the logs from the container.

We create the database by running this command

docker exec `docker ps | grep -i influxdb | grep -v papir | awk '{print $1}'` influx -execute 'CREATE DATABASE webserver'

Now you should see that the fluentd container keeps running. And now data is being stored in the InfluxDB database automatically.

The final step is to log in to Grafana and configure it to display the data. Use the same URL that you created to access the website, but replace the port number 8080 with 3000.

It should give you the login page for Grafana

The first thing we need to do is to add a data source, so Grafana knows where our data is located. We enter http://influxdb:8086 as the URL and “webserver” as the database name as shown below.

Grafana should respond that the data source works.

Now we can create a graph of the stored data. On the left side of Grafana is a plus where we can add a dashboard and on the dashboard we can add a graph.

It should create an empty graph which we can edit on the top arrow.

Edit the query to match the settings here

Now you will have a live updating graph counting the number of requests the Nginx container is processing. You can use this as a template to add more graphs.

Which graphs are important depends a lot on your setup and your needs.

Thanks for reading

I hope this article will inspire you to add more monitoring to your Docker setup. It takes a bit of effort to master, but it will add tremendous value. If you have any questions or something is unclear, feel free to leave a comment, and I will try to clarify.

Originally published at Datadriven-investment.com.

OAuth and OpenID Connect for dummies

Frederik Banke — Tue, 23 Oct 2018 18:36:37 GMT

To learn more about how and why OAuth 2 works the way it does, I took part in a workshop hosted by curity.io as part of the Nordic APIS summit 2018. The workshop covered the basics of OAuth 2 and OpenID Connect. I have worked a little bit with OAuth 2 before so I knew the basics, but the workshop helped me gain a better understanding of the protocol and the different parts of it.

I have tried to describe what I learned from the workshop here, both for my own reference and because it might help others understand OAuth 2 better.

The first important point to acknowledge is that OAuth 2 is designed to solve one use case, and it is not authentication, as the name might suggest. Instead, it solves this problem: a user has access to a resource and wants to allow a third party to have the same access.

In the “old days” this could be solved by the third party saving the username and password of the user, which would allow the third party to impersonate the user when accessing the resource.

But that is less than optimal for two reasons. If the user changes the password, all third parties lose access. And if the user wants to revoke the access, this cannot be done without changing the password.

OAuth 2 to the rescue

The standard has many different parts, which makes understanding it quite complicated. Much of the specification is also unimportant for most use cases, but the sheer size of the specification clouds the understanding. It is not that the concepts are that difficult when explained, but there are many places where it is easy to trip over the details.

Firstly, OAuth defines four different actors, that is, persons or systems that have a specific role to play when a user delegates access to a resource to a third party.

We have the user who is typically a real person, in OAuth jargon, this is the resource owner (RO)
Then we have the resource, that would usually be an API, in OAuth jargon: Resource Server (RS)
There is also the third party; this could be a mobile app, a website or another system that needs to access the resource, in OAuth jargon: the client
Finally, there is the OAuth server, in OAuth jargon: the authorization server (AS)

Since OAuth does not do authentication, most OAuth systems include or integrate with some kind of authentication service, because we usually need to validate that the user is who they say they are.

OAuth grant flow(s)

The four actors need a protocol to communicate with each other securely, to allow access to be delegated. The protocols are called “flows” in OAuth 2.

Since different applications have different requirements, there are four different flows or protocols for how the actors communicate. Because why make one flow if we can make many :-) The largest flows are authorization code flow and implicit flow. I will only cover the code flow; the other flows are simpler so they should be easier to understand.

Authorization Code Flow

If possible this is the flow to use; it is the most secure of the four flows. It should be well known from Google login, Facebook login and other login systems where you can log in to a third party website using your existing profile.

The protocol is shown here and requires a few more terms to explain. Each arrow signals an HTTP request or response.

As explained in the section above about actors, there is a client. A client consists of two parts: a public part and a private part. They are shown here as “User-agent” and “Client.” Think of the User-agent as the user’s browser, and the Client as a backend part hosted on a server. Why that is needed will be apparent later.

The flow consists of five steps, A to E. In the final step an access token and a refresh token are provided to the client. It is the access token that is used to access the protected resource, which is not shown in the image.

A. First, the User-agent queries the AS for an authorization code

With a simple GET request, the AS is notified that we want to request access to a resource.

https://oauthserver/authorize?client_id=awesomeclient&response_type=code&scope=read&redirect_uri=https://awesomeapplication.com/callback

The client_id is an ID that is registered with the AS when the client is set up. It is always the same across all requests and all users.

response_type=code signals to the AS which protocol/flow we want to use.

scope=read is a list of all the different scopes we wish to request access to. A scope is a name which the AS and the RS agree on, and it has specific meaning for the RS. For a CRUD API, there could be scopes for each operation type so a user could grant the client access to read but not to write.

redirect_uri is where the user agent is redirected after the authorization code is granted. This URI is often validated against a predefined list in the AS, making it essentially the same URI every time.
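As a rough sketch, the request in step A can be assembled like this in Python. The endpoint and client values are the article's example values. The state parameter is an addition not shown in the URL above; OAuth 2 recommends it so the client can match the callback to the request it started.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_authorize_url(authorize_endpoint, client_id, redirect_uri, scopes, state):
    """Build the URL the user agent is sent to in step A."""
    params = {
        "response_type": "code",      # selects the authorization code flow
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "scope": " ".join(scopes),    # scopes are sent as a space-separated list
        "state": state,               # addition for this sketch, see the note above
    }
    # urlencode percent-encodes the values, including the redirect URI.
    return authorize_endpoint + "?" + urlencode(params)


url = build_authorize_url(
    "https://oauthserver/authorize",
    client_id="awesomeclient",
    redirect_uri="https://awesomeapplication.com/callback",
    scopes=["read"],
    state="xyz123",  # hypothetical value; use an unguessable random string in practice
)
print(url)
```

Building the query string with urlencode rather than string concatenation keeps the redirect URI correctly escaped inside the outer URL.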

B. Authenticating the user

Since neither the AS nor the User-agent knows who the user is, a redirect to the authentication service is made. How this authentication happens is not part of the OAuth protocol and is up to the implementation of the service.

Usually, the user is presented with a login screen from the authentication service along with a list of the scopes. Here the user can allow or deny access to the different scopes. A well-known example of this is when a Facebook application wants access to your friend list, post on your wall and so on. Each of the types could be a scope that the user can allow or deny access.

When the user is authenticated the AS is signaled by the authentication service with user information.

C. Getting the grant code to the user agent

Next step is that the AS sends the authorization code back to the User-agent as part of the registered redirect_uri. The code is transmitted as a GET parameter with the name code. It will look similar to this:

https://awesomeapplication.com/callback?code=

The code is a one-time-use code. It is used to request access tokens from the AS.

D. Passing the authorization code to the backend

The reason for splitting the client into two parts (user agent and client) is security. The user agent could be a JavaScript application, meaning that all its parts are public. But to get the access token from the AS we need a secret that is shared between the AS and the client, to allow the AS to validate that it is the true client it is communicating with. But the secret cannot be stored in a public client that anyone can inspect or decompile.
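On the receiving side, the client has to pull the code out of the callback URL. A minimal sketch, assuming the callback arrives as a full URL string; the code value "abc123" is a made-up placeholder, and the error and state checks follow the parameters that OAuth 2 defines for this response.

```python
from urllib.parse import urlsplit, parse_qs


def extract_authorization_code(callback_url, expected_state=None):
    """Return the authorization code from the callback URL, or raise ValueError."""
    query = parse_qs(urlsplit(callback_url).query)
    # The AS reports a denied or failed request with an "error" parameter.
    if "error" in query:
        raise ValueError("authorization failed: " + query["error"][0])
    # If a state value was sent with the original request, it must come back unchanged.
    if expected_state is not None and query.get("state", [None])[0] != expected_state:
        raise ValueError("state mismatch")
    codes = query.get("code")
    if not codes:
        raise ValueError("no authorization code in callback")
    return codes[0]


code = extract_authorization_code("https://awesomeapplication.com/callback?code=abc123")
print(code)
```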

Instead, the authorization code is passed down to the backend, the client. And here we can store our secret away from snoops.

E. Getting the access token

Finally, the client (backend) can use the authorization code to request an access token from the AS. The client will make the request using the following URI:

https://oauthserver/token?client_id=awesomebackend&client_secret=123&grant_type=authorization_code&code=

A thing to notice is that the client_id is different in the backend compared to the frontend. It is not a requirement but will make the flow more secure.

The response from the AS is a JSON string with an access_token and optionally, a refresh_token.
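The URI above shows the parameters in a query string for readability. The OAuth 2 specification (RFC 6749) has the client send them form-encoded in the body of a POST request, and it also requires the redirect_uri again if one was sent in step A. The sketch below builds such a request body and reads the JSON reply; the code, token values and sample response are invented for illustration, and no network call is made.

```python
import json
from urllib.parse import urlencode, parse_qs


def build_token_request(code, client_id, client_secret, redirect_uri):
    """Return (headers, body) for a POST to the token endpoint of the AS."""
    body = urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": redirect_uri,
        "client_id": client_id,
        "client_secret": client_secret,
    })
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return headers, body


def parse_token_response(raw_json):
    """Return (access_token, refresh_token); the refresh token may be None."""
    data = json.loads(raw_json)
    return data["access_token"], data.get("refresh_token")


headers, body = build_token_request(
    code="abc123",                     # hypothetical code from the callback
    client_id="awesomebackend",
    client_secret="123",
    redirect_uri="https://awesomeapplication.com/callback",
)

# Hypothetical reply from the AS.
sample_response = '{"access_token": "at-1", "token_type": "Bearer", "expires_in": 600, "refresh_token": "rt-1"}'
access_token, refresh_token = parse_token_response(sample_response)
print(access_token, refresh_token)
```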

Using the tokens

The interesting part is, of course, calling the APIs we need, not the authorization itself. Strangely enough, when I read about OAuth, this part is usually omitted.

The flow ends with the client having an access token that is just a random string provided by the AS. It does not have any meaning by itself.

It is used when requesting resources from the RS. The token is added as a Bearer token header on each request like this:

Authorization: Bearer

When the API receives the access token, it needs to validate it. This is done using a process called introspection, which just means that the RS calls the AS and asks if the token is valid. If the token is not valid, the response is just the JSON string { "active": false }

But if it is active, the response is a JSON string showing the client_id, scope, and expiration time. Enough information for the API to validate the request and grant access.
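A sketch of how the RS side might use these pieces: take the token from the Authorization header, then decide on the request from the introspection reply. The call to the AS is left out; the dictionaries below stand in for its JSON replies, and the function names are made up for this example.

```python
import time


def bearer_token(authorization_header):
    """Extract the token from an 'Authorization: Bearer ...' header value."""
    scheme, _, token = authorization_header.partition(" ")
    if scheme.lower() != "bearer" or not token.strip():
        return None
    return token.strip()


def is_request_allowed(introspection, required_scope, now=None):
    """Decide whether a request may proceed, given the introspection reply from the AS."""
    now = time.time() if now is None else now
    # Only a real boolean true counts as active.
    if introspection.get("active") is not True:
        return False
    # Reject tokens whose expiration time has passed.
    if "exp" in introspection and introspection["exp"] <= now:
        return False
    # The scope field is a space-separated list of granted scopes.
    granted = introspection.get("scope", "").split()
    return required_scope in granted


inactive = {"active": False}
active = {"active": True, "client_id": "awesomebackend", "scope": "read", "exp": 2000}

print(is_request_allowed(active, "read", now=1000))
print(is_request_allowed(active, "write", now=1000))
```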

What does a real OAuth session look like

Phew, that was quite a write-up. It seems very complicated, but let us try to break it down into the actual HTTP calls.

For this demo, I have used Google's OAuth server. It requires a project created in their API console. The flow is also described here.

The first request to Google looks like this:

client_id is obtained when the project is created
redirect_uri must match the URI created in the project
scope must be a valid scope from Google's list of scopes
response_type is set to code since it is a code flow we want.

Since the user is not authenticated the first window that is shown is the login prompt.

When I sign in Google asks me if I want to allow the application to access the requested scope.

When I accept that, the browser is redirected to the URL

http://localhost:54000/?code=4/gABji6OhsRkhZyvey8TX_BMrQBV9mWkFYGZKm9jYf06u8BE-Vhxs9f0dhpQqH3aZq9ySCBzmz9B2p7QF4blvsdE&scope=https://www.googleapis.com/auth/drive.metadata.readonly

The URL is the redirect_uri defined above with the authorization code appended.

The authorization code is used to request the access_token like this:

The endpoint returns an access_token that we can use to query Google's APIs. We cannot introspect the token directly; that is done inside Google's APIs when we make a request to them.

Access token and refresh token

An access token is usually short-lived, for example 5–10 minutes. When it expires, the refresh token, which is much longer-lived, can be used to request a new access token from the AS. It is done through the /token endpoint on the AS.

When a refresh happens, a new access token and a new refresh token are issued, replacing the old tokens.
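A small sketch of the client side of a refresh, under the assumption that the client remembers when the access token was issued and how long it lives. The grant_type value refresh_token comes from the OAuth 2 specification; the helper names, the leeway value and the token values are invented for this example.

```python
from urllib.parse import urlencode, parse_qs


def needs_refresh(issued_at, expires_in, now, leeway=30):
    """True when the access token has expired or will expire within `leeway` seconds."""
    return now >= issued_at + expires_in - leeway


def build_refresh_request(refresh_token, client_id, client_secret):
    """Form-encoded body for a POST to the /token endpoint of the AS."""
    return urlencode({
        "grant_type": "refresh_token",
        "refresh_token": refresh_token,
        "client_id": client_id,
        "client_secret": client_secret,
    })


# A token issued at t=0 that lives for 600 seconds.
print(needs_refresh(issued_at=0, expires_in=600, now=100))
print(needs_refresh(issued_at=0, expires_in=600, now=580))

refresh_body = build_refresh_request("rt-1", "awesomebackend", "123")
print(refresh_body)
```

Refreshing slightly before the expiry time avoids sending a request with a token that expires while it is in flight.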

Revoking rights

If we revoke an access token, that particular token will be invalidated, and all requests with that token will be denied.

But if the client still has a refresh token, it can be used to request a new access token without any problem.

But if the refresh token is invalidated, all the access tokens associated with the refresh token are invalidated too, which is usually what we want.

OAuth provides an endpoint to do revoking, allowing the application to have a button where the user can withdraw access. But usually, I think this is done by the user directly in the AS which should provide a UI for it.
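The difference between revoking an access token and revoking a refresh token can be shown with a toy in-memory model of the bookkeeping in the AS. This is only an illustration of the behavior described in this section, not how any real authorization server is implemented, and for simplicity a refresh here keeps the same refresh token instead of issuing a new one.

```python
class TokenStore:
    """Toy model of which tokens an authorization server considers valid."""

    def __init__(self):
        self._next_id = 0
        self.refresh_tokens = set()
        self.access_tokens = {}  # access token -> refresh token it was issued from

    def _new_token(self, prefix):
        self._next_id += 1
        return f"{prefix}-{self._next_id}"

    def issue(self):
        """Issue a fresh (access token, refresh token) pair."""
        refresh = self._new_token("rt")
        self.refresh_tokens.add(refresh)
        return self.refresh(refresh), refresh

    def refresh(self, refresh_token):
        """Issue a new access token for a still-valid refresh token."""
        if refresh_token not in self.refresh_tokens:
            raise PermissionError("refresh token is revoked or unknown")
        access = self._new_token("at")
        self.access_tokens[access] = refresh_token
        return access

    def revoke_access(self, access_token):
        self.access_tokens.pop(access_token, None)

    def revoke_refresh(self, refresh_token):
        # Revoking the refresh token also drops every access token issued from it.
        self.refresh_tokens.discard(refresh_token)
        self.access_tokens = {
            a: r for a, r in self.access_tokens.items() if r != refresh_token
        }

    def is_active(self, access_token):
        return access_token in self.access_tokens


store = TokenStore()
access, refresh = store.issue()

store.revoke_access(access)               # only this one access token is gone
replacement = store.refresh(refresh)      # the client can simply get a new one
active_before = store.is_active(replacement)

store.revoke_refresh(refresh)             # now the delegation itself is withdrawn
active_after = store.is_active(replacement)
print(active_before, active_after)
```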

OpenID Connect

OAuth 2 only delegates access, but identification/authentication is needed as well. To support this use case, OpenID Connect builds on top of OAuth 2.

When using OAuth 2, all tokens are just random strings which do not tell us anything about the user. Information about the user is only given at the very end, when the resource API asks the AS if the token is valid.

In a modern web application, we would like to at least show a “Welcome xxx” to the user, so the experience is customized. But since the token is just a random string, it will be difficult to support this use case.

If the AS supports OpenID Connect, the scope openid can be added to the initial request, which triggers a new token to be issued: the id_token.

JSON Web Tokens (JWT)

An id_token is a signed, base64url-encoded string. When decoded, it contains the following information:

{
  "sub"       : "alice",
  "iss"       : "https://openid.c2id.com",
  "aud"       : "client-12345",
  "nonce"     : "n-0S6_WzA2Mj",
  "auth_time" : 1311280969,
  "acr"       : "c2id.loa.hisec",
  "iat"       : 1311280970,
  "exp"       : 1311281970
}

The token is signed by the AS, and each client can use the public key of the AS to verify that the token is valid. This makes it unnecessary to introspect the token to get the information, decreasing the load on the AS.
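To show what "decoding" means here, the sketch below splits a JWT into its three dot-separated parts and reads the claims from the middle one. It deliberately does not verify the signature, so it is only suitable for inspecting a token; a real client must check the signature against the public key of the AS, typically with a JWT library. The demo token is built in the example itself and carries a placeholder instead of a real signature.

```python
import base64
import json


def b64url_decode(segment):
    """Decode a base64url segment, restoring the padding that JWTs strip."""
    padding = "=" * (-len(segment) % 4)
    return base64.urlsafe_b64decode(segment + padding)


def b64url_encode(raw):
    """Encode bytes as unpadded base64url, the form used inside JWTs."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def read_unverified_claims(id_token):
    """Return the claims of a JWT WITHOUT checking its signature."""
    _header_b64, payload_b64, _signature_b64 = id_token.split(".")
    return json.loads(b64url_decode(payload_b64))


# Build a demo token from some of the claims shown above.
header = b64url_encode(json.dumps({"alg": "RS256", "typ": "JWT"}).encode())
payload = b64url_encode(json.dumps({
    "sub": "alice",
    "iss": "https://openid.c2id.com",
    "aud": "client-12345",
    "exp": 1311281970,
}).encode())
demo_token = f"{header}.{payload}.signature-placeholder"

claims = read_unverified_claims(demo_token)
print(claims["sub"], claims["exp"])
```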

The id_token is only part of the response when the user is involved, never during a refresh request.

Each OpenID Connect server must answer on the endpoint /.well-known/openid-configuration, as Google does. The JSON response contains information about all the OAuth 2 endpoints needed for the flows and where to get the public keys needed to validate the id_tokens.
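A sketch of how a client might use that document. The field names authorization_endpoint, token_endpoint and jwks_uri are defined by the OpenID Connect Discovery specification; the issuer and URLs in the sample document are hypothetical, and fetching the document over HTTP is left out.

```python
def discovery_url(issuer):
    """Location of the configuration document for a given issuer."""
    return issuer.rstrip("/") + "/.well-known/openid-configuration"


def pick_endpoints(document):
    """Extract the endpoints a client needs for the code flow and token validation."""
    wanted = ("authorization_endpoint", "token_endpoint", "jwks_uri")
    missing = [name for name in wanted if name not in document]
    if missing:
        raise KeyError("discovery document lacks: " + ", ".join(missing))
    return {name: document[name] for name in wanted}


# Hypothetical document, shaped like the JSON an OpenID Connect server returns.
sample_document = {
    "issuer": "https://as.example.com",
    "authorization_endpoint": "https://as.example.com/authorize",
    "token_endpoint": "https://as.example.com/token",
    "jwks_uri": "https://as.example.com/jwks",
}

print(discovery_url("https://as.example.com/"))
endpoints = pick_endpoints(sample_document)
print(endpoints["token_endpoint"])
```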

What to be aware of in a real setup

When a setup grows with many different APIs, we get many introspection requests to the AS. The number of queries can overwhelm the AS and cause it to become a bottleneck.

One way to avoid this is to implement a cache as a reverse proxy in front of the APIs. This reverse proxy introspects each token only once and stores the reply as a JWT. This JWT can then be sent instead of the access_token to the APIs, and with this setup, each API does not need to query the AS.
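The caching part of that idea can be sketched as follows. This simplified version only remembers the introspection reply for a limited time; it does not convert the reply to a JWT, and the introspect function is a stand-in for the real call to the AS. The class name and the 60-second lifetime are choices made for this example.

```python
import time


class IntrospectionCache:
    """Remember introspection replies so each token is looked up once per TTL."""

    def __init__(self, introspect, ttl_seconds=60, clock=time.monotonic):
        self._introspect = introspect   # function that asks the AS about a token
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}              # token -> (expires_at, reply)

    def lookup(self, token):
        now = self._clock()
        cached = self._entries.get(token)
        if cached is not None and cached[0] > now:
            return cached[1]
        reply = self._introspect(token)
        self._entries[token] = (now + self._ttl, reply)
        return reply


calls = []


def fake_introspect(token):
    """Stand-in for the AS: records each call and accepts one known token."""
    calls.append(token)
    return {"active": token == "good-token", "scope": "read"}


fake_now = [0.0]  # controllable clock so the example does not have to sleep
cache = IntrospectionCache(fake_introspect, ttl_seconds=60, clock=lambda: fake_now[0])

first = cache.lookup("good-token")
second = cache.lookup("good-token")
calls_within_ttl = len(calls)

fake_now[0] = 61.0
cache.lookup("good-token")
calls_after_ttl = len(calls)
print(calls_within_ttl, calls_after_ttl)
```

The lifetime of a cache entry is a trade-off: a longer one takes more load off the AS, but a revoked token keeps working until its entry expires.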

This article has only just scratched the surface of OAuth 2 / OpenID Connect, but I hope it gives an overview of how the technology works.

Originally published at Datadriven-investment.com.

Scalable baseline website setup with authentication and VueJS, Amazon S3 and .net core 2.1

Frederik Banke — Sat, 08 Sep 2018 07:33:22 GMT

For my projects, I need a generic website setup that I can reuse for multiple projects.

I want to try out the following setup: a frontend built in Vue served as static files from Amazon S3, and a backend built with .net core 2.1 as a REST API presented with Swagger. Finally, Google's Firebase authentication handles the login requirements.

Since I need a baseline platform for multiple projects, it needs to be generic enough to allow me to reuse the setup. Most of my projects need a similar setup with a frontend exposed to anonymous users and a backend dashboard which requires authentication.

In this article, I am going to cover how I set up the Vue frontend. In later articles, I will cover the authentication and the backend.

If you do not know VueJS or one of its friends, you are missing out. It is a revolution for building reactive frontend websites. It is part of the suite of modern JavaScript frontend frameworks like Angular and React; you can find more information and a comparison of them in this article by Jens Neshaus. All three frameworks are built to support reactive websites where a state in JavaScript is synchronized with the DOM.

I do not have any specific reason for choosing VueJS except that I use it as part of my work, so it creates a synergy effect. And it seems to have the easiest learning curve of the bunch.

The cool thing about Vue is that it compiles into static JavaScript files that can be executed in the browser. It means that nothing special is needed to run the frontend; it can be served as 100% static files.

It has the significant effect that it can easily be replicated to edge nodes all around the world to make sure that users around the globe can load the site fast. Getting data from the backend is another story for a later article :-)

Finding a VueJS dashboard template

For most of my projects, I foresee that some kind of administration interface is needed, which means a login page with some type of dashboard behind it.

My design skills are not that good, so I prefer to find a template to build from. Multiple templates exist, but I found Vue Paper Dashboard to strike the right balance between aesthetics and functionality. CoreUI is another alternative.

Vue Paper Dashboard

Setting up the build process

If you clone the git repository, you get a complete setup, including a build script. So we only need to have npm installed to build everything. Run the commands

npm install
npm run build

After they complete, the build files will be available in the dist folder. Upload this folder to a web server, and we have our own version of the dashboard running. Easy peasy.

Development with Vue

Besides the build command, the template also supports npm run dev, which runs a development web server with hot-reload. It means that when the files are changed, the website will automatically reload the changes, making development super neat.

Unit tests

One of the shortcomings of the paper dashboard template is that it does not have a setup ready for unit tests and linting. Vue is just javascript, so any test runner should work. I went with Jest to support the testing. It is one of the most complete testing frameworks for javascript.

Combined with vue-test-utils we can write tests like this:

import { shallowMount, createLocalVue } from '@vue/test-utils'
import Vuex from 'vuex'
import ForgotPassword from '../../src/pages/ForgotPassword.vue'

const localVue = createLocalVue()
localVue.use(Vuex)

describe('ForgotPassword', () => {
  let actions
  let store
  let getters
  let mocks

  beforeEach(() => {
    mocks = {
      $t: (msg) => { return msg }
    }
    actions = {
      forgotPassword: jest.fn()
    }
    getters = {
      error: jest.fn()
    }

    store = new Vuex.Store({
      state: { error: undefined, loading: false },
      actions,
      getters
    })
  })

  it('sets the correct default data', () => {
    expect(typeof ForgotPassword.data).toBe('function')
    const defaultData = ForgotPassword.data()

    expect(defaultData.email).toBe('')
  })

  it('triggers forgotPassword action on submit button click with data', () => {
    const wrapper = shallowMount(ForgotPassword, { localVue, store, mocks, stubs: ['router-link'] })

    wrapper.setData({ email: 'e@mail.com' })
    wrapper.find('#ForgotPassword').trigger('submit')
    
    expect(actions.forgotPassword.mock.calls).toHaveLength(1)
    expect(actions.forgotPassword.mock.calls[0][1]).toEqual({email: 'e@mail.com'})
  })
})

It is a test of a forgot password Vue component. It depends on Vuex and Vue-i18n which we need to mock out. But as you can see in the bottom test it is quite easy to test that filling the email field and clicking the button triggers the call to Vuex.

It did take me some time to get the setup quite right. Vue components are not plain JavaScript, so Jest needs to know how to load them. But luckily it can all be set up in the package.json file.

{
  "name": "vue-paper-dashboard",
  "version": "1.0.0",
  "private": true,
  "scripts": {
    "build": "vue-cli-service build",
    "e2e": "node test/e2e/runner.js",
    "lint": "vue-cli-service lint",
    "lint-fix": "vue-cli-service lint --fix",
    "dev": "vue-cli-service serve --open",
    "test": "jest"
  },
  "dependencies": {
    "bootstrap": "^4.0.0",
    "chartist": "^0.11.0",
    "es6-promise": "^4.2.4",
    "firebase": "^5.3.1",
    "vue": "^2.5.13",
    "vue-clickaway": "^2.1.0",
    "vue-notifyjs": "^0.3.0",
    "vue-router": "^3.0.1",
    "vuex": "^3.0.1",
    "axios": "0.18.0",
    "vue-i18n": "8.0.0",
    "vue-loader": "15.4.1",
    "@kazupon/vue-i18n-loader": "0.3.0",
    "moment": "2.22.2"
  },
  "devDependencies": {
    "babel-loader": "7.1.5",
    "babel-plugin-syntax-dynamic-import": "6.18.0",
    "@babel/core": "7.0.0-rc.3",
    "@vue/cli-plugin-babel": "^3.0.0-beta.9",
    "@vue/cli-plugin-eslint": "^3.0.0-beta.9",
    "@vue/cli-service": "^3.0.0-beta.9",
    "@vue/eslint-config-prettier": "^3.0.0-beta.9",
    "babel-jest": "^23.4.2",
    "babel-preset-env": "1.7.0",
    "jest": "^23.5.0",
    "jest-vue-preprocessor": "^1.4.0",
    "jsdom": "^12.0.0",
    "node-sass": "^4.8.3",
    "sass-loader": "^6.0.7",
    "vue-jest": "2.6.0",
    "vue-server-renderer": "^2.5.17",
    "@vue/test-utils": "1.0.0-beta.24"
  },
  "description": "A sample admin dashboard based on paper dashboard UI template",
  "author": "cristian.jora ",
  "engines": {
    "node": ">= 8.1.4",
    "npm": ">= 5.0.0"
  },
  "browserslist": [
    "> 1%",
    "last 2 versions",
    "not ie <= 8"
  ],
  "jest": {
    "moduleFileExtensions": [
      "js",
      "json",
      "vue"
    ],
    "transform": {
      ".*\\.(vue)$": "vue-jest",
      "^.+\\.js$": "<rootDir>/node_modules/babel-jest"
    }
  }
}

The important part is at the bottom, where Jest is told how to parse .vue files using the vue-jest package.

Using Amazon S3 as a web host for static files

The cool feature of this setup is that the complete frontend is compiled to static files. It means that we can use any web server to host them, with no special feature requirements.

It allows us to use Amazon S3, which is easy to set up; combined with Amazon CloudFront, the files can be served from edge locations all over the world.

When you have created a bucket, just set public read permissions and select the feature “Static website hosting.”

Putting everything together

I am a big fan of automation, so to deliver code changes I want a CI/CD setup. My tool of choice is Bitbucket Pipelines.

I want to have the following happen.

When I commit and push to the master branch, a build should start
The build starts by compiling the source and running all the defined tests
If any test fails the build fails
If all succeeds the pipeline pauses
When I click a “deploy” button, the new version is pushed to S3 without any additional interaction.

image: node:8.11.3

pipelines:
  default:
    - step:
        caches:
          - node
        script: 
          - echo "VUE_APP_FIREBASE_APIKEY="$VUE_APP_FIREBASE_APIKEY >> .env.local
          - echo "VUE_APP_FIREBASE_AUTHDOMAIN="$VUE_APP_FIREBASE_AUTHDOMAIN >> .env.local
          - echo "VUE_APP_FIREBASE_DATABASEURL="$VUE_APP_FIREBASE_DATABASEURL >> .env.local
          - echo "VUE_APP_FIREBASE_PROJECTID="$VUE_APP_FIREBASE_PROJECTID >> .env.local
          - echo "VUE_APP_FIREBASE_STORAGEBUCKET="$VUE_APP_FIREBASE_STORAGEBUCKET >> .env.local
          - echo "VUE_APP_FIREBASE_SENDERID="$VUE_APP_FIREBASE_SENDERID >> .env.local
          - echo "VUE_APP_API_BASE_URL="$VUE_APP_API_BASE_URL >> .env.local
          - npm install
          - npm test
          - npm run build
        artifacts:
          - dist/**
    - step:
        # set AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY as environment variables
        name: Deploy to S3
        deployment: test   # set to test, staging or production
        trigger: manual 
        image: atlassian/pipelines-awscli
        script:
          - aws s3 sync --delete dist s3://xxx.xxx.xx

The bitbucket-pipelines.yml file is quite simple: we tell the pipeline to load a Docker image with node installed. Then we write a number of environment variables to a file. They are compiled into the application.

Then npm install installs all the requirements needed to build the project. Then npm test runs the unit tests. And finally, the npm run build creates the minified and compiled javascript files.

Then the files in the dist folder are stored as an artifact. The manual step reveals a button in the Bitbucket interface that can be clicked to run the final step.

If any of the steps fail, the pipeline will stop. So there is no danger that we could deploy code that does not pass the tests.

You can find the complete repository here. It is part of a larger project, so it is a snapshot in time, but feel free to ask questions on any part of it.

Originally published at Datadriven-investment.com.

Email forward for addresses where you do not want an account

Frederik Banke — Sun, 05 Aug 2018 08:26:21 GMT

If you, like me, have too many domains for different projects, email accounts become a problem. With any domain comes the responsibility to allow people to get in contact with you. Often I use a webmaster@domain or info@domain email address to enable visitors to contact me.

But at $4/month for an Amazon WorkMail account and $5/month for a G Suite user, maintaining many email accounts gets pricey. It is also a hassle to monitor all the accounts.

Often you simply want emails forwarded to a single email account, where you can set up filters and check all mail in one place. For example, I want info@datadriven-investment.com forwarded to frederik@patch.dk, which is my primary email account, with minimal setup.

I need a service that lets me quickly and cheaply set up email forwarding to my email account hosted at Google.

Enter forwardemail.net! It is a free and simple system that provides an email forwarding service for any custom domain. It requires a bit of knowledge about DNS, but if you have that, forwarding can be set up in 10 minutes.

You set it up by adding forwardemail.net's MX servers to the domain's DNS records, along with a TXT record that says where emails should be sent.
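As a rough illustration, the records could be generated like the sketch below. The MX host names (mx1/mx2.forwardemail.net), the priorities, the TTL, and the forward-email= TXT format are assumptions on my part, not something stated in this article, so check the forwardemail.net documentation for the current values before adding them to your DNS.

```python
from typing import List


def forwarding_records(domain: str, destination: str) -> List[str]:
    """Return zone-file style lines that forward all mail for `domain`.

    The host names and the TXT format are assumed, see the note above.
    """
    return [
        f"{domain}. 3600 IN MX 10 mx1.forwardemail.net.",
        f"{domain}. 3600 IN MX 20 mx2.forwardemail.net.",
        f'{domain}. 3600 IN TXT "forward-email={destination}"',
    ]


if __name__ == "__main__":
    for line in forwarding_records("datadriven-investment.com", "frederik@patch.dk"):
        print(line)
```

The function only formats text; the records still have to be entered at your DNS provider.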

Shout out to Niftylettuce for creating such a seamless service.

Originally published at Datadriven-investment.com.

Amazon S3 backup strategy

Frederik Banke — Wed, 01 Aug 2018 13:00:40 GMT

Real men do not take backups, but they cry a lot

But I would rather not cry too much :-) so I try to have a good backup solution. After all, I spend an awful lot of time creating data, and it would hurt a lot if it were lost by accident, especially since a backup is easy and cheap to set up.

Amazon S3 is my go-to solution for cloud data storage. It is designed never to lose data and to be resilient to disasters. On top of that, it is cheap.

In this article, I dive into what you need to know about Amazon S3 before you start using it for your backup solution.

S3 lets you store a practically unlimited amount of data, with individual files of up to 5TB. That should be enough for anyone; even Netflix uses S3 for its storage needs. It is a well-designed storage solution that scales to many use cases.

You can interface with S3 in many ways; they are described here. The most essential for me is the Linux console utility s3cmd, which lets me upload any file or directory from a Linux machine to S3. That is what I use for backup.

S3 exposes a REST API, so you can even build your own tools on top of it.

In S3 you work with a “bucket”, which is a name you create to reference the place where the data is stored. A bucket can be created using the AWS console.

The create wizard looks like below. You need to enter a globally unique name as the bucket name. You can leave the rest of the settings at their defaults.

Each bucket is placed in the region you select. In the screenshot, EU (Ireland) is selected. Access is handled by the IAM system in Amazon, so you need to create a user to upload and download files. More on this later.

Set up backup

I need to back up the upload directory from my websites and a dump of the database. This is handled by two Docker containers that I built, one for the file backup and another for the database backup. Both use the s3cmd tool to copy the files.

To have the s3cmd tool copy files, you need your access key and secret key. You can find them using the guide here.

Now you can copy data into your S3 bucket with the s3cmd tool!
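To make the copy step concrete, here is a minimal sketch of how a backup job could assemble an s3cmd sync call. The directory, the bucket name, and the date-based prefix are hypothetical and not taken from my containers; the sketch also assumes s3cmd already knows your keys, for example from the config file written by s3cmd --configure.

```python
import datetime
import shlex
from typing import List


def build_backup_command(local_dir: str, bucket: str, day: datetime.date) -> List[str]:
    """Build an `s3cmd sync` command that uploads local_dir under a dated prefix."""
    prefix = day.strftime("%Y-%m-%d")
    destination = f"s3://{bucket}/{prefix}/"
    return ["s3cmd", "sync", local_dir, destination]


if __name__ == "__main__":
    # Hypothetical directory and bucket name, for illustration only.
    cmd = build_backup_command("/data/uploads/", "my-backup-bucket", datetime.date.today())
    print(shlex.join(cmd))
```

The sketch only prints the command. Passing the list to subprocess.run(cmd, check=True) from a cron job would perform the upload.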

S3 Storage classes

Each file in S3 is placed in a specific storage class. You can see the storage class next to each file.

In the Standard storage class, each file is replicated to three different Availability Zones inside the selected region. It also supports low-latency access.

There are other storage classes, as explained here. We hopefully do not need low-latency read access to our backup, so we might save some money by selecting the Standard-Infrequent Access storage class instead. Its storage price is around 50% lower than Standard, but uploading and downloading data is more expensive, so you need to calculate whether the change makes sense. I think it mostly makes sense if you store backups for a long time.
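The calculation can be sketched in a few lines. The prices below are placeholders that I picked to match the "around 50% cheaper" rule of thumb; they are not Amazon's price list, so replace them with the current prices for your region.

```python
from typing import Dict

# Placeholder prices in USD per GB, NOT real AWS prices.
STANDARD: Dict[str, float] = {"storage_price": 0.02, "retrieval_price": 0.0}
INFREQUENT: Dict[str, float] = {"storage_price": 0.01, "retrieval_price": 0.01}


def monthly_cost(stored_gb: float, retrieved_gb: float,
                 storage_price: float, retrieval_price: float) -> float:
    """Monthly cost of keeping stored_gb and reading back retrieved_gb."""
    return stored_gb * storage_price + retrieved_gb * retrieval_price


def break_even_retrieval_gb(stored_gb: float) -> float:
    """Monthly retrieval volume at which both storage classes cost the same."""
    storage_saving = stored_gb * (STANDARD["storage_price"] - INFREQUENT["storage_price"])
    extra_per_gb = INFREQUENT["retrieval_price"] - STANDARD["retrieval_price"]
    return storage_saving / extra_per_gb


if __name__ == "__main__":
    print(monthly_cost(500, 50, **STANDARD))
    print(monthly_cost(500, 50, **INFREQUENT))
    print(break_even_retrieval_gb(500))
```

As long as you read back less than the break-even volume each month, the infrequent access class comes out cheaper under these assumptions. Request fees are left out to keep the sketch short.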

S3 supports lifecycle policies for stored files, which let you manage the files stored in S3 with a set of rules. For example, a rule can transition files from one storage class to another.

You can also add rules to delete files automatically, and much more. This makes it easy to set up something similar to:

Back up the web files every day (this could be handled by s3cmd and cron)
Delete all backups older than 30 days
Keep one backup every week for one year
Automatically change the storage class on the weekly backup to Glacier

I retain the last 30 days and delete everything older than that. So my setup is quite simple.
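A policy like the list above could be expressed roughly as follows. This is a sketch, not my actual policy: it assumes the daily and weekly backups are uploaded under two hypothetical prefixes, daily/ and weekly/, and it uses the rule structure of the S3 lifecycle configuration API.

```python
import json
from typing import Any, Dict


def backup_lifecycle_rules(daily_prefix: str = "daily/",
                           weekly_prefix: str = "weekly/") -> Dict[str, Any]:
    """Lifecycle configuration: expire dailies, archive and later expire weeklies."""
    return {
        "Rules": [
            {
                "ID": "expire-daily-backups",
                "Filter": {"Prefix": daily_prefix},
                "Status": "Enabled",
                "Expiration": {"Days": 30},  # delete daily backups after 30 days
            },
            {
                "ID": "archive-weekly-backups",
                "Filter": {"Prefix": weekly_prefix},
                "Status": "Enabled",
                # move weekly backups to Glacier one day after upload
                "Transitions": [{"Days": 1, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},  # keep weekly backups for one year
            },
        ]
    }


if __name__ == "__main__":
    print(json.dumps(backup_lifecycle_rules(), indent=2))
```

The dictionary can be pasted into the console's JSON view or, if you use boto3, passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration.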

More security with geo-replication

Amazon has designed its services to be resilient to disruptions. Each region can keep running even if other regions are unavailable, and Amazon's track record with regard to downtime is quite good. So I expect the normal replication inside a region to be very safe. But we can easily set up S3 to replicate all our files to a bucket in another region, which provides maximum safety for our files.

Setting up replication is easy. You need to create a new S3 bucket in another region. Just a normal bucket.

Then you can set up replication on the original bucket and select the newly created bucket as the destination.

Notice that existing objects are not replicated automatically, only newly created objects. So if you need your old backups replicated, you need to copy them manually.

All user actions are replicated, such as new files and deletes. But lifecycle policy actions are not replicated. So if you use lifecycle policies to change storage classes or delete expired files, you need to create the same policies in the remote bucket yourself, because they are not created automatically.

That is all there is to it; now you have a geo-replicated S3 bucket.

Final thoughts on S3

I find it very cool to have cloud storage with practically unlimited capacity. Because it is so easy to interface with, it is useful not only for backup but for many other purposes.

I hope to investigate more use cases for S3 in the future.

Originally published at Datadriven-investment.com.

Docker setup monitoring

Frederik Banke — Thu, 19 Jul 2018 15:48:28 GMT

We do not log in to our servers every day to check the resource usage. Just like with uptime monitoring, we need a system that helps us check that everything is within reasonable limits, so we can scale the servers if required and detect any potential problem before it becomes a real problem.

In this article, I will explore how to set up monitoring using Docker, influxdb, grafana, cAdvisor, and fluentd.

I used this excellent article by Hanzel Jesheen as the starting point for my monitoring setup. It explains how to set up influxdb, grafana, and cAdvisor. If you follow that guide, you will have a great starting point.

Monitoring Docker containers

We need to know how much CPU, memory, and disk each container and Docker host consumes. That information is collected by cAdvisor, a project by Google:

cAdvisor (Container Advisor) provides container users an understanding of the resource usage and performance characteristics of their running containers.

It runs as a container on each Docker host, where it gathers the data and pushes it into influxdb.

Setting up cAdvisor is easy. I just added the following to my docker-compose.yml file:

cadvisor:
    image: google/cadvisor
    hostname: '{{.Node.ID}}'
    command: -logtostderr -docker_only -storage_driver=influxdb -storage_driver_db=cadvisor -storage_driver_host=influx:8086
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:rw
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    deploy:
      mode: global

It has “deploy mode” set to “global” to make sure it runs on all our Docker hosts. It also has to know where the influxdb instance is running; this is provided by the parameter -storage_driver_host=influx:8086

Then it needs access to some Docker-specific files so it can collect stats, and that is all that is required. When running, each cAdvisor instance consumes ~40MB of memory on my setup.

Influxdb

Influxdb is a time series database; it is built with only this purpose in mind. So it supports some neat features needed when working with time series. For example, it has retention policy support, meaning that if you do not need old data, you can set an expiry date on it.

Influxdb uses an SQL-like query language to get data. If you have had any exposure to SQL, you should find it quite easy to work with.

An example of the query I use to show the average response times for Nginx:

SELECT mean("request_time") FROM "autogen"."httpd.loadbalancer" WHERE $timeFilter GROUP BY time($__interval) fill(null)

Very similar to SQL.

Setting up influxdb is also just an entry into the docker-compose.yml file:

influx:
    image: influxdb
    volumes:
      - influx:/var/lib/influxdb
    deploy:
      placement:
        constraints:
          - node.role != manager
      resources:
        limits:
          memory: 350M

Influxdb is memory hungry, so I added a memory limit of 350MB which has been working out fine for me. It usually uses around 250MB, so there is room to spare if it needs more memory.

The data collected in influxdb is not super important to me, so I have not added any backup routine for it. If it crashes and I lose the data, it is not that big a deal.

The last piece of the puzzle for monitoring the hardware is grafana which will do the actual visualizations.

It is similarly easy to set up. Just add an entry into the docker-compose.yml file:

grafana:
    image: grafana/grafana
    ports:
      - 0.0.0.0:3000:3000
    volumes:
      - grafana:/var/lib/grafana
    deploy:
      placement:
        constraints:
          - node.role == manager

You can see in the article mentioned earlier how to set up grafana once it is deployed.

It will give a dashboard like this:

It shows RAM, disk, CPU, and network usage, and it allows us to see them per Docker host or per container. That is all I need to be able to monitor the health of the nodes.

Monitoring Nginx log files

Since the setup runs websites, it is essential to monitor the log data from Nginx. Luckily, Docker supports a logging system that allows us to centralize the log data. It has many different drivers; I am going to use fluentd because it is easy to configure and use.

With Docker, you can run the command docker logs to show the log data for a container. But that is not very useful when we want to analyze the data. The log drivers in Docker allow us to send the log text to a service instead, which lets us centralize the log information.

It can be set up using the docker-compose.yml file again.

loadbalancer:
.....
    logging:
      driver: fluentd
      options: 
        fluentd-address: 127.0.0.1:24224 # see info on fluentd container - mesh network will route to the correct host
        fluentd-async-connect: 1 # if the fluentd container is not running, just buffer the logs until it is available
        tag: httpd.loadbalancer

fluentd:
    image: 637345297332.dkr.ecr.eu-west-1.amazonaws.com/patch-fluentd:latest
    ports: # needs to be exposed for the logging driver to have access, it is firewalled to deny outside access
      - "24224:24224"
      - "24224:24224/udp"

A container running the fluentd service is needed, with port 24224 exposed. Then the load balancer's logging is configured to point to fluentd. Each log line that is sent to fluentd is tagged with “httpd.loadbalancer”, which allows us to configure inside fluentd what to do with the log lines from this service.

Notice that the logging system in Docker does not run inside the same network as the services themselves, so for the load balancer service to send logging statements to the fluentd service, the ports need to be mapped to allow external access. This is the reason that the load balancer points to 127.0.0.1. But the fluentd service and the load balancer service do not need to run on the same Docker host, because the routing mesh will make sure the data is sent to the correct container. Please use a firewall to make sure fluentd is not accessible from outside.

Fluentd needs the influxdb plugin to be able to send data to influxdb. This can be accomplished using this Dockerfile:

FROM fluent/fluentd:v0.14-onbuild
MAINTAINER Frederik Banke frederik@patch.dk

RUN apk add --update --virtual .build-deps \
        sudo build-base ruby-dev \
 && sudo gem install \
        fluent-plugin-influxdb \
        fluent-plugin-secure-forward \
 && sudo gem sources --clear-all \
 && apk del .build-deps \
 && rm -rf /var/cache/apk/* \
           /home/fluent/.gem/ruby/2.3.0/cache/*.gem

EXPOSE 24284

And finally, we need a config file to translate the log data from the load balancer. Because I use a different kind of log format than the standard Nginx format, it needs a special setup in fluentd to parse it. It is accomplished using regex inside fluentd.

As shown here:


<source>
  @type forward
</source>

<filter httpd.*>
  @type parser
  key_name log
  <parse>
    @type regexp
    expression /\[(?<logtime>[^\]]*)\] (?[^ ]*) - (?[^ ]*)- - (?[^ ]*)  to: (?[^ ]*): (?[^ ]*) (?[^ ]*) (?[^ ]*) upstream_response_time (?<upstream_response_time>[^ ]*) msec .*? request_time (?<request_time>[^ ]*) upstream_status: (?[^ ]*) status: (?[^ ]*) agent: "(?.*)"$/
    time_format %d/%b/%Y:%H:%M:%S %z
    time_key logtime
    types request_time:float,upstream_response_time:float
  </parse>
</filter>

<match *>
  @type stdout
</match>

<match *.*>
    @type influxdb
    dbname nginx
    host influx
    <buffer>
      flush_interval 10s
    </buffer>
</match>

The article by Doru Mihai about fluentd regex support was a great help.

The <source> element tells fluentd to accept all incoming data and forward it into the processing pipeline.

The <filter> element matches on tags, which means that it processes all log statements with a tag that starts with httpd. In our case, the tag name is “httpd.loadbalancer”.

The first <match> statement matches all statements tagged with any other tags and writes them to standard output in fluentd.

The second <match> statement processes all tags with the format *.* and writes them to influxdb.
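To show what the parser step does with a single log line, here is a small Python sketch of the same idea: a regular expression with named groups turns the line into a record, and the timing fields are converted to floats. The log line and the shortened pattern are a simplified, hypothetical version of my format, and Python writes named groups as (?P<name>...) where fluentd's Ruby regex uses (?<name>...).

```python
import re
from datetime import datetime
from typing import Any, Dict, Optional

# Simplified, hypothetical pattern: only the time, timing and status fields.
LOG_PATTERN = re.compile(
    r"\[(?P<logtime>[^\]]*)\] "
    r"upstream_response_time (?P<upstream_response_time>[^ ]*) msec .*? "
    r"request_time (?P<request_time>[^ ]*) "
    r"upstream_status: (?P<upstream_status>[^ ]*) "
    r"status: (?P<status>[^ ]*)$"
)


def parse_line(line: str) -> Optional[Dict[str, Any]]:
    """Return the parsed record, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record: Dict[str, Any] = match.groupdict()
    # Same conversions as time_format and types in the fluentd config.
    record["logtime"] = datetime.strptime(record["logtime"], "%d/%b/%Y:%H:%M:%S %z")
    record["request_time"] = float(record["request_time"])
    record["upstream_response_time"] = float(record["upstream_response_time"])
    return record


if __name__ == "__main__":
    sample = ("[19/Jul/2018:15:48:28 +0000] upstream_response_time 0.120 "
              "msec 1532015308.000 request_time 0.125 upstream_status: 200 status: 200")
    print(parse_line(sample))
```

Storing the timings as floats is what later makes it possible to average them, or to filter out requests slower than a threshold, in influxdb.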

With data available in influxdb we can create a dashboard in grafana to display the data.

I want to show the following information: a raw log with all requests, the HTTP response code, and timings from both the load balancer and the upstream servers. I also want a graph of the number of requests, divided by request type (GET/POST/HEAD), and a graph showing the average response time. Finally, I want a table collecting all “slow requests”, that is, any request that took more than 2 seconds to process.

This dashboard looks like this:

Conclusion

The setup is easy to extend. Log data from any Docker container can be sent to fluentd and processed into influxdb and grafana, which makes it easy to add any kind of graph to the dashboards later.

Originally published at Datadriven-investment.com.

AWS load balancing Docker hosts and pain with HTTPS

Frederik Banke — Sat, 30 Jun 2018 12:06:08 GMT

In my continuous effort to make my setup as redundant as possible, the next step is to add a load balancer. I ran into a few problems while setting it up, so I want to share my experience.

It is not possible to have the apex record of a domain point to a CNAME.
Moving the domain names of a service that runs HTTPS requires great care.

I added a network load balancer in front of my Docker hosts so they can fail over for each other. After that, I moved datadriven-investment.com to www.datadriven-investment.com because of the problem with the apex DNS record mentioned above.

On AWS there are two options for load balancing: application level or network level.

An application-level load balancer must do the SSL termination itself, as explained here. It can do so because it works at the HTTP level.

The other load balancing type is network level. It load-balances TCP traffic without regard to which application is running across the connection.

SSL termination is handled by my Nginx server, and I do not want to change that, so I need a network load balancer.

The setup currently looks like this:

Here a static IP (Elastic IP) is bound directly to one of the Docker hosts (the left Docker square in the diagram). This creates a single point of failure. If this Docker host stops working, no traffic will be processed, even if the other Docker host keeps working as expected.

A load balancer from AWS can fix this problem, so the architecture is transformed into this instead.

Here all traffic enters the AWS load balancer, which knows about both Docker hosts and distributes traffic to both of them. Docker spans a mesh network across all nodes in the same service stack. The effect is that no matter which Docker host gets the traffic, it is automatically routed to the correct container, even if the container is located on a different host.

This allows the Nginx load balancer to run on any of the nodes. If its Docker host goes down, it will automatically restart on the other Docker host and continue processing traffic.

Setting up AWS network load balancer

The load balancer is easy to set up using the wizard in the AWS console. It consists of two concepts: a target group and a listener.

A listener consists of a TCP port, for example, 80 for HTTP traffic and 443 for HTTPS traffic. Each listener sends traffic to a target group.

A target group consists of a list of instances that are available for accepting traffic.

That is the setup. When traffic hits the load balancer on a port it is listening on, it routes the traffic into the matching target group. From there the traffic is balanced between the available instances.

The load balancer also supports health checks on the instances, so traffic is only routed to instances that are up.
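The relationship between listeners, target groups, and health checks can be modeled in a few lines of Python. This is a toy model to illustrate the concepts, not how AWS implements it: the class names are made up, and picking a target by hashing the client address and port is a simplified stand-in for the flow hashing a network load balancer performs.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class TargetGroup:
    # instance id -> result of the latest health check
    targets: Dict[str, bool] = field(default_factory=dict)

    def healthy(self) -> List[str]:
        return sorted(name for name, ok in self.targets.items() if ok)


@dataclass
class LoadBalancer:
    # TCP port of the listener -> target group that receives its traffic
    listeners: Dict[int, TargetGroup] = field(default_factory=dict)

    def route(self, client: Tuple[str, int], port: int) -> str:
        group = self.listeners[port]  # KeyError: nothing listens on this port
        candidates = group.healthy()
        if not candidates:
            raise RuntimeError("no healthy targets")
        # The same connection always hashes to the same healthy target.
        key = f"{client[0]}:{client[1]}->{port}".encode()
        index = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") % len(candidates)
        return candidates[index]


if __name__ == "__main__":
    hosts = TargetGroup({"docker-host-1": True, "docker-host-2": True})
    lb = LoadBalancer({80: hosts, 443: hosts})
    print(lb.route(("203.0.113.7", 50000), 443))
    hosts.targets["docker-host-1"] = False  # health check fails
    print(lb.route(("203.0.113.7", 50000), 443))  # only docker-host-2 is left
```

Once a host fails its health check, it drops out of the candidate list and all traffic goes to the remaining host, which is exactly the failover behavior I am after.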

The wizard looks like this:

First, select whether the load balancer should be public to the internet or internal. In this setup, an internet-facing load balancer is needed.

The next step is to select which TCP ports to listen on: 80 for HTTP and 443 for HTTPS traffic.

Finally, we need to select an availability zone. In my case, the hosts are located in “eu-west-1c”.

Then we need to add the targets. We can select the EC2 instances that should handle the traffic.

That’s it! Now any traffic hitting the load balancer will automatically be balanced between the EC2 instances. I had no problems while setting it up.

Moving from a domain without www to one with www

To allow for better uptime, Amazon does not reserve a specific IP address for their load balancer. Instead, they provide a DNS name that we can point to using a CNAME record.

The problem is that this website was running on “datadriven-investment.com”, and in most cases it is not possible to use a CNAME record on the apex record of a domain, as described here.

It left me with two options.

Change the domain name to “www.datadriven-investment.com” and use a CNAME record.
Use AWS Route 53, which supports an ALIAS type on the apex record. It works like a CNAME, except that it is actually an A record that AWS updates if the target underneath changes.

I ended up implementing both. Changing to “www.datadriven-investment.com” keeps the setup vendor agnostic if I should choose to change hosting partner in the future.

Using an ALIAS record on the apex record allows backward compatibility. If the domain were just changed, any links pointing to “datadriven-investment.com” would break. But I would like to preserve the functionality, so any click on such a link is automatically redirected to “www.datadriven-investment.com”.

Moving domain on an HTTPS site

Most DNS services support an automatic redirect system which would allow us to redirect all requests from “datadriven-investment.com” to “www.datadriven-investment.com”. But unfortunately, this only works for old-school HTTP sites. As soon as there is encryption involved, we must maintain a valid certificate for the redirect to work, and that is often not supported by the DNS service.

So we must build and maintain it ourselves. I use Let’s Encrypt for my certificates; luckily it supports multiple domains in the same certificate. That makes it much easier to use.

docker run --rm \
  --name letsencrypt \
  -v "/data/storage/letsencrypt/etc/:/etc/letsencrypt" \
  -v "/data/storage/letsencrypt/lib:/var/lib/letsencrypt" \
  certbot/certbot certonly -n \
  -m "frederik@patch.dk" \
  --agree-tos \
  -d www.datadriven-investment.com -d datadriven-investment.com \
  --webroot --webroot-path /var/lib/letsencrypt/datadriven-investment.com/ \
  --expand

The difference between this command and the command in the previous article is that www.datadriven-investment.com is added along with the --expand parameter. It will create a single certificate for both domains.

The Nginx config looks like this. Not very different from the previous setup. Just a separate virtual host for datadriven-investment.com and www.datadriven-investment.com to create the automatic redirect. The first virtual host is for normal HTTP and answers on both domains. It allows Let’s Encrypt to renew the certificates.

upstream datadriven-investment-loadbalance {
    server http;
}

server {
    listen 8080;
    server_name www.datadriven-investment.com datadriven-investment.com;
	
	# Rule for legitimate ACME Challenge requests (like /.well-known/acme-challenge/xxxxxxxxx)
    location ^~ /.well-known/acme-challenge/ {
        # No HTTP authentication
        allow all;
    
        # Set correct content type. According to this:
        # https://community.letsencrypt.org/t/using-the-webroot-domain-verification-method/1445/29
        # Current specification requires "text/plain" or no content header at all.
        # It seems that "text/plain" is a safe option.
        default_type "text/plain";
    
        # Change document root: this path will be given to certbot as the 
        # `-w` param of the webroot plugin.
        root /var/lib/letsencrypt/datadriven-investment.com;
    }

	# Hide /acme-challenge subdirectory and return 404 on all requests.
    # It is somewhat more secure than letting Nginx return 403.
    # Ending slash is important!
    location = /.well-known/acme-challenge/ {
        return 404;
    }

	# redirect from http to https
    location / {
        return 301 https://www.datadriven-investment.com$request_uri;
    }
}

server {
  listen              443 ssl http2;
  server_name		  www.datadriven-investment.com;
  ssl_certificate     /etc/letsencrypt/live/datadriven-investment.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/datadriven-investment.com/privkey.pem;
  ssl_protocols       TLSv1 TLSv1.1 TLSv1.2;
  ssl_ciphers "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS !MEDIUM";
  ssl_prefer_server_ciphers on;

  location / {
     proxy_set_header Host $host;
     proxy_set_header X-Forwarded-For $remote_addr;
     proxy_set_header X-Forwarded-Proto $scheme;
     proxy_send_timeout         90s;
     proxy_read_timeout         90s;
     proxy_pass http://datadriven-investment-loadbalance;
     
     proxy_cache my_cache;
     add_header X-Proxy-Cache $upstream_cache_status;
  }
  # No ACME challenge handling here: Let's Encrypt uses HTTP for validation
}

server {
  listen              443 ssl http2;
  server_name		  datadriven-investment.com;
  ssl_certificate     /etc/letsencrypt/live/datadriven-investment.com/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/datadriven-investment.com/privkey.pem;
  ssl_protocols       TLSv1 TLSv1.1 TLSv1.2;
  ssl_ciphers "EECDH+ECDSA+AESGCM EECDH+aRSA+AESGCM EECDH+ECDSA+SHA384 EECDH+ECDSA+SHA256 EECDH+aRSA+SHA384 EECDH+aRSA+SHA256 EECDH+aRSA+RC4 EECDH EDH+aRSA RC4 !aNULL !eNULL !LOW !3DES !MD5 !EXP !PSK !SRP !DSS !MEDIUM";
  ssl_prefer_server_ciphers on;

  # redirect from the apex domain to www
    location / {
        return 301 https://www.datadriven-investment.com$request_uri;
    } 
  # No ACME challenge handling here: Let's Encrypt uses HTTP for validation
}
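
The behavior of the three virtual hosts can be summarized as a small decision function. This is only a model of the config above, written to make the routing rules easy to check; the function name and the returned action labels are made up for the illustration.

```python
from typing import Tuple

CANONICAL_HOST = "www.datadriven-investment.com"
APEX_HOST = "datadriven-investment.com"
ACME_PREFIX = "/.well-known/acme-challenge/"


def decide(scheme: str, host: str, uri: str) -> Tuple[str, str]:
    """Return (action, detail) for a request, mirroring the Nginx config."""
    if scheme == "http":
        if uri == ACME_PREFIX:
            return ("status", "404")  # the exact-match location hides the directory
        if uri.startswith(ACME_PREFIX):
            return ("serve_challenge", uri)  # Let's Encrypt validation file
        return ("redirect", f"https://{CANONICAL_HOST}{uri}")
    if host == APEX_HOST:
        return ("redirect", f"https://{CANONICAL_HOST}{uri}")  # apex to www
    return ("proxy", uri)  # www over HTTPS goes to the upstream


if __name__ == "__main__":
    print(decide("http", APEX_HOST, "/about"))
    print(decide("https", APEX_HOST, "/about"))
    print(decide("https", CANONICAL_HOST, "/about"))
```

Every request ends up on the www domain over HTTPS, while the challenge files stay reachable over plain HTTP on both domains so the certificate can be renewed.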

Originally published at Datadriven-investment.com.