<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Keet Malin Sugathadasa on Medium]]></title>
        <description><![CDATA[Stories by Keet Malin Sugathadasa on Medium]]></description>
        <link>https://medium.com/@keetmalin?source=rss-5b64fe2de783------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*OnnyGxhvKyUj9a05woYvfw.jpeg</url>
            <title>Stories by Keet Malin Sugathadasa on Medium</title>
            <link>https://medium.com/@keetmalin?source=rss-5b64fe2de783------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Tue, 12 May 2026 02:00:49 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@keetmalin/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[EKS Cluster Network Architecture for Worker Nodes]]></title>
            <link>https://keetmalin.medium.com/eks-cluster-network-architecture-for-worker-nodes-635e067c8c2a?source=rss-5b64fe2de783------2</link>
            <guid isPermaLink="false">https://medium.com/p/635e067c8c2a</guid>
            <category><![CDATA[kubernetes]]></category>
            <category><![CDATA[networking]]></category>
            <category><![CDATA[ek]]></category>
            <category><![CDATA[vpc]]></category>
            <category><![CDATA[aws]]></category>
            <dc:creator><![CDATA[Keet Malin Sugathadasa]]></dc:creator>
            <pubDate>Mon, 29 Jul 2024 12:41:13 GMT</pubDate>
            <atom:updated>2024-07-29T12:41:13.485Z</atom:updated>
            <content:encoded><![CDATA[<p>AWS EKS (Elastic Kubernetes Service) is Amazon’s managed Kubernetes Service, which works like magic once provisioned. But this would be the default setup of EKS. What if you intend to customize it according to your organization&#39;s designs, compliance standards, and privacy requirements? This is where things get complicated.</p><p>I want to share my experience and issues I ran into when building a fully private EKS cluster, meaning it can only be accessed via the VPC with no internet access. Based on the textbook guidelines, I provisioned the private EKS cluster in a private VPC, and when I tried to attach nodes to my cluster, BOOM!!! it started giving networking errors. Amazon always tries to provide a very user-friendly interface to provision resources in AWS. Still, some subtle attributes in their setup can burn up your entire week, if you don’t fully understand how it works under the hood.</p><p>Today, let me explain how the EKS network is set up, maybe start with a public EKS setup and dive deep into each component and how we can achieve a fully private EKS cluster.</p><h3>Two VPCs for each EKS Cluster</h3><p>An EKS cluster consists of 2 VPCs. The first VPC is managed by AWS where the Kubernetes Control Plane resides within this VPC (this cannot be seen by the users). The second VPC is the customer VPC which we specify during the cluster creation. This is where we place all the worker nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*0iWuyeDR5HVk89EUYSW6pw.png" /><figcaption>2 VPCs for EKS — AWS Managed VPC and Customer Managed VPC</figcaption></figure><h3>Cluster Endpoint Access Types</h3><p>The cluster endpoint configures how the Kubernetes API server can be accessed.</p><ol><li><strong>Public</strong>: The cluster endpoint is accessible from outside of your VPC (Customer Managed VPC). Worker node traffic will leave your VPC (Customer Managed VPC) to connect to the endpoint (in the AWS Managed VPC).</li><li><strong>Public and private</strong>: The cluster endpoint is accessible from outside of your VPC (Customer Managed VPC). Worker node traffic to the endpoint will stay within your VPC (Customer Managed VPC).</li><li><strong>Private</strong>: The cluster endpoint is only accessible through your VPC. Worker node traffic to the endpoint will stay within your VPC.</li></ol><h4>Public Endpoint Only</h4><p>This is the default behavior of the EKS Cluster. Access to the public endpoint can be controlled with the Security Group allowing only known IP ranges to access the EKS control plan. Anyone accessing the EKS Cluster from outside (eg: using kubectl), will enter through the public endpoint, pass the security group rules and access the control plane.</p><p>Any traffic originating from the VPC (eg: worker nodes trying to communicate with the EKS control plane), would leave the VPC, pass the Security Group rules, and access the control plane. Even though the traffic leaves the VPC, it does not leave the AWS network.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*gOMIaTjWbre0HpgOcKp5bA.png" /></figure><p>For these nodes to connect to the EKS Control Plane, it at least requires one of the following:</p><p>1. Public IP address and a route to an Internet Gateway — (where nodes reside in public subnets)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*AyOcbUC88mpz3HKzLfHnYA.png" /></figure><p>2. 
NAT Gateway (which already has a public IP address) — (where nodes are in a private subnet and the NAT Gateway is in a public subnet)</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*SLw5JAZ13MNveLuCjYAR1g.png" /></figure><h4>Public and Private Endpoints</h4><p>This option allows the public endpoint as explained above, but the Customer Managed VPC traffic (eg: worker nodes trying to connect to the EKS control plane) will go through the EKS-managed Elastic Network Interface (ENI) through a private endpoint.</p><p>This situation is ideal if you’d like to allow your cluster to be accessible via the internet, but you’d like to allow your worker nodes to be in a private subnet and communicate with the EKS control plane through a private endpoint.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x5OlfluqJAHSOtZPXBzCVw.png" /></figure><h4>Private Endpoint Only</h4><p>This is the most secure option. But it doesn&#39;t mean that the others are insecure. With the right configurations, every setup can be made secure. With this setup, the worker nodes will talk to the EKS control plane via the EKS-managed ENI.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*PfYB6ugyTuY5yu_T2R9E-Q.png" /></figure><p>If you wish for someone to access the EKS cluster with kubectl, you can allow that to be done from within the VPC. No external traffic will be allowed into the cluster.</p><h3>What Happens when we provision a Worker Node?</h3><p>When we request a new node, it has to do the following.</p><ol><li>A new EC2 instance spin-up</li><li>Install Kubelet and Kubernetes Node Agent as part of the boot process on each node</li><li>Kubelet reaches out to the EKS Control Plane to register the node.</li><li>Kubelet receives API commands from the control plane and regularly sends updates to the control plane on node status, capacity, etc.</li></ol><h3>VPC Configurations for EKS</h3><p>Now that you understand how these different endpoint types work, let’s take a deeper look into different ways we can configure our customer-managed VPC.</p><p>The VPC networking is made up of Subnets and the specified networking configurations like routing, to control the traffic flows. So, there are different ways of configuring your VPC, and let’s take a look at each of them. In this, I’ll also explain the problem I faced with fully private VPC setups.</p><p>A VPC is made up of subnets which can either be public or private. We have the following network combinations possible in an EKS setup.</p><ul><li>Public Subnets only</li><li>Public and Private Subnets</li><li>Private Subnets only</li></ul><h4>Public Subnets only</h4><p>In this setup, all the resources like load balancers and worker nodes will be installed into public subnets. This means, all these resources are accessible from the internet. I mean, they are accessible, but controlled with the right configurations.</p><p>When provisioning worker nodes within the public subnets, each EC2 instance (worker node) will be assigned a public IP on launch. 
This limits the number of nodes, as the number of IP addresses in a given network is limited.</p><p>With this setup, you could use any of the cluster endpoint types for your EKS cluster.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*lANxefOKyvvoMd2qcOe_pQ.png" /></figure><h4>Public and Private Subnets</h4><p>This is the most widely used VPC setup for EKS, where the worker nodes reside within the private subnets and the NAT Gateway and load balancers are placed within the public subnets.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*3gmiOlweWrAV5VE6ySRtHg.png" /></figure><h4>Private Subnets Only</h4><p>This is what we call a fully private VPC. There is no ingress or egress traffic to or from the VPC. For this setup, only the private cluster endpoint should be enabled for your EKS cluster.</p><p>This is an uncommon architecture, but it can be seen in organizations where data is highly sensitive, like banks, hospitals, etc.</p><p>The important thing to note here is that we can easily set up an EKS cluster with a private cluster endpoint and allow the worker nodes to communicate with the EKS control plane; this communication is handled by the EKS-managed ENIs. But for the EKS nodes to spin up, they also require access to a few other AWS services. Forgetting this is a common mistake when provisioning a fully private VPC for EKS.</p><p>In general, EKS nodes require access to the following AWS services (through VPC endpoints) to be able to function within EKS:</p><ul><li>Amazon ECR (to pull down container images)</li><li>Amazon EC2</li><li>Amazon S3</li><li>Amazon CloudWatch Logs</li><li>Amazon STS (for IRSA)</li></ul><p>Without these VPC endpoints, you cannot create a worker node for EKS, because a node requires access to these services to function as an EKS worker node. (Please note that in certain cases you might not need all of the above, but it’s always good to have them in place.)</p><p>For example, ECR access is essential because, in a fully private VPC setup, it is the only place from which container images can be pulled. Without it, your nodes would not be able to install the initial cluster add-ons and function as EKS worker nodes.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7VRMO_bOCMUfSGkKYzRFAA.png" /></figure><p>That’s it. I hope this article helps someone understand and troubleshoot EKS-related issues and provision a much better architecture. I will do another write-up on how to provision these types of clusters using Terraform. Stay tuned.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=635e067c8c2a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Troubleshooting DNS issues with dig]]></title>
            <link>https://keetmalin.medium.com/troubleshooting-dns-issues-with-dig-b90ae7885d1f?source=rss-5b64fe2de783------2</link>
            <guid isPermaLink="false">https://medium.com/p/b90ae7885d1f</guid>
            <category><![CDATA[networking]]></category>
            <category><![CDATA[dig]]></category>
            <category><![CDATA[nameserver]]></category>
            <category><![CDATA[dns]]></category>
            <category><![CDATA[linux]]></category>
            <dc:creator><![CDATA[Keet Malin Sugathadasa]]></dc:creator>
            <pubDate>Sun, 27 Aug 2023 18:41:11 GMT</pubDate>
            <atom:updated>2023-08-27T18:41:11.171Z</atom:updated>
            <content:encoded><![CDATA[<p>Domain Name Systems are responsible for translating human-readable hostnames into IP addresses of backend servers. It sounds straightforward, but in reality, this process is complicated underneath and there are many hops for a request until it reaches the destination.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*tWtUINwwOqAvfwXgSG-zIQ.jpeg" /></figure><p>To continue with this article, you would need a good understanding of how DNS works, its caching, nameserver, and record types.</p><p>Let’s go through some of the basic commands first and dive deep into troubleshooting certain scenarios using these commands.</p><h3>Installing dig</h3><p>First of all, let’s see whether these tools are already installed on your Linux machine. You could also install these on your Windows machine, but it is beyond the scope of this article.</p><p>Let’s install dnsutils which comes with both dig , nslookup and some other dns tools.</p><pre>&gt; apt install dnsutils</pre><p>To verify the installation run the following command:</p><pre>&gt; dig -v<br>DiG 9.10.6</pre><h3>Basic commands for dig</h3><h4>How to use the dig command</h4><pre>dig &lt;@dns-server&gt; &lt;domain&gt; &lt;record-type&gt; &lt;class&gt; &lt;query-options&gt; &lt;default-options&gt;</pre><ul><li>&lt;@dns-server&gt; → Default is the local DNS server given in /etc/resolve.conf . But we can also specify a server with the @ sign</li><li>&lt;domain&gt; → Domain or host address to be queried</li><li>&lt;record-type&gt; → is one of (a,any,mx,ns,soa,hinfo,axfr,txt,…) [default:a]</li><li>&lt;class&gt; → is one of (in,hs,ch,…) [default: in]</li><li>&lt;query-options&gt; → query options start with - . Some common query options are</li></ul><pre>-b address[#port]   (bind to source address/port)<br>-p port             (specify port number)<br>-x dot-notation     (shortcut for reverse lookups)</pre><ul><li>&lt;display-options&gt; → is of the form +keyword[=value]. Some common display options are</li></ul><pre> +[no]additional     (Control display of additional section)<br> +[no]answer         (Control display of answer section)<br> +[no]authority      (Control display of authority section)<br> +[no]cl             (Control display of class in records)<br> +[no]comments       (Control display of comment lines)<br> +[no]expire         (Request time to expire)<br> +[no]fail           (Don&#39;t try next server on SERVFAIL)<br> +[no]question       (Control display of question section)<br> +[no]short          (Display nothing except short form of answer)<br> +[no]trace          (Trace delegation down from root [+dnssec])</pre><h4>Understanding the output of dig</h4><pre>➜  ~ dig google.com<br><br>; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; google.com<br>;; global options: +cmd<br>;; Got answer:<br>;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOERROR, id: 39959<br>;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1<br><br>;; OPT PSEUDOSECTION:<br>; EDNS: version: 0, flags:; udp: 4096<br>;; QUESTION SECTION:<br>;google.com.   IN A<br><br>;; ANSWER SECTION:<br>google.com.  200 IN A 74.125.68.138<br>google.com.  200 IN A 74.125.68.102<br>google.com.  200 IN A 74.125.68.113<br>google.com.  200 IN A 74.125.68.139<br>google.com.  200 IN A 74.125.68.101<br>google.com.  
200 IN A 74.125.68.100<br><br>;; Query time: 67 msec<br>;; SERVER: 172.20.10.1#53(172.20.10.1)<br>;; WHEN: Sun Aug 27 18:58:34 +0530 2023<br>;; MSG SIZE  rcvd: 135</pre><p>Let’s break down each section and try to understand the response.</p><ul><li>Lines starting with ; are comments. They do not include the actual DNS server details.</li><li>; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; google.com → shows the dig version and the query we entered</li><li>The HEADER section is the response it received from the DNS server. The flags refer to the answer section.</li></ul><pre>;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOERROR, id: 39959<br>;; flags: qr rd ra; QUERY: 1, ANSWER: 6, AUTHORITY: 0, ADDITIONAL: 1</pre><ul><li>The OPT PSEUDOSECTION displays advanced data.</li></ul><p><strong>EDNS</strong> — Extension system for DNS, if used<br><strong>Flags</strong> — blank because no flags were specified<br><strong>UDP</strong> — UDP packet size</p><pre>;; OPT PSEUDOSECTION:<br>; EDNS: version: 0, flags:; udp: 4096</pre><ul><li>The QUESTION SECTION displays the query that was sent.</li></ul><p>google.com. → domain name queried</p><p>IN → type of query (IN: Internet)</p><p>A → DNS record type</p><pre>;; QUESTION SECTION:<br>;google.com.   IN A</pre><ul><li>The ANSWER SECTION is what gives the relevant records</li></ul><p>google.com. → domain name queried</p><p>200 → this is the TTL for each record in seconds</p><p>IN → type of query (IN: Internet)</p><p>A → DNS record type</p><p>74.125.68.138 → the IP address associated with the domain name</p><pre>;; ANSWER SECTION:<br>google.com.  200 IN A 74.125.68.138<br>google.com.  200 IN A 74.125.68.102<br>google.com.  200 IN A 74.125.68.113<br>google.com.  200 IN A 74.125.68.139<br>google.com.  200 IN A 74.125.68.101<br>google.com.  200 IN A 74.125.68.100</pre><ul><li>The STATISTICS section shows metadata about the query</li></ul><p><strong>Query time</strong> — The amount of time it took for a response<br><strong>SERVER</strong> — The IP address and port of the responding DNS server. You may notice a loopback address in this line — this refers to a local setting that translates DNS addresses. Usually found in /etc/resolv.conf<br><strong>WHEN</strong> — Timestamp when the command was run<br><strong>MSG SIZE rcvd </strong>— The size of the reply from the DNS server</p><pre>;; Query time: 67 msec<br>;; SERVER: 172.20.10.1#53(172.20.10.1)<br>;; WHEN: Sun Aug 27 18:58:34 +0530 2023<br>;; MSG SIZE  rcvd: 135</pre><h3>Troubleshooting with dig</h3><h4>Check if the server is reachable</h4><p>In the HEADER SECTION , the status shows the status of the DNS or backend server.</p><ul><li><strong>NOERROR</strong> — Everything’s cool. 
The zone is being served by the requested authority without issues.</li><li><strong>SERVFAIL</strong> — The name that was queried exists, but there’s no data or invalid data for that name at the requested authority.</li><li><strong>NXDOMAIN</strong> — The name in question does not exist, and therefore there is no authoritative DNS data to be served.</li><li><strong>REFUSED</strong> — Not only does the zone not exist at the requested authority, but their infrastructure is not in the business of serving things that don’t exist at all.</li></ul><p>Let’s try out a few examples</p><pre>➜  ~ dig google.com<br><br>....<br>;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOERROR, id: 46275<br>....</pre><pre>➜  ~ dig keetmalin.com<br><br>....<br>;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NXDOMAIN, id: 58183<br>....</pre><h4>Specify a DNS resolver for a query</h4><p>This helps you understand how a DNS is resolved by a public DNS resolver. By default, dig uses the local configuration given in /etc/resolv.conf to decide which nameserver to query. These are known as public DNS resolvers. Some examples are</p><ul><li>cloudflare: 1.1.1.1</li><li>google: 8.8.8.8</li><li>quad9: 9.9.9.9</li></ul><pre>➜  ~ dig @1.1.1.1 google.com<br><br>; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; @1.1.1.1 google.com<br>; (1 server found)<br>;; global options: +cmd<br>;; Got answer:<br>;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOERROR, id: 18888<br>;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1<br><br>;; OPT PSEUDOSECTION:<br>; EDNS: version: 0, flags:; udp: 1232<br>;; QUESTION SECTION:<br>;google.com.   IN A<br><br>;; ANSWER SECTION:<br>google.com.  114 IN A 142.251.37.46<br><br>;; Query time: 200 msec<br>;; SERVER: 1.1.1.1#53(1.1.1.1)<br>;; WHEN: Sun Aug 27 20:32:08 +0530 2023<br>;; MSG SIZE  rcvd: 55</pre><h4>Verify whether the DNS reaches the final server IP</h4><p>This is probably the most common use case of dig. Let’s see the final server IP address of <a href="http://keetmalin.medium.com">keetmalin.medium.com</a>. You can also do the same for your domains and see if it lists your relevant backend servers.</p><p>As you can see below, <a href="http://keetmalin.medium.com">keetmalin.medium.com</a> points to 2 backend servers 162.159.152.4 and 162.159.153.4.</p><pre>➜  ~ dig keetmalin.medium.com<br><br>; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; keetmalin.medium.com<br>;; global options: +cmd<br>;; Got answer:<br>;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NOERROR, id: 55713<br>;; flags: qr rd ra; QUERY: 1, ANSWER: 2, AUTHORITY: 0, ADDITIONAL: 1<br><br>;; OPT PSEUDOSECTION:<br>; EDNS: version: 0, flags:; udp: 4096<br>;; QUESTION SECTION:<br>;keetmalin.medium.com.  IN A<br><br>;; ANSWER SECTION:<br>keetmalin.medium.com. 377 IN A 162.159.153.4<br>keetmalin.medium.com. 377 IN A 162.159.152.4<br><br>;; Query time: 77 msec<br>;; SERVER: 172.20.10.1#53(172.20.10.1)<br>;; WHEN: Sun Aug 27 21:42:46 +0530 2023<br>;; MSG SIZE  rcvd: 81</pre><h4>Trace hostname A records and DNS name server records</h4><p>For the sake of this article, let’s do this with a single command. If I summarise the journey of a request in DNS, it goes through the following</p><ul><li><strong>DNS recursive resolver</strong>: This is the first stop in a DNS query. The recursive resolver acts as a middleman between a client and a DNS nameserver. 
It accepts a hostname from the client, checks with DNS servers and returns an IP address to the client.</li><li><strong>DNS root nameserver</strong>: The 13 DNS root nameservers are known to every recursive resolver, and they are the first stop in a recursive resolver’s quest for DNS records. A root server accepts a recursive resolver’s query which includes a domain name, and the root nameserver responds by directing the recursive resolver to a TLD nameserver, based on the extension of that domain (.com, .net, .org, etc.).</li><li><strong>TLD nameserver</strong>: A TLD nameserver maintains information for all the domain names that share a common domain extension, such as .com, .net, or whatever comes after the last dot in a URL. This nameserver responds by directing the resolver to an authoritative nameserver.</li><li><strong>Authoritative nameserver:</strong> The authoritative nameserver is usually the resolver’s last step in the journey for an IP address. This returns an A record, or a CNAME record pointing to an alias A record. However it is done, the resolver should get an IP address if it exists.</li></ul><p>Now let’s debug this with +trace , for the hostname: <a href="http://keetmalin.medium.com">keetmalin.medium.com</a>.</p><pre>➜  ~ dig keetmalin.medium.com +trace<br><br>; &lt;&lt;&gt;&gt; DiG 9.10.6 &lt;&lt;&gt;&gt; keetmalin.medium.com +trace<br>;; global options: +cmd<br>.   2745 IN NS a.root-servers.net.<br>.   2745 IN NS c.root-servers.net.<br>.   2745 IN NS j.root-servers.net.<br>.   2745 IN NS m.root-servers.net.<br>.   2745 IN NS g.root-servers.net.<br>.   2745 IN NS e.root-servers.net.<br>.   2745 IN NS d.root-servers.net.<br>.   2745 IN NS h.root-servers.net.<br>.   2745 IN NS l.root-servers.net.<br>.   2745 IN NS i.root-servers.net.<br>.   2745 IN NS b.root-servers.net.<br>.   2745 IN NS f.root-servers.net.<br>.   2745 IN NS k.root-servers.net.<br>;; Received 239 bytes from 172.20.10.1#53(172.20.10.1) in 42 ms<br><br>com.   172800 IN NS e.gtld-servers.net.<br>com.   172800 IN NS b.gtld-servers.net.<br>com.   172800 IN NS a.gtld-servers.net.<br>com.   172800 IN NS d.gtld-servers.net.<br>com.   172800 IN NS i.gtld-servers.net.<br>com.   172800 IN NS f.gtld-servers.net.<br>com.   172800 IN NS j.gtld-servers.net.<br>com.   172800 IN NS k.gtld-servers.net.<br>com.   172800 IN NS c.gtld-servers.net.<br>com.   172800 IN NS g.gtld-servers.net.<br>com.   172800 IN NS h.gtld-servers.net.<br>com.   172800 IN NS l.gtld-servers.net.<br>com.   172800 IN NS m.gtld-servers.net.<br>com.   86400 IN DS 30909 8 2 E2D3C916F6DEEAC73294E8268FB5885044A833FC5459588F4A9184CF C41A5766<br>com.   86400 IN RRSIG DS 8 1 86400 20230909050000 20230827040000 11019 . Jg5GfQXXzS37hg6eMjYtwr4Sq7L7ojgJS2bsLN8/zxv8K3i+R4Lj8v9j nfUT5v2MImm1rel0y0NLfNMGuJhawllnCWrgYoGPbm6+lZugixjLkm/7 XWcR4/vXDwBDChPp7+wBH5K97yk+b8NraN4/F/J5Xf+PuJKdoKLH5pSn vxkt1pf25fqvfAhiofiJRvrZ6ayVJf0p5svBArSgDvB+YdbV/x6AA5PD d7tkNNR4QCGLAtzj5hRKe5UJUTsFmXuohlNEnSo+tPUboNAhWNfGEXez JgCkZjfFQOB4UlnYcwAit4ocWYj19o3Voa5iUXg4FCQnXoZ0ETdxMLQ9 X6pduA==<br>;; Received 1180 bytes from 192.203.230.10#53(e.root-servers.net) in 182 ms<br><br>medium.com.  172800 IN NS kip.ns.cloudflare.com.<br>medium.com.  172800 IN NS alina.ns.cloudflare.com.<br>CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 86400 IN NSEC3 1 1 0 - CK0Q2D6NI4I7EQH8NA30NS61O48UL8G5  NS SOA RRSIG DNSKEY NSEC3PARAM<br>CK0POJMG874LJREF7EFN8430QVIT8BSM.com. 86400 IN RRSIG NSEC3 8 2 86400 20230903042431 20230827031431 4459 com. 
OROhP15YNzk5K7aj9L7UbApWHWLqii/AeMOR8oZdBfKYWk3+wmtoeD+R PjYwx4pOLvcrxaA6nV/kCkdNyU1cPv1434vMHkrjbarCx+Ri+tzpEByX jtTLGNzY3XAF3MoyU+nbYqM0PNRCgujYUV+PW3AR2J4UxQviKuyrCAaO 0cbrds7Qw8bxSwGHRz3F2NbkHc0D9gNjkNBFSCvo+8ttxw==<br>78A5DP9D1TN3VTQRVL40V82OTKSFKVFP.com. 86400 IN NSEC3 1 1 0 - 78A5S0E2MBIKEB8RG0QO9ULGBOQQFBCP  NS DS RRSIG<br>78A5DP9D1TN3VTQRVL40V82OTKSFKVFP.com. 86400 IN RRSIG NSEC3 8 2 86400 20230903045253 20230827034253 4459 com. SSMkDi+YTY/Wn+VrjbfCEMpxTFYflgCbMI/lxhf1UqqiN7kB5BWpf7uR +A0nrovj5mVwKF5b4BN/CzpEfey02Amt0xuXq+XCoecVUtWHNFxJ6Pre IK4v4TnPVKR7qNWw/t295rXiENoc1vRtpPYAGokXc6wFa41+fYZsmg+S 2inELesbkFAQYpXCpLUYk0Mn4J0FbC21gzooRCvlluJByA==<br>;; Received 914 bytes from 192.43.172.30#53(i.gtld-servers.net) in 170 ms<br><br>keetmalin.medium.com. 300 IN A 162.159.152.4<br>keetmalin.medium.com. 300 IN A 162.159.153.4<br>;; Received 81 bytes from 173.245.58.61#53(alina.ns.cloudflare.com) in 176 ms</pre><ul><li>Your <strong>DNS recursive resolver</strong> is 172.20.10.1 and it knows a set of root-servers. And it queries from e.root-servers.net .</li></ul><pre>.   2745 IN NS a.root-servers.net.<br>.   2745 IN NS c.root-servers.net.<br>.   2745 IN NS j.root-servers.net.<br>.   2745 IN NS m.root-servers.net.<br>.   2745 IN NS g.root-servers.net.<br>.   2745 IN NS e.root-servers.net.<br>.   2745 IN NS d.root-servers.net.<br>.   2745 IN NS h.root-servers.net.<br>.   2745 IN NS l.root-servers.net.<br>.   2745 IN NS i.root-servers.net.<br>.   2745 IN NS b.root-servers.net.<br>.   2745 IN NS f.root-servers.net.<br>.   2745 IN NS k.root-servers.net.<br>;; Received 239 bytes from 172.20.10.1#53(172.20.10.1) in 42 ms</pre><ul><li>Your <strong>DNS root nameserver</strong> is e.root-servers.net. Since our initial query hostname ended with .com , this root server will direct your resolver to the .com TLD nameservers. In this case it i.gtld-servers.net.</li></ul><pre>com.   172800 IN NS e.gtld-servers.net.<br>com.   172800 IN NS b.gtld-servers.net.<br>com.   172800 IN NS a.gtld-servers.net.<br>com.   172800 IN NS d.gtld-servers.net.<br>com.   172800 IN NS i.gtld-servers.net.<br>com.   172800 IN NS f.gtld-servers.net.<br>com.   172800 IN NS j.gtld-servers.net.<br>com.   172800 IN NS k.gtld-servers.net.<br>com.   172800 IN NS c.gtld-servers.net.<br>com.   172800 IN NS g.gtld-servers.net.<br>com.   172800 IN NS h.gtld-servers.net.<br>com.   172800 IN NS l.gtld-servers.net.<br>com.   172800 IN NS m.gtld-servers.net.<br>com.   86400 IN DS 30909 8 2 E2D3C916F6DEEAC73294E8268FB5885044A833FC5459588F4A9184CF C41A5766<br>com.   86400 IN RRSIG DS 8 1 86400 20230909050000 20230827040000 11019 . Jg5GfQXXzS37hg6eMjYtwr4Sq7L7ojgJS2bsLN8/zxv8K3i+R4Lj8v9j nfUT5v2MImm1rel0y0NLfNMGuJhawllnCWrgYoGPbm6+lZugixjLkm/7 XWcR4/vXDwBDChPp7+wBH5K97yk+b8NraN4/F/J5Xf+PuJKdoKLH5pSn vxkt1pf25fqvfAhiofiJRvrZ6ayVJf0p5svBArSgDvB+YdbV/x6AA5PD d7tkNNR4QCGLAtzj5hRKe5UJUTsFmXuohlNEnSo+tPUboNAhWNfGEXez JgCkZjfFQOB4UlnYcwAit4ocWYj19o3Voa5iUXg4FCQnXoZ0ETdxMLQ9 X6pduA==<br>;; Received 1180 bytes from 192.203.230.10#53(e.root-servers.net) in 182 ms</pre><ul><li>Your <strong>TLD nameserver</strong> is i.gtld-servers.net . This server knows where the medium.com domain (which is the domain for the subdomain keetmalin.medium.com) resides. There are two locations and it picks alina.ns.cloudflare.com, which is an authoritative nameserver.</li></ul><pre>medium.com.  172800 IN NS kip.ns.cloudflare.com.<br>medium.com.  
172800 IN NS alina.ns.cloudflare.com.<br>...<br>;; Received 914 bytes from 192.43.172.30#53(i.gtld-servers.net) in 170 ms</pre><ul><li>Your <strong>Authoritative nameserver</strong> is alina.ns.cloudflare.com . And this returns 2 A records, pointing to 2 server IPs. 162.159.152.4 and 162.159.153.4.</li></ul><p>If you run the dig command again, you will be able to see that the selected nameservers at each step are different. At each step, since there are multiple nameservers the DNS resolver decides which one to use.</p><h4>Reverse DNS lookup</h4><p>This is helpful when you have the IP address and would like to know the domains or subdomains associated with it. For this to work, the network admins of that website should have set up the PTR Records to allow reverse lookups.</p><p>As an example, let’s look at sb-in-f101.1e100.net and let’s find the IP of this hostname.</p><pre>➜  ~ dig sb-in-f101.1e100.net +nocomment +nostat<br>...<br>sb-in-f101.1e100.net. 2158 IN A 74.125.130.101</pre><p>If we take this IP and do a reverse lookup, it should have a PTR record pointing to the above hostname.</p><pre>➜  ~ dig -x 74.125.130.101 +nocomment +nostat<br>...<br>101.130.125.74.in-addr.arpa. 4502 IN PTR sb-in-f101.1e100.net.</pre><p>In the above example, the reverse lookup status was NOERROR since there was a PTR Record in place. But let’s look at the following IP of keetmalin.medium.com which does not have a PTR.</p><pre>➜  ~ dig keetmalin.medium.com +nocomment +nostat<br>...<br>keetmalin.medium.com. 377 IN A 162.159.152.4<br>keetmalin.medium.com. 377 IN A 162.159.153.4</pre><pre>➜  ~ dig -x 162.159.153.4<br>...<br>;; -&gt;&gt;HEADER&lt;&lt;- opcode: QUERY, status: NXDOMAIN, id: 45635<br>...</pre><p>Here you can clearly see that the status is NXDOMAIN , and this is because the PTR record is not in place for this host.</p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b90ae7885d1f" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Common DNS record types explained]]></title>
            <link>https://keetmalin.medium.com/common-dns-record-types-explained-fe0a83d20115?source=rss-5b64fe2de783------2</link>
            <guid isPermaLink="false">https://medium.com/p/fe0a83d20115</guid>
            <category><![CDATA[dns]]></category>
            <category><![CDATA[networking]]></category>
            <category><![CDATA[dns-records]]></category>
            <category><![CDATA[route-53]]></category>
            <dc:creator><![CDATA[Keet Malin Sugathadasa]]></dc:creator>
            <pubDate>Sat, 26 Aug 2023 12:19:54 GMT</pubDate>
            <atom:updated>2023-08-26T12:19:54.834Z</atom:updated>
            <content:encoded><![CDATA[<p>Domain Name Systems (or DNS) is a global system that is responsible for translating human-readable hostnames into their corresponding IP (Internet Protocol) addresses. DNS record types on the other hand are entries that explain how to resolve each hostname. The DNS resolver uses these DNS records to translate the hostname into relevant IP addresses.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Eo-48fRMP-_ii3nMe8X9zA.png" /></figure><p>In this article, let&#39;s take a look at the most common DNS record types and understand when they are used.</p><h3>What is a DNS record?</h3><p>DNS records (aka zone files) are simple instructions on how a hostname should be mapped to an IP address. These records live in DNS servers. Each record contains the following attributes (at least).</p><ul><li><strong>name</strong>: this will be the hostname (e.g: example.com)</li><li><strong>value</strong>: the IP address or another value based on the record type (e.g. 10.20.30.40)</li><li><strong>type</strong>: this is the record type (e.g. A record)</li><li><strong>TTL</strong>: Time To Live, indicates how often a DNS server will refresh that record (e.g. 60 seconds)</li></ul><h3>Common DNS Record Types</h3><p>Let’s take a quick look at the following most common DNS record types. This list also covers all the AWS Route53 record types.</p><ol><li>A records</li><li>AAAA records</li><li>CNAME records</li><li>NS records</li><li>MX records</li><li>TXT Records</li><li>CERT Records</li><li>PTR Records</li><li>SRV Records</li><li>CAA Records</li></ol><h4>A Records (Address Records)</h4><p>This is the most common and most important DNS record type. An A record shows the IPV4 address for a specific hostname or domain. These records reside at the authoritative DNS servers.</p><ul><li>Type: A</li><li>Domain Name: example.com</li><li>IP Address: 10.24.34.44</li><li>TTL: 1 hour</li></ul><h4>AAAA Records</h4><p>Same as A Records, but these records point to an IPV6 address for a specific hostname or domain.</p><ul><li>Type: AAAA</li><li>Domain Name: example.com</li><li>IP Address: 2001:db8:3333:4444:5555:6666:7777:8888</li><li>TTL: 1 hour</li></ul><h4>CNAME Records (Canonical NAME Records)</h4><p>These records point a domain name or hostname (aka alias) to another domain name or hostname (aka canonical name). They do <strong>not</strong> point to an IP address. It is important to note that you can add only one CNAME record per hostname.</p><p>This can prove convenient when running multiple services (like an FTP server and a web server, each running on different ports) from a single IP address.</p><p>CNAME records usually contain subdomains that point to a domain’s A or AAAA record. This prevents having to create an extra A or AAAA record for each subdomain.</p><p>It is not recommended to have CNAME records pointing to other CNAME records, as this creates unnecessary steps in the DNS lookup process.</p><ul><li>Type: CNAME</li><li>Domain Name (alias): ftp.example.com</li><li>Domain Name (canonical name): example.com</li><li>TTL: 1 hour</li></ul><h4>NS Records (Name Server Records)</h4><p>These records specify an Authoritative DNS server for a domain or a hostname. NS records help find the right DNS server for browsers to find the IP address for a domain name. When a browser is resolving a DNS record, usually it asks from multiple nameservers until it locates the correct Authoritative DNS server to fetch the IP address. 
Basically, it specifies that a DNS Zone, such as “example.com” is delegated to a specific Authoritative Name Server, and provides the address of the name server.</p><ul><li>Type: NS</li><li>Domain Name: example.com</li><li>Name Server: ns1.example.com</li><li>TTL: 1 hour</li></ul><h4>MX Records (Mail Exchange Records)</h4><p>These records show where emails for a domain should be routed. This allows traffic following the SMTP protocol, to be routed to their relevant mail servers (mail exchange).</p><p>Since mail servers also have backup mail servers, you are allowed to have multiple MX records for the same domain. For this the attribute Priority routes traffic to the primary and backup mail servers. For example, MX record with priority 10 will be the primary mail server, while the secondary server will only be used when the primary server is unavailable (or fails to send emails).</p><p>An MX record can only point to a name of an email server. This means that each referenced email server must also have a valid A record specifying its IP address</p><ul><li>Type: MX</li><li>Domain Name: example-mail.com</li><li>Mail Server: mail.example.com</li><li>Priority: 10</li><li>TTL: 1 hour</li></ul><h4>TXT Records (Text Records)</h4><p>Allows administrators to add limited human and machine-readable notes and can be used for things such as email validation, site, and ownership verification, framework policies, etc., and doesn’t require specific formatting.</p><p>The TXT record allows you to add and store text-based information about a domain name. There are all kinds of TXT records and some of them people can easily understand, and others are specifically for machines to read.</p><ul><li>Type: TXT</li><li>Domain Name: example.com</li><li>Value: verification=some-server.com (any text you want)</li><li>Priority: 10</li></ul><h4>CERT Records (Certificate Records)</h4><p>CERT records provide a space for storing certificates and related certificate revocation lists (CRL). The certificates can verify the authenticity of sending and receiving parties, while CRLs identify unauthorized parties.</p><ul><li>Type: CERT</li><li>Domain Name: example.com (domain name which is being certified)</li><li>Value: (Base 64 encoded string of the certificate)</li><li>Cert Type: PGP (Defines the type of certificate/CRL used. Eg: PKIX, SPKI, etc)</li><li>Algorithm: RSA (algorithm used to produce the certificate/CRL)</li></ul><h4>PTR Records (Pointer Records)</h4><p>This provides a domain name for reverse lookup. It’s the opposite of an A record as it provides the domain name linked to an IP address instead of the IP address for a domain.</p><ul><li>Type: PTR</li><li>IP Address: 10.22.11.40</li><li>Value: example.com</li><li>TTL: 1 hour</li></ul><h4>SRV Records (Service Records)</h4><p>With this, it is possible to store the IP address and port for specific services. It allows services such as instant messaging or VoIP to be directed to a separate host and port location.</p><ul><li>Type: SRV</li><li>Service: name of the service (eg: xmpp-server)</li><li>Value: example.com. 
(the canonical hostname of the machine providing the service, ending in a dot.)</li><li>Protocol: TCP or UDP</li><li>TTL: 1 hour</li><li>Port: 3333 (The TCP or UDP port the service is running on)</li><li>Priority: 23 (The priority of the target host; a lower value means more preferred among records for the same service)</li><li>Weight: 12 (A relative weight for records with the same priority; a higher value means more preferred)</li></ul><h4>CAA Records (Certification Authority Authorization Records)</h4><p>This allows domain owners to state which certificate authorities can issue certificates for that domain. If no CAA record exists, then any certificate authority can issue a certificate for the domain. CAA records can set policy for the entire domain, or for specific hostnames.</p><p>They are also inherited by subdomains; therefore, a CAA record set on domain.com will also apply to any subdomain, such as subdomain.domain.com (unless overridden).</p><ul><li>Type: CAA</li><li>Domain: example.com (Domain name/Subdomain)</li><li>Flag: 0/128 (0 means non-critical, 128 means critical)</li><li>Tag: issue/issuewild/iodef</li><li>Value: caa.example.com (The value given by the preferred CA)</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fe0a83d20115" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Infrastructure Drift Detection with Crossplane]]></title>
            <link>https://faun.pub/infrastructure-drift-detection-with-crossplane-c365386f133?source=rss-5b64fe2de783------2</link>
            <guid isPermaLink="false">https://medium.com/p/c365386f133</guid>
            <category><![CDATA[terraform]]></category>
            <category><![CDATA[crossplane]]></category>
            <category><![CDATA[iac]]></category>
            <dc:creator><![CDATA[Keet Malin Sugathadasa]]></dc:creator>
            <pubDate>Thu, 14 Jul 2022 17:37:29 GMT</pubDate>
            <atom:updated>2022-07-21T11:20:04.648Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*ibiVe5zWvFfl0Qg_RtJ6QQ.png" /></figure><p>Early in the day, infrastructure management tools were called configuration management tools. It is the same concept as IaC (Infrastructure as Code) but bundled differently. Some of the tools we used were CFEngine, Chef, Puppet, Terraform, and Pulumi. Ansible is another configuration management tool that was trending over Chef and Puppet.</p><p>All these tools managed to fulfill the infrastructure-related requirements we had back then. Each of these tools had its own engines, sometimes own languages, and own set of APIs. Even though it used to cater to our requirements back then, there are much bigger problems in IaC today. Some of the problems are:</p><ul><li><strong>Identifying drift detection </strong>(monitor and identify infrastructure changes, where these changes will only be fixed after the subsequent execution of the tool)</li><li><strong>Auto synchronization of state</strong> (when the current state changes, there is no automatic synchronization to bring the current state to the desired state)</li><li><strong>A common API </strong>(a common API for services, applications, and infrastructure to deal with the provisioning)</li><li><strong>State management </strong>(some state objects need to be managed safely and securely giving the burden on the infrastructure engineers)</li></ul><p>There are ways to tackle some of these problems and some are inherent problems that we need to live with. But today, the needs are different.</p><p>If you have a look at services deployed in Kubernetes, you will see that they are managed by configs (manifests), and maintained by the platform (Kubernetes) itself. If the service drifts away from its desired state, Kubernetes will bring it back to its desired state. If a pod gets killed accidentally, it will be restarted.</p><p>But, what if this approach could also be used to maintain infrastructure in your desired platform? Whenever an infrastructure component drifts in configuration, it will be automatically brought back to its desired state.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*Y4L6GS-Ul1bMI0PhQhiowg.png" /><figcaption><a href="https://crossplane.io/">https://crossplane.io/</a></figcaption></figure><p>This is where Crossplane comes into place. This is the next generation of Infrastructure as Code, which manages auto synchronization, drift detection, and all through the Kubernetes API, just like how we manage our services in Kubernetes.</p><p>Imagine someone deleting a Virtual Machine or manually changing something in a cluster. Terraform will not detect this nor fix the drift. Today the need for GitOps is different. It’s not just to store definitions in Git and manage infrastructure, but to detect drift and have automatic synchronization. Even Pulumi doesn’t have this. Although this is not directly supported, there are ways to tackle this problem in other ways.</p><p>This article will talk to you about how we can use Crossplane to manage your infrastructure with automatic drift detection and synchronization. In the end, let’s have a look at some examples to deploy a simple cloud resource via Crossplane.</p><h3>What is Drift Detection</h3><p>Drift detection enables you to detect whether a stack’s actual configuration differs, or has drifted, from its expected configuration. 
A drift in service is detected, when certain attributes of a component are different from the desired (expected) state or if the component is deleted.</p><p>When it comes to drifting off service or infrastructure, we use the term synchronization to discuss that problem. When the drifting happens, we call it unsynchronized. When the drifting is fixed, it becomes synchronized. Mostly, the synchronization of IaC is done manually. But Crossplane gives you the inherent capability of detecting and handling drift automatically.</p><h3>What is Crossplane</h3><p>Crossplane is an open source, CNCF project built on the foundation of Kubernetes to orchestrate anything. Encapsulate policies, permissions, and other guardrails behind a custom API line to enable your customers to self-service without needing to become an infrastructure expert.</p><p><a href="https://crossplane.io/">Crossplane</a></p><h3>Why Crossplane over Terraform</h3><p>Crossplane is often compared with Terraform, as both of them serve the same purpose, but with different capabilities. Some of the common features are</p><ul><li>Manage infrastructure in a declarative language configuration</li><li>Support all major cloud platforms via provider plugins</li><li>Open Source tools</li></ul><p>Terraform is a command line tool (CLI) that provides an interface to a control plane. Crossplane on the other hand is a control plane and uses Kubernetes as the API. Let’s talk about some key problems faced by Terraform which are addressed in Crossplane</p><ul><li><strong>State lock and wait time for other collaborators (Strong consistency)</strong></li></ul><p>Terraform is a great way to manage infrastructure with declarative configurations and version controlling for a small team of engineers. But this can fall apart when more engineers want to collaborate and maintain the infrastructure. In Terraform, a lock must be held on the state file while the configuration is being applied, and applying a Terraform configuration is a blocking process that can take minutes to complete. During this time, no other engineer can apply changes to the configuration.</p><ul><li><strong>Parallel updates are not allowed to different components (monolithic state)</strong></li></ul><p>Terraform uses a monolithic state file and this file gets locked for every piece of modification made to the configuration. If engineers need to update separate resources at the same time, they would have to wait for each of them to run, one after the other. Terraform provides an option for scoped configurations, but it is a bit complicated.</p><ul><li><strong>Calculate a graph of dependencies and differences to make a change</strong></li></ul><p>With each planning, a graph of dependencies is calculated to figure out the sequence of executions. The same happens for the destroy operation as well. When you have an entire production environment in Terraform, it will contain a complex dependency graph and diff calculations. Crossplane on the other hand uses the Crossplane Resource Model (XRM) and promotes loose coupling and eventual consistency. In Crossplane every piece of infrastructure is an API endpoint that supports create, read, update, and delete operations. 
So you can easily operate on a single database, even if you manage your entire production environment with Crossplane.</p><ul><li><strong>Developers need to learn HCL in order to provision resources</strong></li></ul><p>When it comes to collaboration and self-service, each user needs to understand and learn the Hashicorp Configuration Language (HCL). Therefore, what happens in organizations is that Terraform is owned and managed by an infrastructure/platform team which enables development teams to use the resources they provide.</p><ul><li><strong>Cannot easily grant specific access controls to teams</strong></li></ul><p>Access control remains down at the cloud provider API level and not each resource level. (eg: Delete access to RDS and not database x). For example, if the infrastructure team invites the application teams to manage their own databases, they would need group permissions like RDS read access. The Crossplane equivalent of a Terraform module is a Composite Resource — an XR. Each XR is exposed as an API endpoint. In each API, we can enforce Role Based Access Control (RBAC) at the API level. RBAC access can be given to each team’s database: read only for team&#39;s database A, rather than having to manage access to various underlying cloud concepts like RDS instances or subnet groups. Because Crossplane builds on the battle-hardened Kubernetes RBAC system, a platform team can easily support many teams of application developers within a single control plane.</p><ul><li><strong>Attempts to reconcile your desired state on-demand only (configuration drift)</strong></li></ul><p>Terraform is a command line tool, not a control plane. Because it is a short-lived, one-shot process it will only attempt to reconcile your desired configuration with actual infrastructure when it is invoked. Whether run from a CI/CD pipeline or a laptop Terraform is typically invoked only when an engineer expects that infrastructure needs updating. Crossplane, on the other hand, is built as a series of long-lived, always-on control loops. It constantly observes and corrects an organization’s infrastructure to match its desired configuration whether changes are expected or not. When Crossplane has been asked to manage a piece of infrastructure any change made outside it will automatically and persistently be reverted.</p><ul><li><strong>Additional overhead in setting up GitOps</strong></li></ul><p>Terraform does not have its own API and Terraform works differently from how services are deployed in an organization. Therefore, it needs a specific setup and maintenance of its own. In this process, with version controlling, we would like for our CI/CD pipelines to execute Terraform as part of its pipeline. This is an improvement relative to a team running Terraform from their laptops.</p><ul><li><strong>All or nothing provisioning of resources for each apply</strong></li></ul><p>The process of applying in Terraform is an “all or nothing” process. If you decide to apply something in Terraform, you need to update all components: cache, state file, and the configuration of infrastructure. This means that if anyone in your organization circumvents Terraform, the next person to trigger a Terraform run will be faced with a surprising plan as it attempts to undo the change. During an incident, if you decide to update the infrastructure without letting Terraform know, that’s gonna open up a can of worms that you would have to deal with later. 
Hence, configuration around Terraform can be risky.</p><ul><li><strong>Terraform has no API. Has to be invoked via command line tools</strong></li></ul><p>Integrating Terraform with CI/CD or other scripts becomes challenging as it does not provide an API. Terraform is a command line tool, and each of these scripts needs to execute a command in order to run Terraform. Crossplane on the other hand can be contacted via an API. Whether the team decides to write a Python script or a separate tool, crossplane can be invoked by a simple API call. This API is not a typical REST API. Building on the Kubernetes API means that teams can orchestrate all of their infrastructure — cloud and otherwise — using tools like kubectl. Crossplane can even expose the details an application needs to connect to infrastructure as a Kubernetes secret to ease integration. It can be paired with projects like ArgoCD, Gatekeeper, or Velero to enable GitOps, advanced policy, and backups.</p><h3>Let’s Try it Out</h3><p>Given below is a working example for you to try out Crossplane and see how it works.</p><h4>Step 1: Get access to a Kubernetes cluster</h4><p>As the first step, you need a Kubernetes cluster. This could be a cloud cluster like AKS or EKS, but you can also do this for free by spinning up a Kubernetes cluster locally via <a href="https://minikube.sigs.k8s.io/docs/">minikube</a>.</p><h4>Step 2: Install Helm in your cluster</h4><p>Follow this guide and install Helm.</p><p><a href="https://helm.sh/docs/intro/install/">Installing Helm | Helm</a></p><h4>Step 3: Install Crossplane in your cluster</h4><p>Execute the following commands if you are using Helm 3. Else find the right commands from this <a href="https://crossplane.io/docs/v1.8/getting-started/install-configure.html#install-crossplane">guide</a>.</p><pre>kubectl create namespace crossplane-system<br><br>helm repo add crossplane-stable https://charts.crossplane.io/stable<br>helm repo update<br><br>helm install crossplane --namespace crossplane-system crossplane-stable/crossplane</pre><p>Check the status of Crossplane by</p><pre>helm list -n crossplane-system<br><br>kubectl get all -n crossplane-system</pre><h4>Step 4: Install the Crossplane CLI on your local machine</h4><p>Run the following command</p><pre>curl -sL https://raw.githubusercontent.com/crossplane/crossplane/master/install.sh | sh</pre><p>Run the following command to make it available in your CLI.</p><pre>sudo mv kubectl-crossplane /usr/local/bin</pre><h4>Step 5: Install the relevant provider</h4><p>In this example, we will be using AWS as our cloud provider. Run the following command.</p><pre>kubectl crossplane install provider crossplane/provider-aws:v0.29.0</pre><p>Check whether the installation is healthy.</p><pre>kubectl get providers</pre><h4>Step 6: Creating a provider config to grant access to the provider</h4><p>Here, we will grant access to AWS. For crossplane to talk to AWS, we need to give it the credentials. We do that through the provider config.</p><p>Run aws configure and give the AWS access key and secret in your default profile. 
Next, run the following command.</p><pre>AWS_PROFILE=default &amp;&amp; echo -e &quot;[default]\naws_access_key_id = $<strong>(</strong>aws configure get aws_access_key_id --profile $AWS_PROFILE<strong>)</strong>\naws_secret_access_key <strong>=</strong> $(aws configure get aws_secret_access_key --profile $AWS_PROFILE)&quot; &gt; creds.conf</pre><p>Next, create a secret in Kubernetes and store the credentials.</p><pre>kubectl create secret generic aws-creds -n crossplane-system --from-file=creds=./creds.conf</pre><p>Next, let’s apply the following manifest into Kubernetes. This will create a provider config.</p><pre>apiVersion: aws.crossplane.io/v1beta1<br>kind: ProviderConfig<br>metadata:<br>  name: default<br>spec:<br>  credentials:<br>    source: Secret<br>    secretRef:<br>      namespace: crossplane-system<br>      name: aws-creds<br>      key: creds</pre><p>To apply the above manifest to Kubernetes, use the following command.</p><pre>kubectl apply -f <a href="https://raw.githubusercontent.com/crossplane/crossplane/release-1.8/docs/snippets/configure/aws/providerconfig.yaml">https://raw.githubusercontent.com/crossplane/crossplane/release-1.8/docs/snippets/configure/aws/providerconfig.yaml</a></pre><h4>Step 7: Create an AWS SQS resource</h4><p>Now we are almost there. Apply the following manifest to create an SQS queue in AWS. Make sure your credentials have enough access to create the AWS resource. If you want, get yourself familiarized with the <a href="https://doc.crds.dev/github.com/crossplane/provider-aws/sqs.aws.crossplane.io/Queue/v1beta1@v0.29.0">API doc for SQS</a>.</p><pre>apiVersion: sqs.aws.crossplane.io/v1beta1<br>kind: Queue<br>metadata:<br>    name: my-sqs2<br>    namespace: default<br>spec:<br>    forProvider:<br>        region: us-east-1</pre><p>Now, go to the AWS console, and see whether this resource is created.</p><h4>Step 8: Check for your resources via kubectl</h4><p>Run the following command.</p><pre>kubectl get queue</pre><p>That’s it. We are done.</p><h4>Step 9: Creating other resources</h4><p>Check out the following documentation of your provider to understand how to create other resources in AWS. Simply follow the same steps as above. Create your manifest for the resource and apply it. 
Always make sure to grant the necessary permissions to your credentials.</p><p><a href="https://doc.crds.dev/github.com/crossplane/provider-aws">crossplane/provider-aws</a></p><h3>References</h3><ol><li><a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-drift.html">https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-drift.html</a></li><li><a href="https://blog.crossplane.io/crossplane-vs-terraform/">https://blog.crossplane.io/crossplane-vs-terraform/</a></li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*mc4bJFK7p9FzfKVu.png" /></figure><h4>If this post was helpful, please click the clap 👏 button below a few times to show your support for the author 👇</h4><h4>🚀Developers: Learn and grow by keeping up with what matters, <a href="https://faun.to/8zxxd">JOIN FAUN.</a></h4><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c365386f133" width="1" height="1" alt=""><hr><p><a href="https://faun.pub/infrastructure-drift-detection-with-crossplane-c365386f133">Infrastructure Drift Detection with Crossplane</a> was originally published in <a href="https://faun.pub">FAUN.dev() 🐾</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[All you need to know about Terraform]]></title>
            <link>https://faun.pub/all-you-need-to-know-about-terraform-b4ac21d76562?source=rss-5b64fe2de783------2</link>
            <guid isPermaLink="false">https://medium.com/p/b4ac21d76562</guid>
            <category><![CDATA[terraform]]></category>
            <category><![CDATA[hashicorp]]></category>
            <category><![CDATA[iac]]></category>
            <dc:creator><![CDATA[Keet Malin Sugathadasa]]></dc:creator>
            <pubDate>Sun, 10 Jul 2022 06:18:45 GMT</pubDate>
            <atom:updated>2022-07-15T11:43:05.053Z</atom:updated>
            <content:encoded><![CDATA[<p>Terraform is an infrastructure as code (IaC) tool that lets you define both cloud and on-prem resources in human-readable configuration files that you can version, reuse, and share. It is a free and open-source tool that installs as a single binary to create, manage, and destroy resources in a matter of minutes.</p><p>You can then use a consistent workflow to provision and manage all of your infrastructure throughout its lifecycle. Terraform can manage low-level components like compute, storage, and networking resources, as well as high-level components like DNS entries and SaaS features.</p><p>This article describes the concepts and basics of Terraform and how to use it. It will also help you quickly ramp up on Terraform, covering the fundamentals you would need even for tech interviews.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*XrXy7ya8mOAqyQhSPjLsCw.png" /><figcaption><a href="https://www.terraform.io/intro">https://www.terraform.io/intro</a></figcaption></figure><h3>How does Terraform work</h3><p>Terraform creates and manages resources on cloud platforms and other services through their application programming interfaces (APIs). Terraform has the ability to deploy across multiple platforms.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*fKDkxpJfdv8sTqChvmrhNg.png" /></figure><p>Terraform connects with other platforms through providers (terraform plugins), which are programmed to interact with the provider APIs.</p><p>Terraform allows you to provision resources ranging from</p><ul><li>cloud infrastructure</li><li>networking tools</li><li>monitoring tools</li><li>databases</li><li>version control systems</li><li>and much more</li></ul><p>Terraform uses its own language called HCL (HashiCorp Configuration Language). This is a declarative language and every configuration file (written in HCL) requires the .tf file extension. These configuration files contain code describing the desired infrastructure for our requirements. This is also called the desired state for our infrastructure. What Terraform does is bring the current state of the infrastructure to the desired state, as defined in the configuration files.</p><h3>Terminology</h3><ul><li><strong>configuration files</strong>: these are files written in HCL with the extension .tf. These files describe the desired state of the infrastructure</li><li><strong>desired state (target state)</strong>: how our final infrastructure setup should look</li><li><strong>target environment</strong> → the environment in which the resources maintained by Terraform are running (eg: AWS, Local, Docker…)</li><li><strong>resource</strong> → every object that Terraform manages is called a resource. This can be any database, cloud resource, or even a physical resource on-premise</li></ul><h3>3 Phases in Terraform</h3><ul><li><strong>init → </strong>initializes the project and identifies the providers required for the target environments</li><li><strong>plan → </strong>drafts a plan to get to the target state. This shows a report on the required changes</li><li><strong>apply →</strong> makes the necessary changes in the target environment to bring it to the desired state.</li></ul><p>The apply phase is what makes the actual change in the target environment. 
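In practice, the three phases map to three commands run from the configuration directory; a minimal sketch of the workflow looks like this.</p><pre>terraform init<br>terraform plan<br>terraform apply</pre><p>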
If our resources deviate from the desired state, the subsequent apply calls will bring them back to the desired state, by only updating the deviated resources.</p><h3><strong>Lifecycle of a Resource</strong></h3><p>This is about the lifecycle of a resource that is managed by Terraform. Resources have a strict lifecycle and can be thought of as basic state machines. A resource roughly follows the steps below:</p><ul><li><strong>ValidateResource </strong>→ is called to do a high-level structural validation of a resource’s configuration.</li><li><strong>Diff</strong> → is called with the current state and the configuration. The resource provider inspects this and returns a diff, outlining all the changes that need to occur to the resource.</li><li><strong>Apply</strong> → is called with the current state and the diff. Apply does not have access to the configuration. This is a safety mechanism that limits the possibility that a provider changes a diff on the fly.</li></ul><p>Once Terraform applies the changes to the target environment, it records the state of the infrastructure as it is seen in the real world. This is recorded in a file named terraform.tfstate. The state is a blueprint of the infrastructure deployed by Terraform. We will learn more about this in the sections below.</p><p>Terraform can also import resources created manually or by other IaC tools and bring them under its control, so that it can manage those resources going forward.</p><h3>Terraform Cloud and Enterprise</h3><p>Terraform Cloud and Terraform Enterprise are different distributions of the same application. <a href="https://www.terraform.io/cloud">Terraform Cloud</a> is an application that helps teams use Terraform together. It manages Terraform runs in a consistent and reliable environment, and includes easy access to shared state and secret data, access controls for approving changes to infrastructure, a private registry for sharing Terraform modules, detailed policy controls for governing the contents of Terraform configurations, and more.</p><p>Enterprises with advanced security and compliance needs can purchase <a href="https://www.terraform.io/enterprise">Terraform Enterprise</a>, HashiCorp’s self-hosted distribution of Terraform Cloud. It offers enterprises a private instance that includes the advanced features available in Terraform Cloud.</p><p>Both of these provide the following benefits:</p><ul><li>better collaboration</li><li>improved security</li><li>a centralized UI to manage deployments</li></ul><h3>Installing Terraform</h3><p>You can use the following link to download and install Terraform on your local machine.</p><p><a href="https://www.terraform.io/downloads?_ga=2.246991170.730719665.1657340579-717239105.1655212063">Downloads | Terraform by HashiCorp</a></p><h3>Basics of HCL</h3><pre>&lt;block&gt; &lt;parameters&gt; {<br>    key1 = value1 (these are arguments)<br>    key2 = value2<br>}</pre><ul><li>block → contains the infrastructure platform and the resource it wants to create. 
(eg: resource, variable, data etc)</li></ul><p>See the following example:</p><pre>resource &quot;local_file&quot; &quot;test&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>   content = &quot;This is a test file&quot;<br>}</pre><ul><li>block name → resource</li><li>resource type → local_file (local = provider name which gets downloaded as a plugin from terraform init ; file = resource). Always the first part before the underscore (_) is the provider. The remaining part is the resource.</li><li>provider → local</li><li>resource → file (a component of the provider)</li><li>resource name → test</li><li>arguments → filename and content . Which are inside the block (curly braces {} )</li></ul><h3>Terraform workflow</h3><ol><li>write the configuration file</li><li>run terraform init → downloads the provider plugin</li><li>review the execution plan using the terraform plan command</li><li>then apply via terraform apply command</li></ol><p>During a terraform plan you will receive a report. This indicates which resources will be created, updated, or destroyed.</p><ul><li>plus (+) symbol in the plan indicates that the resource will be created.</li><li>plus and minus (-/+) symbol in the plan indicates that the resource will be destroyed and created.</li><li>minus (-) symbol in the plan indicates that the resource will be destroyed.</li></ul><h3>Updating Resources</h3><p>If we try to update a configuration file and then run a terraform apply, the resource will be deleted and created. This is because the infrastructure is immutable. This will be explained in a later section.</p><h3>Terraform Providers</h3><p>There are 3 tiers of providers</p><ul><li><strong>official providers </strong>— owned and maintained by HashiCorp. Includes the major cloud providers.</li><li><strong>verified provider </strong>— this is owned by a 3rd party technology company that has gone through a partner provider process with HashiCorp. (Heroku, digital ocean, etc)</li><li><strong>community provider</strong> — published and maintained by individual contributors in the HashiCorp community</li></ul><h4>The provider plugin name format</h4><p>There are two formats.</p><p>organizational namespace / type (name of the provider)</p><ul><li>eg: hashicorp/local → (this is the source address identifier used by Terraform to locate and download from the registry).</li></ul><p>hostname / organizational namespace / type (name of the provider)</p><ul><li>eg: registry.terraform.io/hashicorp/local</li></ul><h4>Provider plugin versions</h4><p>When we execute terraform init , these provider plugins get downloaded into the .terraform/plugins folder. With this command, we can see which plugin version has been downloaded.</p><p>We can lock down our configuration files to use a specific version of the provider plugins. New plugin versions may break the code sometimes.</p><p>For this, we need to add an additional code block called terraform. If we don’t specify this explicitly, Terraform downloads the latest version of the plugin.</p><ul><li>Specific Version</li></ul><pre>terraform {<br>   required_providers {<br>      local = {<br>         source = &quot;hashicorp/local&quot;<br>         version = &quot;1.4.0&quot;<br>      }<br>   }<br>}</pre><ul><li>Should not be a specific version. 
(This will use the latest version or previous version of this if this is the latest version)</li></ul><pre>terraform {<br>   required_providers {<br>      local = {<br>         source = &quot;hashicorp/local&quot;<br>         version = &quot;!= 1.4.0&quot;<br>      }<br>   }<br>}</pre><ul><li>A version before a specified version or a version after a specified version.( Can use &lt; or &gt;)</li></ul><pre>terraform {<br>   required_providers {<br>      local = {<br>         source = &quot;hashicorp/local&quot;<br>         version = &quot;&lt; 1.4.0&quot;<br>      }<br>   }<br>}</pre><ul><li>Stricter version constraints with a specific range</li></ul><pre>terraform {<br>   required_providers {<br>      local = {<br>         source = &quot;hashicorp/local&quot;<br>         version = &quot;&gt; 1.2.0, &lt; 2.0.0 , != 1.4.0&quot;<br>      }<br>   }<br>}</pre><ul><li>An incremental version update is allowed. For ~&gt;1.4 the maximum allowed version is 1.9 . For ~&gt;1.4.1 the maximum allowed version is 1.4.9</li></ul><pre>terraform {<br>   required_providers {<br>      local = {<br>         source = &quot;hashicorp/local&quot;<br>         version = &quot;~&gt; 1.4&quot;<br>      }<br>   }<br>}</pre><h4>Multiple Providers</h4><p>We can maintain multiple providers in the same configuration file. Whenever we add/remove a provider or make a version update, we need to run the terraform init command as this command downloads the necessary plugins for the providers.</p><h3>Configuration directory</h3><p>This is where we have the terraform files. You can have any number of .tf files in this directory. This can be considered as the working directory for running Terraform commands. Types of files:</p><ul><li>main.tf : main configuration file containing resource definition</li><li>variables.tf: variable declarations</li><li>outputs.tf: contains outputs from resources</li><li>provider.tf: contains provider definitions</li></ul><h3>Defining variables (input variables)</h3><p>Variables are defined in the variables.tf file. See the structure of this file:</p><pre>variable &lt;variable-name&gt; {</pre><pre>    default = &lt;default-value&gt;</pre><pre>}</pre><p>This variable can be referenced by var.&lt;variable-name&gt; .</p><p>A variable block has three parameters all of which are optional.</p><ul><li><strong>default</strong>: the default value, if a value is not set</li><li><strong>description</strong>: describing the purpose and use of the variable</li><li><strong>type</strong>: string, number, bool, any (default), list, map, set, object, tuple</li></ul><p>string, number, bool, any → are variable types</p><p>list, map, set, object, tuple → are data types</p><h4>list data type</h4><pre>default=[&quot;Mr&quot;,&quot;Mrs&quot;]</pre><p>for lists, we can put type constraints</p><pre>type=list(string)</pre><pre>type=list(number)</pre><h4>Map data type</h4><pre>default={<br>   &quot;key1&quot;=&quot;value1&quot;<br>}</pre><p>to access this,</p><pre>var.map[“key1”]</pre><p>for maps, we can put type constraints for the values. The keys will always be strings.</p><pre>type=map(string)</pre><pre>type=map(number)</pre><h4>Set data type</h4><p>This is similar to lists. Can’t have duplicate elements. 
Can have type constraints</p><pre>type=set(string)</pre><pre>type=set(number)</pre><h4>Object data type</h4><p>complex data structures with a mix of variables and data types.</p><pre>variable &quot;bella&quot; = {<br>  type=object({<br>         name=string<br>         food=list(string)<br>  })<br>}</pre><h4>Tuple data type</h4><p>tuples are similar to lists. but can have different variable types.</p><pre>type=tuple([string, number, bool])<br>default=[&quot;car&quot;, 7, true]</pre><h3>Injecting values to Input Variables</h3><p>if we don’t enter default values to the input variables, the values for the variables will be prompted during the terraform apply command. (interactive)</p><p>If not. we can pass command-line flags in the terraform apply command.</p><h4>Using -var flag</h4><p>terraform apply -var &quot;filename=name&quot; -var &quot;content=this is a file&quot;</p><h4>Using environment variables</h4><pre>export TF_VAR_filename=&quot;name&quot;<br>export TF_VAR_content=&quot;this is a file&quot;</pre><pre>terraform apply</pre><h4>Using variable definition files</h4><p>If we are dealing with a lot of variables, we can use terraform.tfvars variable definition files. The file name should end with .tfvars or .tfvars.json . Example file content:</p><pre>filename=&quot;name&quot;<br>content=&quot;this is a file&quot;</pre><p>The following file formats will be automatically loaded by Terraform</p><ul><li>terraform.tfvars</li><li>terraform.tfvars.json</li><li>*.auto.tfvars</li><li>*.auto.tfvars.json</li></ul><p>Other files need to be manually loaded via</p><p>terraform apply -var-file variables.tfvars</p><h4>Input variable option precedence</h4><p>If we use multiple ways to assign values to the same variable name, there is a precedence. Variable definition precedence</p><ol><li>-var or -var-file (highest precedence)</li><li>*.auto.tfvars or *.auto.tfvars.json (in alphabetical order)</li><li>terraform.tfvars</li><li>Environment variables (lowest precedence)</li></ol><h3>Reference Expressions (using the result of a resource in another resource)</h3><p>Use the output from one resource as the input for another. For each resource type, see what attributes are given as outputs. (mostly id’s). We can reference by</p><p>&lt;resource-type&gt;.&lt;resource-name&gt;.attribute</p><p>Eg:</p><pre>resource &quot;local_file&quot; &quot;test&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>   content = &quot;This is a test file ${random_pet.mypet.id}&quot;<br>}</pre><p>This is a test file ${random_pet.mypet.id} → This is called a reference expression because it references the output of another resource named mypet . With this, there is an implicit dependency assigned as well. The resource test will not be created until the resource mypet is created. This is because the result mypet is required to create the content of the resource test .</p><h3>Resource Dependencies</h3><p>If the output of one resource is used in another, the order of execution is determined by Terraform itself. When resources are deleted, terraform deletes in the reverse order. This is called <strong>implicit dependency.</strong></p><p>But we can also specify the dependencies via <strong>explicit dependency. 
</strong>We should use the depends_on argument for this</p><pre>resource &quot;local_file&quot; &quot;test&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>   content = &quot;This is a test file&quot;</pre><pre>   depends_on=[random_pet.mypet]<br>}</pre><h3>Terraform output variables</h3><p>To save the output in a variable name, we use the output block</p><pre>output &quot;&lt;variable_name&gt;&quot; {<br>   value=&quot;&lt;variable-value&gt;&quot;<br>   ...arguments<br>}</pre><p>Example:</p><pre>output pet-name {<br>   value=random_pet.my-pet.id<br>   description=&quot;some description&quot;<br>}</pre><p>When we run terraform apply, we can see that the output variable is printed on the screen. This is one advantage of using output variables.</p><p>To specifically print these output variables, we can also use terraform outputcommand. To print a specific value terraform output &lt;variable-name&gt;</p><p>The best use case for output variables is to quickly display output variables on the screen. Or can be given out to other scripts for testing or configuration management (eg: to Ansible or shell scripts)</p><h3>Terraform State</h3><p>This is a JSON file by default. terraform apply is the command that will update the state. terraform plan will only read from the state.</p><p>Terraform makes execution plans when there is a difference between the current state (state file) and the desired state in the (configuration files). When terraform creates a resource, it records its identity in the state. Each resource will have a unique ID.</p><p>The state file also tracks metadata details such as resource dependencies. This is useful when resources need to be deleted to identify which order. When destroying resources, the relevant resource configuration will be deleted from the configuration files. Hence, it is impossible to identify the dependency from the deleted code blocks. So, the state file can be used to get this information to determine the destroy order.</p><p>The terraform state file has the following name: terraform.tfstate .</p><p>Another advantage of the state is its performance. If this didn&#39;t exist, terraform would have to compare with the real-world infrastructure and the resources configuration files. Instead, the state is used as the record of truth.</p><p>Terraform stores a cache of attribute values for all resources in the state. We can specify terraform to use the state file alone when running commands and bypass having to refresh the state every time. Use the following command for this.</p><p>terraform plan -refresh=false</p><p>Another benefit is collaboration when working as a team. When we run this locally, the terraform.tfstate file is generated locally. For better collaboration with a team, it is highly recommended to store this in a remote location allowing the state to be shared. A few remote locations for state storage are:</p><ul><li>AWS S3</li><li>Google cloud storage</li><li>HashiCorp Consul</li><li>Terraform Cloud</li></ul><p>We can’t get rid of the state file. It is a mandatory file.</p><h4>Statefile has sensitive information</h4><p>This file contains sensitive information of resources. This includes IP address, hostname, and even file contents. Therefore it needs to be handled with care.</p><p>To make the Terraform configuration files shareable and versioned, it is recommended to store these files in a distributed version controlling system (VCS). 
Eg: Github, Gitlab, Bitbucket.</p><p>But since the state file is sensitive, it should not be saved in a git repository. Instead, save it in a remote backend storage like S3.</p><p>If you want to update the state file, always use terraform state commands instead of doing it manually.</p><h3>Mutable and Immutable infrastructure</h3><p>When the underlying infrastructure remains but a config or software can be updated: we call it Mutable infrastructure.</p><p>During this process, if two servers get updated and one fails, we call this a <strong>configuration drift</strong>. The 3rd one cannot be updated due to an underlying issue. This can make it difficult to plan and carry out subsequent updates</p><p>A solution would be to create a new server with the configuration and delete the old server. This is an immutable infrastructure. Immutability makes it easier to roll back and roll forward between versions. It is simply a matter of recreating. Terraform uses this approach. Terraform destroys and creates new resources. But if we want the other way around, we should use lifecycle rules in that resource block.</p><h3>Lifecycle Rules</h3><p>syntax of a resource block with a lifecycle rule is given in the following examples.</p><h4>create_before_destroy</h4><pre>resource &quot;local_file&quot; &quot;test&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>   content = &quot;This is a test file&quot;</pre><pre>lifecycle {<br>      create_before_destroy = true<br>   }<br>}</pre><h4>prevent_destroy</h4><p>prevent_destroy: terraform will reject any request for this resource to be destroyed during changes made to the configuration with a subsequent apply. But terraform destroy command will still destroy these resources even if this lifecycle rule was set.</p><pre>resource &quot;local_file&quot; &quot;test&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>   content = &quot;This is a test file&quot;</pre><pre>lifecycle {<br>      prevent_destroy = true<br>   }<br>}</pre><h4>ignore_changes</h4><p>ignore_changes rule will prevent from the resource being updated based on a set of attributes we specify (eg: tags).</p><p>For example, if we make this change manually in the cloud, and then run terraform apply, this will try to revert it back as per the configuration. But if we want to ignore it, we can use this command. So if someone changes the tags manually, those changes will be ignored in the subsequent apply commands.</p><pre>resource &quot;local_file&quot; &quot;test&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>   content = &quot;This is a test file&quot;</pre><pre>lifecycle {<br>      ignore_changes = [tags]<br>   }<br>}</pre><p>or you can ignore changes in all attributes by</p><pre>resource &quot;local_file&quot; &quot;test&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>   content = &quot;This is a test file&quot;</pre><pre>lifecycle {<br>      ignore_changes = all<br>   }<br>}</pre><h3>Data Sources</h3><p>This allows Terraform to read resources provisioned outside of its control. (For example, if a shell script creates a resource, or is defined by AWS). If the following file was created manually or by a separate resource provisioning tool, we can make it available as a local_file for Terraform to read.</p><pre>data &quot;local_file&quot; &quot;test&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>}</pre><p>You can have a look at the data sources documentation of the relevant provider to learn more. 
To use the above data source:</p><pre>resource &quot;local_file&quot; &quot;dummy&quot; {<br>   filename = &quot;/user/test.txt&quot;<br>   content = data.local_file.test.content<br>}</pre><p>Resources are managed resources, whereas data sources are read-only data resources.</p><h3>Meta Arguments</h3><ul><li>depends_on (discussed above)</li><li>lifecycle (discussed above)</li><li>for_each</li><li>count</li></ul><h4>for_each</h4><p>Used when creating multiple instances of the same resource. for_each will only work with a map or set. This will not work with a list.</p><pre>variable &quot;filename&quot; {<br>  type=set(string)<br>  default=[&quot;a&quot;,&quot;b&quot;,&quot;c&quot;]<br>}</pre><pre>resource &quot;local_file&quot; &quot;dummy&quot; {<br>   filename = each.value<br>   for_each = var.filename<br>}</pre><p>If we are using a list, we can convert the list to a set by using toset.</p><pre>variable &quot;filename&quot; {<br>  type=list(string)<br>  default=[&quot;a&quot;,&quot;b&quot;,&quot;c&quot;]<br>}</pre><pre>resource &quot;local_file&quot; &quot;dummy&quot; {<br>   filename = each.value<br>   for_each = toset(var.filename)<br>}</pre><p>Now if we try to output the resources, they are stored as a map. This is only when we use for_each.</p><pre>variable &quot;filename&quot; {<br>  type=list(string)<br>  default=[&quot;a&quot;,&quot;b&quot;,&quot;c&quot;]<br>}</pre><pre>resource &quot;local_file&quot; &quot;dummy&quot; {<br>   filename = each.value<br>   for_each = toset(var.filename)<br>}</pre><pre>output &quot;pets&quot; {<br>   value=local_file.dummy<br>}</pre><h4>count</h4><p>When we use count, the resources are created as a list.</p><pre>resource &quot;local_file&quot; &quot;name&quot; {<br>   filename = &quot;/root/user-data&quot;<br>   sensitive_content = &quot;password: S3cr3tP@ssw0rd&quot;<br>   count = 3<br>}</pre><p>This will create the same resource 3 times. All attributes are the same. (eg: Useful when we need multiple instances of a virtual machine)</p><h3>More commands</h3><p>terraform validate : checks whether the syntax used is correct, including whether provider configurations are as per the documentation.</p><p>terraform fmt : formats terraform files.</p><p>terraform show : shows the current state of all the resources. This will show the attributes of the created resources, from the terraform state.</p><p>terraform show -json : to print in a JSON format.</p><p>terraform providers : to see a list of all providers used in the configuration directory.</p><p>terraform output : to print all outputs or a specific variable.</p><p>terraform refresh : to refresh the state with the real-world infrastructure. terraform refresh is automatically run by terraform plan and apply. This can be bypassed by -refresh=false</p><p>terraform graph : visualizes the dependencies in the resource configurations. 
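The command emits the dependency graph in DOT format, so it can be piped straight into a rendering tool to produce an image; a minimal sketch, assuming the dot command-line tool is installed, is shown below.</p><pre>terraform graph | dot -Tsvg &gt; graph.svg</pre><p>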
Output can be visualized using a tool named graphviz</p><p>Use terraform destroy to destroy all resources given in the state file.</p><h3>References</h3><ol><li><a href="https://www.terraform.io/">https://www.terraform.io/</a></li><li><a href="https://kodekloud.com/courses/lab-terraform-for-beginners/?utm_source=youtube&amp;utm_medium=labs&amp;utm_campaign=terraform">https://kodekloud.com/courses/lab-terraform-for-beginners/?utm_source=youtube&amp;utm_medium=labs&amp;utm_campaign=terraform</a></li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/700/0*qca7dVYdIlCHMYu9.png" /></figure><h4>If this post was helpful, please click the clap 👏 button below a few times to show your support for the author 👇</h4><h4>🚀Developers: Learn and grow by keeping up with what matters, <a href="https://faun.to/8zxxd">JOIN FAUN.</a></h4><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=b4ac21d76562" width="1" height="1" alt=""><hr><p><a href="https://faun.pub/all-you-need-to-know-about-terraform-b4ac21d76562">All you need to know about Terraform</a> was originally published in <a href="https://faun.pub">FAUN.dev() 🐾</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Introduction to Infrastructure as Code (IaC)]]></title>
            <link>https://keetmalin.medium.com/introduction-to-infrastructure-as-code-iac-e9b3bb4ef95c?source=rss-5b64fe2de783------2</link>
            <guid isPermaLink="false">https://medium.com/p/e9b3bb4ef95c</guid>
            <category><![CDATA[cloud]]></category>
            <category><![CDATA[iac]]></category>
            <category><![CDATA[infrastructure-as-code]]></category>
            <dc:creator><![CDATA[Keet Malin Sugathadasa]]></dc:creator>
            <pubDate>Sat, 09 Jul 2022 18:23:25 GMT</pubDate>
            <atom:updated>2022-07-09T18:23:25.145Z</atom:updated>
            <content:encoded><![CDATA[<p>Back in the day, the traditional software systems ran on top of on-premise infrastructure, which was managed by a separate team of engineers with their own hardware.</p><p>When a requirement was received, the architects plan the required resources and hand over the requirements to the infrastructure teams. This team will allocate the required hardware resources for this and sometimes purchases the required hardware, allowing the developers to deploy their code once the infrastructure platform is ready.</p><p>Within this entire process, the turnover time was significant, causing the deployments to slow down drastically. Some challenges in the approach were:</p><ul><li><strong>Slow deployment</strong> (certain deployments had specific hardware requirements which had to be provisioned by the infrastructure team. This provisioning and maintenance of infrastructure takes time, causing the deployment velocity to drop down)</li><li><strong>Expensive</strong> (Maintaining your own hardware came in with inherent costs of upgrading them, networking them, and even controlling your own server rooms. All these were simply to cater to a single organizational requirement)</li><li><strong>Limited automation </strong>(Except for a few processes, most works around this provisioning had to be done manually. As the system grows, you would require more and more operations engineers to maintain the hardware resources in your on-premise environment)</li><li><strong>Human error</strong> (Due to the higher number of manual processes, certain pipelines in this process were prone to human errors)</li><li><strong>Wasted resources</strong> (When the architects decide on the required hardware resources, they would make assumptions based on the maximum load capacity expected. With this, they had to allocate additional resources just to cope with the unexpected traffic from users)</li><li><strong>Inconsistencies</strong> (Since most servers are deployed separately, each of them needs software maintenance, system upgrades, and OS-level patching. This would have to be done manually to each of the servers, where there could occur certain inconsistencies when trying these on each server separately)</li></ul><h3>Cloud Service Providers</h3><p>With time, when cloud infrastructure and services came into play, people started moving all their services to the cloud. Cloud computing came to relieve some of the pains you’ve just read about. It frees you from having to build and maintain your data centers and the high cost associated with it. They really didn&#39;t have to worry about infrastructure, because the infrastructure itself was provided as a service to the users. This came with a lesser cost due to their hardware virtualizations and even came with a fancy user interface (aka cloud console) to provide the resources.</p><p>This eliminated most of the problems mentioned above related to on-premise infrastructure, moving the burden of infrastructure maintenance completely to the cloud service provider. With elastic and highly scalable environments, the worry was around adding the required configurations and deploying the right software to the cloud.</p><p>As software systems grew, still the level of manual work to request and provision these resources in the cloud was high. 
Human errors and inconsistencies were high as well since the provisioning had to be done manually.</p><h3>Infrastructure as Code</h3><figure><img alt="" src="https://cdn-images-1.medium.com/max/800/1*h5P02F7WkalfABzy_MKNOg.png" /><figcaption><a href="https://www.spec-india.com/blog/infrastructure-as-code-a-devops-way-to-manage-it-infrastructure">https://www.spec-india.com/blog/infrastructure-as-code-a-devops-way-to-manage-it-infrastructure</a></figcaption></figure><p>As a solution to this, engineers started using the APIs and SDKs provided by the cloud providers, to access and provision resources in the cloud in a programmatic manner. They started writing their own scripts to provision their hardware and update them. Many used shell scripts, python, and other scripting languages for this purpose. This gave birth to the process of writing Infrastructure as Code (IaC).</p><blockquote>Infrastructure as Code (IaC) is the managing and provisioning of infrastructure through code instead of through manual processes.</blockquote><p>When this became a common problem in the industry, several communities started building generalized tools for infrastructure provisioning, which became widely used by engineers all around the world and became a wide community around writing infrastructure as code in a human readable, simple, and high-level language.</p><h3>Version Controlling of your Infrastructure</h3><p>A major advantage of IaC is to provide support for version controlling your infrastructure resources. Technically, this happens to your code which describes the infrastructure. Indirectly it also applies version controlling to the infrastructure provisioned by that code.</p><p>Deploying your infrastructure as code also means that you can divide your infrastructure into modular components that can then be combined in different ways through automation.</p><h3>IaC Tools</h3><p>Some of the common tools available out there are:</p><ul><li>Ansible</li><li>Terraform</li><li>Puppet</li><li>Cloudformation</li><li>Packer</li><li>Saltstack</li><li>Vagrant</li><li>Docker</li><li>And many more…</li></ul><p>These are all addressing the same problem. But IAC can be broadly classified into three types</p><ul><li>Configuration management (eg: Saltstack, Ansible, Puppet)</li><li>Server templating (eg: Docker, Packer, Vagrant)</li><li>Provisioning tools (eg: Terraform, Cloud formation)</li></ul><h4>Configuration Management</h4><p>This was mainly around the required configurations around hardware resources. To install and manage software on existing infrastructure resources. (servers, DBs, networking devices, etc).</p><p>These tools introduce an easy way of writing required configuration as code, so that consistency can be maintained in terms of used configurations in the infrastructure resources.</p><p>This also gives the ability to control versions of configurations and software updates/upgrades allowing quick roll forward and backward of software versions when required with a few lines of code. In an organization with over 100 servers, if this had to be done manually, the engineers would have to go through each server to update the relevant software.</p><p>Another advantage of codifying this process was being idempotent. 
This means that no matter how many times we execute these tools, they will only try to bring the infrastructure into the desired state and not repeat the changes over and over again.</p><h4>Server templating tools</h4><p>These tools are mainly responsible for maintaining custom images, virtual machines, containers, and templates required by infrastructure resources.</p><p>These images and virtual machines come with pre-installed software and dependencies which can be readily used based on the company standards and processes. These are specifically tailored for the company’s software requirements and allow engineers to worry about deploying software rather than installing the underlying dependencies of their servers.</p><p>This makes the infrastructure immutable as well. Once the VM or container is deployed, it will remain as it is, and changes will not be made to it manually. If changes are needed, then we need to make those changes via our server templating tools to update the image and redeploy.</p><h4>Provisioning tools</h4><p>This is about provisioning infrastructure resources with simple declarative code. Based on our required cloud service provider, we simply have to declare which resources we need and these tools will provide them for us via the relevant cloud provider&#39;s APIs/SDKs.</p><p>Terraform is the best-known infrastructure provisioning tool available in the market today; it is vendor-agnostic and provides provider plugins for hundreds of vendors out there.</p><p>On the other hand, CloudFormation is another well-used tool that is specifically built for AWS.</p><h3>References</h3><ol><li><a href="https://www.redhat.com/en/topics/automation/what-is-infrastructure-as-code-iac">https://www.redhat.com/en/topics/automation/what-is-infrastructure-as-code-iac</a></li><li><a href="https://kodekloud.com/courses/lab-terraform-for-beginners/?utm_source=youtube&amp;utm_medium=labs&amp;utm_campaign=terraform">https://kodekloud.com/courses/lab-terraform-for-beginners/?utm_source=youtube&amp;utm_medium=labs&amp;utm_campaign=terraform</a></li><li><a href="https://stackify.com/what-is-infrastructure-as-code-how-it-works-best-practices-tutorials/">https://stackify.com/what-is-infrastructure-as-code-how-it-works-best-practices-tutorials/</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=e9b3bb4ef95c" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Continuous Integration (CI) and Continous Delivery/Deployment (CD)]]></title>
            <link>https://keetmalin.medium.com/continuous-integration-ci-and-continous-delivery-deployment-cd-bb36b250a0c2?source=rss-5b64fe2de783------2</link>
            <guid isPermaLink="false">https://medium.com/p/bb36b250a0c2</guid>
            <category><![CDATA[continuous-deployment]]></category>
            <category><![CDATA[cicd]]></category>
            <category><![CDATA[continuous-integration]]></category>
            <category><![CDATA[continuous-delivery]]></category>
            <dc:creator><![CDATA[Keet Malin Sugathadasa]]></dc:creator>
            <pubDate>Sun, 03 Jul 2022 12:11:17 GMT</pubDate>
            <atom:updated>2022-07-03T12:11:17.406Z</atom:updated>
            <content:encoded><![CDATA[<p>When it comes to a software organization, the primary responsibility of the developers is to implement changes and the primary responsibility of the operations team is to deliver (deploy) the changes to production. This process of sending written code to production involves a set of steps, to ensure proper standards are maintained, the code complies with the security requirements, the changes are versioned efficiently, it is delivered to the suitable clusters and it becomes available to customers with minimal effort and time. Doing this manually every time a change is implemented, could be error-prone, tedious, and time-consuming (often referred to as the integration hell).</p><p>In simple terms, CI/CD would be the part right after development to how it gets delivered to a customer. As an accepted definition, CI/CD refers to the automation of these steps, and it is also often referred to as a pipeline. In this article, I would like to explain the basics of CI/CD, why we have it, and how it helps automate the operation side of the software.</p><p>Here, I have separated the CD as Delivery and Deployment for clarity. But in general, these are both interchangeably used to represent both terms.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*v57IL3HDCCgbowOLbzFVkw.png" /><figcaption>Ref: <a href="https://betterprogramming.pub/devops-in-flutter-734cb268d7db">https://betterprogramming.pub/devops-in-flutter-734cb268d7db</a></figcaption></figure><h3>What is CI/CD</h3><p>CI/CD is the process of introducing automation to frequently deliver application changes to production. The term CI stands for “Continuous Integration” and the term CD stands for both “Continuous Delivery” and/or “Continuous Deployment”.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*CxWsxm4BpRbV8YCDppKLdw.png" /><figcaption>Ref: <a href="https://www.redhat.com/en/topics/devops/what-is-ci-cd">https://www.redhat.com/en/topics/devops/what-is-ci-cd</a></figcaption></figure><p>CI/CD embodies a culture, operating principles, and a set of practices that application development teams use to deliver code changes more frequently and reliably. This allows organizations to ship software quickly and efficiently.</p><p>Let’s have a look at what CI and CD mean, as separate terms. I would also like to separately explain the two sides of CD (Delivery/Deployment). It’s important to understand each separation so that the right set of tools and processes can be picked to achieve this.</p><h4>Continuous Integration (CI)</h4><p>This is a set of practices that drive development teams to frequently implement small code changes and check them, into a version control repository. In many organizations, code changes are often committed to a version control system (eg: git). With these changes, it is important to collaborate with other changes and the other developers working with the same code base. Once the necessary changes are reviewed, they are merged into a common branch often referred to as the master/main branch. This branch would be the final code, that gets deployed to production.</p><p>Therefore, it is important to make sure that the changes being merged to the final branch, are working, reliable, and acceptable. 
These changes need to be verified, reviewed, and tested before merging into the final branch.</p><p>Now imagine having many changes, or even having huge changes, that take a longer time to review and test, whereas frequent changes would make this even harder. This hinders the process of delivering code changes to production. Continuous Integration (CI) tries to solve this problem by automating this entire process.</p><p>Of course, there is some work up front to set this up and add the relevant unit and integration testing. But it would make the life of developers and operations teams much easier and reduces the number of human errors that could occur. In general, the following areas are covered in CI</p><ul><li>Building the code</li><li>Testing the code</li><li>Merging the code (to a final branch)</li></ul><p>When a developer sets up a pull request (or a merge request), the build process kicks in. Once it is built, the testing will run on top of it. Once it is completed, it would be ready to merge.</p><p>See the following high-level setup of CI in your code base.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*UBxWJRsayWUvS7VNXpvqTA.png" /></figure><p>Some common tools, available out there for you to achieve CI, are:</p><ul><li>Github actions</li><li>Gitlab pipelines</li><li>Jenkins</li><li>Travis CI</li><li>Circle CI</li><li>Etc</li></ul><h4>Continuous Delivery (CD)</h4><p>Once the builds are tested and verified in CI, those builds need to move to a repository, artifact store, or registry. A common example would be a Docker Image Registry. The goal of continuous delivery would be to get a build ready for deployment in a suitable format. This is often referred to as a production-ready build.</p><p>This also contains versioning, but it is often referred to as releases of the build.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*EB8UJtDnG2MzBzS4MfIv7A.png" /></figure><p>Some common tools, available out there for you to achieve CD (Delivery), are:</p><ul><li>Github actions</li><li>Gitlab pipelines</li><li>Jenkins</li><li>Etc</li></ul><p>Next, let’s have a look at what Continuous Deployment means. Please note that Delivery and Deployment are often used interchangeably as CD to represent both.</p><h4>Continuous Deployment (CD)</h4><p>This is the final stage of a CI/CD pipeline. This stage is all about releasing a production-ready build to production. In other words, this is about making the changes available to the customer. In the end stage, the developers&#39; efforts will go live at this point.</p><p>In most cases, this happens in multiple stages. First, the release happens in a staging environment. With a manual trigger, it gets released to the production environment. When it comes to CD, there are many deployment models out there, that make the releases smooth and seamless. For example, we may want to deploy the changes to a set of Beta customers in production and then roll it out to everyone in production. These deployment models are a topic for another article.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yutX7QxZjmF4d8evTqBuPQ.png" /></figure><p>Some common CD tools are:</p><ul><li>Spinnaker</li><li>Go CD</li><li>Argo CD</li><li>Concourse</li><li>Screwdriver</li><li>Etc</li></ul><h3>CI/CD Fundamentals</h3><p>For your CI/CD process to be efficient, it is important to include the following fundamentals when building your CI/CD process for each application. 
There are no hard and fast rules as to what is the right way of implementing your CI/CD pipelines, but this would guide you towards building a more robust set of pipelines for a smoother delivery of applications.</p><ul><li><strong>A single source repository</strong>: having a code base for each build would serve the purpose of a repository. This will contain the build files, test scripts, and the actual feature code. Some teams maintain a monolithic repo, doing multiple builds from the same code base. All these drills down to how the system is designed.</li><li><strong>Frequent commits:</strong> Having focused and small commits in a VCS repo, would help in quicker deliveries and cleaner changes. Multiple commits to the repository results in fewer places for conflicts to hide. Make small, frequent iterations rather than major changes. By doing this, it’s possible to roll changes back easily if there’s a problem or conflict.</li><li><strong>Automated Builds</strong>: This is to support the CI process by having scripts, allowing automated builds when triggered.</li><li><strong>Testable code</strong>: This also includes test scripts along with having a code base that is testable. Unit tests and integration tests will help determine issues in the CI phase itself.</li><li><strong>Deployment controls</strong>: This refers to the way the code should be deployed in production. Some teams use blue/green deploys, canary deploys, and traffic vectoring to reduce risk during deploys. This can actually take many shapes as people interpret the idea differently to suit their needs</li></ul><h3>Benefits of CI/CD for your organization</h3><ul><li><strong>Satisfied users:</strong> Fewer bugs and errors make it into production, so your users and customers have a better experience. This leads to improved levels of customer satisfaction, confidence, and reputation.</li><li><strong>Accelerated time-to-value:</strong> When you can deploy anytime, you can bring products and new features to market faster. Your development costs are lower, and a faster turnaround frees your team for other work. Customers get results faster and gain a competitive edge.</li><li><strong>Less fire fighting:</strong> Testing code more often, in smaller batches, and earlier in the development cycle can seriously cut down on fire drills. This results in a smoother development cycle and less team stress. Results are more predictable, and it’s easier to find and fix bugs.</li><li><strong>Hit dates more reliably:</strong> Removing deployment bottlenecks and making deployments predictable can remove a lot of the uncertainty around hitting key dates. Breaking work into smaller, manageable bites means it’s easier to complete each stage on time and track progress. This approach gives plenty of time to monitor overall progress and determine completion dates more accurately.</li><li><strong>Free up developers’ time:</strong> With more of the deployment process automated, the team has time for more rewarding projects. It’s estimated that developers spend between 35% and 50% of their time testing, validating, and debugging code. By automating these processes, developers significantly improve their productivity.</li><li><strong>Less context switching:</strong> Getting real-time feedback on the code developers commit makes it easier to work on one thing at a time and minimize <a href="https://techbeacon.com/app-dev-testing/forget-monoliths-vs-microservices-cognitive-load-what-matters">cognitive load</a>. 
By working with small sections of code that are automatically tested, developers can debug code quickly while their minds are still fresh from programming. Finding bugs is easier because there’s less code to review.</li><li><strong>Reduce burnout:</strong> Research shows that continuous delivery measurably reduces deployment pain and team burnout. Developers experience less frustration and strain when working with CI/-CD processes. This directly leads to happier and healthier employees and less burnout.</li><li><strong>Recover faster:</strong> CI/CD makes it easier to fix issues and recover from incidents (MTTR). Continuous deployment practices mean frequent small software updates so when bugs appear, it’s easier to pin them down. Developers have the option of fixing bugs quickly or rolling back the change so that the customer can get back to work quickly.</li></ul><h3>References</h3><ol><li><a href="https://about.gitlab.com/topics/ci-cd/">https://about.gitlab.com/topics/ci-cd/</a></li><li><a href="https://www.synopsys.com/glossary/what-is-cicd.html">https://www.synopsys.com/glossary/what-is-cicd.html</a></li><li><a href="https://www.infoworld.com/article/3271126/what-is-cicd-continuous-integration-and-continuous-delivery-explained.html">https://www.infoworld.com/article/3271126/what-is-cicd-continuous-integration-and-continuous-delivery-explained.html</a></li><li><a href="https://www.redhat.com/en/topics/devops/what-is-ci-cd">https://www.redhat.com/en/topics/devops/what-is-ci-cd</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=bb36b250a0c2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[What is OpenID Connect (OIDC)?]]></title>
            <link>https://keetmalin.medium.com/what-is-openid-connect-oidc-fd1b5715098f?source=rss-5b64fe2de783------2</link>
            <guid isPermaLink="false">https://medium.com/p/fd1b5715098f</guid>
            <category><![CDATA[oauth2]]></category>
            <category><![CDATA[oidc]]></category>
            <category><![CDATA[openid-connect]]></category>
            <category><![CDATA[openid]]></category>
            <dc:creator><![CDATA[Keet Malin Sugathadasa]]></dc:creator>
            <pubDate>Fri, 12 Jun 2020 17:47:05 GMT</pubDate>
            <atom:updated>2020-06-12T17:47:05.467Z</atom:updated>
            <content:encoded><![CDATA[<figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*yE8mNG8GDak9WLb59UmYpw.png" /><figcaption>OpenID Connect Cover Picture</figcaption></figure><p>OpenID Connect is a simple authentication protocol, built on top of the OAuth2 protocol as a separate identity layer. OAuth2 is an authorization protocol, which is being extended by the OIDC, to implement its authentication mechanism. OIDC allows the applications to authenticate and verify the end-users based on the authentication performed by an Authorization Server, which supports OIDC. This also allows the application to obtain basic profile information, about the end-user in an interoperable and REST-like manner. It uses straightforward REST/JSON message flows with a design goal of “making simple things simple and complicated things possible”.</p><blockquote><strong>(Identity, Authentication) + (OAuth 2.0) = (OpenID Connect)</strong></blockquote><iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FKb56GzQ2pSk%3Ffeature%3Doembed&amp;display_name=YouTube&amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DKb56GzQ2pSk&amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FKb56GzQ2pSk%2Fhqdefault.jpg&amp;key=a19fcc184b9711e1b4764040d3dc5c07&amp;type=text%2Fhtml&amp;schema=youtube" width="640" height="480" frameborder="0" scrolling="no"><a href="https://medium.com/media/31393ee30b8d86544316874322c9338f/href">https://medium.com/media/31393ee30b8d86544316874322c9338f/href</a></iframe><p>The OpenID Connect protocol is very flexible in which it gives the power to the client, to easily customize the authentication process according to their needs. OIDC gives the power to clients of all types, including Web-Based, mobile, and JavaScript clients, to request and receive information regarding the authenticated sessions and end-users. The main extensible features provided by the OIDC protocol are,</p><ol><li>Encryption of Identity Data</li><li>Discovery of OpenID Providers</li><li>Session Management</li></ol><p>This blog provides an abstract idea of what OpenID Connect is and, its important specifications.</p><h3>History of OpenID Connect</h3><p>OpenID Connect is the third generation of OpenID Technology. The original OpenID authentication protocol was developed by <a href="http://en.wikipedia.org/wiki/Brad_Fitzpatrick">Brad Fitzpatick</a> in May 2005. This was more like a visionary tool which never got much commercial adoption, but it got people to think of its possibilities and extensions. In the meantime, OpenID and OAuth were focused on two different aspects of an internet identity, whilst OpenID played the role of authentication, whereas the OAuth played the role of authorization. Since these two extensions were playing a huge role in each of its domains, the need to combine both these protocols arose.</p><p>As the second generation of OpenID, it came as an extension for OAuth, which was named as OpenID 2.0. This was better than the earlier version, and it provided much more security and worked seamlessly when implemented properly. Even though it had some design limitations, the implementation of OpenID 2.0 was fully thought through. The third generation of OpenID is the “OpenID Connect.” Unlike OpenID 2.0, this was built on top of OAuth 2.0 as a separate identity layer. The “OpenID Connect’s goal is to be much more developer-friendly and providing a wide range of use cases where it can be implemented. 
Currently, this has been very successful and deployments are happening on huge scales.</p><h3>OpenID Connect vs OpenID 2.0</h3><p>The functionalities available in the OIDC and OpenID 2.0 are pretty much the same whereas the OIDC provides much more API-friendly and usable implementations for native mobile applications. “OpenID Connect” defines optional capabilities for robust signing and encryption. To integrate OpenID 2.0 and OAuth 1.0, we require an extension, whereas in OIDC, OAuth 2.0 protocols, OAuth 2.0 functionalities are integrated within the protocols itself.</p><p>OpenID 2.0 used XML and custom message signature schemes that in practice, sometimes proved to be difficult for developers to implement. But in OAuth 2.0, the OIDC outsources the necessary encryption to the web’s built-in TLS (also called HTTPS or SSL) infrastructure, which is universally implemented on both client and server platforms. OIDC uses standard JSON Web Tokens (JWT) when signatures are required. Since JWT is more familiarized and easier to use, this makes OIDC dramatically easier for developers to implement, and practically has resulted in much better inter-operability.</p><h3>About the OpenID Foundation (OIDF)</h3><p>The OpenID Foundation was formed in June 2007, and it is an international non-profit organization of individuals and companies committed to enabling, promoting, and protecting OpenID technologies. The OIDF serves as a public trust organization representing the open community of developers, vendors, and users.</p><p>This foundation provides a much-needed infrastructure to the community and helps in promoting and expanding OpenID technologies. This entails managing intellectual property and brand marks as well as fostering viral growth and global participation in the proliferation of OpenID.</p><h4>What Companies &amp; People involved in the development of OIDC?</h4><p>Contributors include a diverse international representation of the industry, academia and independent technology leaders: AOL, Deutsche Telekom, Facebook, Google, Microsoft, Mitre Corporation, mixi, Nomura Research Institute, Orange, PayPal, Ping Identity, Salesforce, Yahoo! Japan, among other individuals and organizations.</p><h3>Mobile Network Operators and OpenID Connect</h3><p>In the modern digital era, we can see a considerable increase in the number of users using online services via mobile devices, and due to this reason, there is an increase in identity thefts all around the world. The <a href="http://www.gsma.com/">GSMA</a> created a valuable business proposal for Mobile Network Operators so that they can join hands with OIDC to implement and render many services to its customers. This business model states that Mobile Network Operators, short for MNOs, with their differentiated identity and authentication assets, have the ability to provide sufficient authentication to enable consumers, businesses, and governments to interact in a private, trusted, and secure environment and enable access to services.</p><p>MNOs increasingly are interested in identity services currently being used online (i.e. 
<h4>End-user Grants Authorization</h4><p>If the End-User grants the access request, the Authorization Server issues a code and delivers it to the Client by adding the following query parameters to the query component of the redirection URI.</p><p><strong>Mandatory Parameters</strong></p><ul><li><strong>code</strong>: <em>Authorization Code</em></li><li><strong>state</strong>: <em>OAuth 2.0 state value</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*GGvCdAnyyyIS88cwwkIB1g.png" /><figcaption>Redirecting to the provided URL</figcaption></figure><h4>Contact the Token Endpoint</h4><p>A Client makes a Token Request by presenting its Authorization Grant (in the form of an Authorization Code) to the Token Endpoint (a sketch of this request appears after the parameter list).</p><p><strong>Mandatory Parameters:</strong></p><ul><li><strong>grant_type</strong>: <em>The type of grant being submitted</em></li><li><strong>code</strong>: <em>The Authorization Code received in the authentication response</em></li><li><strong>redirect_uri</strong>: <em>Must match the redirection URI used in the authorization request; serves as an extra level of security</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*W8MYswUqXHsrVkRnXiox8A.png" /><figcaption>Sample request to the token endpoint</figcaption></figure>
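<p>Here is a minimal sketch of that token request, assuming the same placeholder endpoint and client registration as above. Confidential clients also authenticate to the Token Endpoint; sending the client secret in the request body is one option.</p><pre>import json
import urllib.parse
import urllib.request

# Placeholder values for an assumed client registration.
TOKEN_ENDPOINT = "https://accounts.example.com/token"
CLIENT_ID = "my-client-id"
CLIENT_SECRET = "my-client-secret"
REDIRECT_URI = "https://client.example.org/callback"

def exchange_code_for_tokens(code):
    """POST the Authorization Code to the Token Endpoint and return the parsed JSON response."""
    body = urllib.parse.urlencode({
        "grant_type": "authorization_code",
        "code": code,
        "redirect_uri": REDIRECT_URI,
        "client_id": CLIENT_ID,
        "client_secret": CLIENT_SECRET,   # client_secret_post style client authentication
    }).encode()
    request = urllib.request.Request(
        TOKEN_ENDPOINT,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

# Example usage once the redirect back from the Authorization Server arrives:
# tokens = exchange_code_for_tokens("code-from-the-redirect")
# tokens["access_token"], tokens["id_token"], tokens.get("refresh_token")</pre><p>The JSON returned by this call carries the fields described in the next section.</p>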
<h4>Client Receives Token</h4><p>The Token Endpoint responds with a JSON object containing the following fields.</p><p><strong>Mandatory Parameters</strong></p><ul><li><strong>access_token</strong>: <em>Access Token for the UserInfo Endpoint</em></li><li><strong>token_type</strong>: <em>OAuth 2.0 Token Type value</em></li><li><strong>id_token</strong>: <em>ID Token</em></li></ul><p><strong>Optional Parameters</strong></p><ul><li><strong>expires_in</strong>: <em>Expiration time of the Access Token in seconds since the response was generated</em></li><li><strong>refresh_token</strong>: <em>Refresh Token</em></li></ul><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*7ooJSkYhzisps_FUK3OMcg.png" /><figcaption>Sample response to the client with a token</figcaption></figure><h4>Contact User Info Endpoint</h4><p>Clients send requests to the UserInfo Endpoint to obtain Claims about the End-User, using an Access Token obtained through OpenID Connect authentication. The request SHOULD use the HTTP GET method and the Access Token SHOULD be sent using the Authorization header field.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*I6GUYhR34uSwq0IowDvSCA.png" /><figcaption>Sending a GET request to the user-info endpoint</figcaption></figure><h3>References</h3><ol><li><a href="http://blog.facilelogin.com/2014/10/brief-history-of-openid-connect.html">http://blog.facilelogin.com/2014/10/brief-history-of-openid-connect.html</a></li><li><a href="http://en.wikipedia.org/wiki/OpenID">http://en.wikipedia.org/wiki/OpenID</a></li><li><a href="http://openid.net/">http://openid.net/</a></li></ol><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=fd1b5715098f" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>