SW Engineering 101
An Engineer's Dilemma in Growing a Product
There is Economics 101 and Computer Science 101, but nothing for SW Engineering. What are the best practices in this industry? Is it all buzzwords and technologies that keep changing fast? The majority of people who head SW product development are themselves not from the trenches. Treating this as just another industry, another virtual factory floor with a count of resources, is a step towards failure or huge time and cost overruns.
This is a very informal take on the subject; an effort to list the main themes without going too deep, as the first principles matter the most.
As always, there are three things to talk about: People, Process and Technology.
People following some Process and using Technology to create a Product or give a Service.
Less is More
People matter a lot in SW Engineering. It is not a resource that gives homogenized output like in other industries. Less is more here and more is usually a bigger mess.
“Adding manpower to a late software project makes it later.”
― Frederick P. Brooks Jr., The Mythical Man-Month: Essays on Software Engineering
There could be many reasons for this. One obvious but usually underestimated reason is communication. Even though everyone uses the same language, what one writes or says is hardly ever understood in the same way. Communication is coloured by the context of those involved and understood through that context, some of which could be a mix of race, sex, title, etc. This is why communication is mostly ineffective: very few judge things on their merits or demerits, but rather by who or where they come from; sometimes the mere fact that something is written down is good enough. In SW Engineering many ideas have become popular this way, usually with their context and import taken in a completely wrong way. One example: “Premature optimization is the root of all evil” (myth busted here).
You may wonder how large organisations like the Army or the Postal Service thrive. The truth is that these are heavily subdivided and work on non-overlapping goals. Overlapping goals would mean overlapping communication, and unless that communication is kept extremely simple, like attack or retreat, it is partly misinterpreted, partly ignored and partly understood.
“Men and months are interchangeable commodities only when a task can be partitioned among many workers with no communication among them (Fig. 2.1). This is true of reaping wheat or picking cotton; it is not even approximately true of systems programming.”
― Frederick P. Brooks Jr., The Mythical Man-Month: Essays on Software Engineering
Amazon’s famous two-pizza team comes to mind, and it is why software these days is composed of services/microservices created independently by teams, with nothing shared — no overlapping code, except unambiguous interfaces.
Culture over Process
“Jeff, what does Day 2 look like?” That’s a question I just got at our most recent all-hands meeting. I’ve been reminding people that it’s Day 1 for a couple of decades. I work in an Amazon building named Day 1, and when I moved buildings, I took the name with me. I spend time thinking about this topic.
Day 2 is stasis. Followed by irrelevance. Followed by excruciating, painful decline. Followed by death. And that is why it is always Day 1….”
(from Amazon.com founder Jeff Bezos’s annual shareholder letter)
Even when a company or project is started by the most passionate people, over time mediocrity will engulf it. The principles are forgotten; the jobbers take over. Your company or project team will eventually reflect the state of the world. Day 2 will come. You can put in Process and Technology to handle Day 2. This will help to some extent, but the key to success, the message hidden in plain sight in the above words of Jeff Bezos and others, is Culture. Good SW leadership is primarily about nurturing the right culture.
Process and Technology can be copied; culture cannot. It has to be nurtured. Before we analyse how a good culture can be nurtured, we need to understand what type of culture a SW Engineering organisation needs.
The Design mindset
At the start I wrote thus — People following some Process, using Technology to create a Product or Service; but maybe one part can be added here to make explicit an often overlooked aspect of SW Engineers.
People following some Process, to Design a Product or Service, using Technology.
The keyword is Design, which implies creativity, skill and passion. Making a new copy of software that is already written does not need humans anymore. To create, maintain and extend — that is what people are needed for, and Design in some form is needed to do this.
Great designer-programmers created many of the most popular algorithms; yet the same algorithms can be implemented as inefficiently as possible by worse programming (there are no bad programmers, only bad programs).
Every system is a product of design, and every system is unique in one way or another. You cannot get a Dijkstra to design your every system. We do, however, need programmers who can Design, not just implement.
This is a trait that is inherent in all but rarely incentivized, hence rarely developed.
An average enterprise programmer has little appreciation for design, as SW design is usually hidden and gets revealed only after a long time. There is no immediate extra incentive for good design; no one goes silly over its curves, as they do in HW design, made famous now by Apple’s designers.
However, a good design is the bedrock for quick innovations and fast extensions.
Many managers do not understand this, as they never check the code, take an interest in discerning it, or promote and reward such people and behaviour.
Usually, they incentivise those who push out features very fast without looking at the long-term impact; and when it breaks, the same people play the fire-fighting hero, solving problems under the customer’s fiery breath, and gain even more appreciation. A vicious cycle: living on the edge.
The Quality Mindset
There is a school of thought in non-critical SW systems that Quality is overrated. Some think that quality equates to price. But many high-quality things are commodity-priced: Nokia phones, Toyota cars, etc. Quality is a very vague term; you can write books on it — I recommend Zen and the Art of Motorcycle Maintenance, which is a journey to find the meaning of quality.
Over-engineering is one element of Quality: giving much more bang than the buck, surprising people positively. This creates enduring brands.
The evident absence of Quality in SW engineering is technical debt, made visible by static code analysis reports and coverage reports. But the moment these are set as base targets, the targets will be complied with. One needs to understand that code which passes the analysis and has hundred per cent coverage can still be of very low quality. This is the problem with setting targets: once they are achieved, the quality of the code and the tests will never be questioned. In such cases, technical debt is harder to spot — evident only during review as Code Smells or Design Smells.
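To make the point concrete, here is a minimal sketch (with a hypothetical `apply_discount` function) of how a test can execute every line of code — so a coverage tool happily reports 100% — while asserting nothing at all:

```python
# A hypothetical price-calculation function and a "test" that executes
# every line of it -- 100% coverage -- yet checks nothing useful.
def apply_discount(price, percent):
    """Return price reduced by the given percentage."""
    return price - price * percent / 100

def test_full_coverage_no_value():
    # Every line of apply_discount runs, so coverage says 100%...
    apply_discount(100, 10)  # ...but the result is silently discarded

def test_actually_tested():
    # A real test pins down behaviour; this is what a coverage target
    # cannot measure.
    assert apply_discount(100, 10) == 90
    assert apply_discount(0, 50) == 0

test_full_coverage_no_value()
test_actually_tested()
```

Both tests count identically towards the target; only a human review catches the difference.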
One reason people believe Quality is overrated in SW is that the product itself won’t survive if it is not released in time, especially for non-critical SW. This is a valid argument: iterative development, iterative release. A good SW Engineer should know where to draw the line. The choice of how much to release, and how fast, also needs intelligent thought; but that does not mean it should be released with low Quality. The Google Chrome browser story is an example. The renowned SW Engineer Martin Fowler had this to say:
High-quality software is actually cheaper to produce. https://martinfowler.com/articles/is-quality-worth-cost.html
Right now there are huge enterprises that are not exactly SW product houses, but whose products have as much SW in them as specialised HW, and who are bleeding huge amounts of cash for want of the Quality mindset. Companies like Boeing are visible because of the negative publicity. There are a lot more out there hiding in plain sight. One smell is releases getting postponed.
There is no dearth of literature or techniques to avoid this pitfall. But management ‘suits’ underplay the incredible damage that ‘rank and file’ SW developers can do if Quality culture is not stressed enough. The intention is not to disparage ‘suits’ but to highlight common management psychology. Technical leadership needs to be groomed. It is as important as management leadership.
Expectations and Toleration
In SW Engineering, as in life, it is not that one does not know what is right; what matters most is what you tolerate, or rather how much you tolerate.
Here are the words of a special-forces soldier, from his management book Extreme Ownership:
“When setting expectations, no matter what has been said or written, if substandard performance is accepted and no one is held accountable — if there are no consequences — that poor performance becomes the new standard.”
“It’s not what you preach, it’s what you tolerate.”
Jocko Willink, Extreme Ownership: How U.S. Navy SEALs Lead and Win
SW Engineering is a field that needs high levels of skill, yet the entry barrier to the industry is pretty low. Here is this conundrum from one of the most celebrated SW Engineers of all time, Dijkstra:
I came to the conclusion that programming should be considered as one of the most difficult branches of applied mathematics, because it is also one of the most difficult branches of engineering, and vice versa. Obvious as this conclusion is now, it was at the time a hard conclusion to arrive at, for it was then very unpopular. It was a conclusion that met violent opposition when I tried to share it with others. The usual form of rebuttal was pointing to the hundred thousands of people then employed as programmers, the majority of whom had at best third rate intellects — it was a profession that had attracted many drop-outs and people without any formal training at all — : in view of that mediocre multitude the opinion that programming was among the most exacting of all human endeavours was clearly sheer nonsense! https://www.cs.utexas.edu/~EWD/transcriptions/EWD05xx/EWD533.htm
The only way to make sense of this is to understand that it is not in fact limited to SW Engineering but holds in almost every other field. Here is the thought echoed by a Nobel prize winner:
“One could not be a successful scientist without realizing that, in contrast to the popular conception supported by newspapers and mothers of scientists, a goodly number of scientists are not only narrow-minded and dull, but also just stupid.” ― James D. Watson, The Double Helix
In SW Engineering, because of its wide scope, this is even more prevalent. Success never comes to those sitting in the middle, yet most enterprises are managed such that everyone would like to sit in the middle. No one wants to be intolerant of inefficiency; there is a huge incentive not to shake the apple-cart. As is said — there are choices in everything you do, and in the end, the choices you make make you.
It is best to keep the Agile manifesto in mind when we talk about Process, as this is the easiest thing to go overboard with:
“Individuals and Interactions over Process and Tools” — The Agile manifesto
Here is the thing one needs to understand about Process: all processes will fail, many will be misused spectacularly, then a newer process is created, and the cycle continues. When there are People problems, the obvious solution seems to be Process. But giving too much weight to Process is usually an anti-pattern. Let us start by clearing the air around one of the most misused words in SW Engineering.
Agile is not a Process
Agile is an anti-process, and hence it can neither fail nor succeed. You can be agile; you cannot do agile. This is oft-repeated by the very people who made it popular, but it is drowned in the din of the Agile sell.
This is a pattern seen everywhere: the founder becomes the outcast. Those who claim to be followers and self-appointed representatives of an ideology strip it of its spirit and turn it into a dogma that is preached but practised in totally contrarian ways. A faking.
Scrum is an agile practice, and so is Kanban; you can follow a Scrum process but not an ‘Agile process’. Agile as a noun is the problem; as an adjective it is fine.
Analysis before Implementation
Write an Analysis Document. Split it into User Stories to reduce cognitive load and to enable iterative development. Write an Acceptance Test case for each User Story during Analysis. Agilers may complain of too much overhead. Don’t get carried away by current fads; stand your ground. It is harder to Think than to Do. It is harder still to write your thoughts down (and mostly to see how absurd they look on paper).
Analysis ensures enough thought; ensures less work, and much less re-work.
Prototyping is a tool for effective Analysis; a means to test out assumptions. For complex topics, it is a must. It uncovers the parts that cannot be thought through. It makes estimates better. The prototype should be discarded, not taken into production.
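An acceptance test written at analysis time can be sketched like this — the story, the `Cart` class and its behaviour are all hypothetical, and at analysis time a stub like this is enough to make the expectation executable:

```python
# Hypothetical user story: "As a shopper I can add an item to my cart
# and see the running total."  The acceptance test is written during
# analysis, before the real implementation exists; a stub makes it run.
class Cart:
    def __init__(self):
        self.items = []

    def add(self, name, price):
        self.items.append((name, price))

    def total(self):
        return sum(price for _, price in self.items)

def acceptance_test_add_item_shows_total():
    cart = Cart()
    cart.add("book", 12.50)
    cart.add("pen", 1.50)
    assert cart.total() == 14.0

acceptance_test_add_item_shows_total()
```

The value is not the stub but the act of pinning the story down to something unambiguous before implementation begins.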
An estimate is an Estimate — nothing more
It seems silly when a PM takes some numbers that came out of an engineer’s or architect’s head and sets them in stone. I read somewhere that an estimate is an estimate and nothing more. Current systems are usually too complex to be fully thought through, analysed or tested through. Could the flight crash have been prevented? Or is it a sign that the best technology, process or people cannot fix broken first principles? And what are those first principles? Everyone knows them, but it is easy to forget or to underestimate risks: “Do the right thing always” and “Only the paranoid survive”.
What struck me, coming from the waterfall way of working, was that instead of filling a tool with the hours worked daily, in Scrum it was the hours remaining to complete a task that counted — the burndown chart. And that is profound. Many still have not got it; they estimate X hours and every day dutifully ‘burn down’ X minus 8.25 hours. This is exactly how people do things without understanding the principle. There is no value in how much time you worked; all that matters is how much time you think it will take to finish the task. Usually, what was estimated to be simple becomes the most complex when worked on.
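The difference is easy to show with made-up numbers (a hypothetical 40-hour task over four days):

```python
# The wrong way: burn down mechanically, estimate minus hours worked.
initial_estimate = 40
mechanical = [initial_estimate - 8 * day for day in range(1, 5)]
# -> [32, 24, 16, 8]: always "on track", carries no information.

# The right way: each day, re-judge how much work is actually left.
# Here the task turned out harder than thought, so remaining *grows*
# on day two -- an early, honest warning sign.
rejudged = [36, 44, 30, 12]
```

The mechanical chart says we are nearly done regardless of reality; the re-judged one warned everyone on day two.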
People are used to estimates that are near actuals in daily life: servicing a car, painting a house, interior design, etc. But this industry deals not with physical parts that interact or go bad in known ways, but with constructs of the mind, so the complexity can be immense. Most estimates for new things are way off — sometimes years off. Only after a prototype is there a chance of a better estimate.
Iterative Development, Iterative Release
Even if you are designing a telecom system, iterative development helps; slipping a number of near-term deadlines gives everyone fair warning and time for corrective action. This is much better than missing the final deadline without any warning and with no idea of how much longer it will take.
Do Scrum if you or your team like it; or use the method Toyota uses in its factories and even in service centres — Kanban: visibility of work in progress. Or maybe nothing; skip all the useless ceremonies. It actually does not matter much, if you keep good communication lines open with all.
Do note that asking people to do something and following up rigorously will give you only what you asked for. Leaving people to themselves will usually surprise you pleasantly. Keep this in mind; challenge, but do not overdo it.
The key here in iterative development is breaking big tasks into smaller tasks — “User Stories” and releasing them also iteratively.
Testing is part of development. Test code is code. Test-Driven Development was popularised again by Kent Beck in Extreme Programming. It is basically a way of making the code testable by writing test cases during development, while the code and its structure or design are still malleable. The other way is to test when the code is completely done, by which time it has taken on a certain design and structure that becomes pretty hard to change once writing the tests reveals that the system is hard to test. Time and again people have bashed TDD by being fanatic about the exact ritual or process to be followed, without understanding this basic spirit. More on this is served here:
Test Driven Development — impractical ? The world needs a better name- Code a Little, Test a Little
Let us not fight over the definition; let’s embrace the spirit.
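In that spirit, a minimal test-first loop can be sketched in a few lines — `slug()` is a hypothetical helper, not from any real project:

```python
# Step 1: write the test while the design is still malleable.  The
# test *is* the first client of the code, so it shapes the interface.
def test_slug():
    assert slug("Hello World") == "hello-world"
    assert slug("  Trim Me  ") == "trim-me"

# Step 2: write just enough code to make the test pass.  Because the
# test came first, slug() ended up as a pure function -- trivially
# testable, with no hidden state to mock away later.
def slug(title):
    return "-".join(title.lower().split())

test_slug()
```

The ritual (red, green, refactor) matters far less than the effect: the code was born testable.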
The essence of many of these Agile processes is the Extreme Programming principles — iterative development, test-driven development, peer review of code, etc. — and they work, even without the fanfare of a particular methodology.
The Iterative Release
There will be few who have not heard about Continuous Integration and Continuous Delivery. Why Continuous? To understand that, we need to go back some years, when the majority of SW Engineering was done by a serial process called the ‘waterfall model’. Basically: Requirements Gathering, leading to a requirements specification document; then an Analysis phase resulting in analysis documentation with the main implementation features identified, often tagged with codes for traceability; from there to the Design and implementation phase; and finally an unofficial Integration phase, where multiple components that have been developed in isolation are integrated and thrown over to System Verification by a different test team. Many times the integrations are smooth, but sometimes they are not; problems are discovered too late in the cycle, and too many changes are needed to rectify them. Whenever the requirements are very well defined, the problems are fewer.

But as we neared the early 2000s, due to the impact of the internet, the Linux OS and the open-source software ecosystem, SW system development was suddenly no longer limited to a few big enterprises. It became business-critical to evolve and adapt fast. Suddenly the old way of doing things was no longer okay; you had to be ‘Agile’ to succeed. The old tenet of requirement freeze was out, and adapting fast to changing requirements was in. This meant that late integration always caused a lot of rework, as many components or the requirements themselves had changed drastically from the specification — not to mention that the specification itself was pretty loose.
Kent Beck made this popular via the Extreme Programming paradigm. Here is an extract from his book
Another important reason to accept the costs of continuous integration is that it dramatically reduces the risk of the project. If two people have different ideas about the appearance or operation of a piece of code you will know in hours. You will never spend days chasing a bug that was created sometime in the last few weeks… The “production build” is not a big deal. Everyone on the team could do it in their sleep by the time it comes around because they have been doing it every day for months. — Extreme Programming Explained: Embrace Change by Kent Beck, Erich Gamma (excerpt)
Note that XP also heavily emphasised Pair Programming, and that aspect is reflected in the above paragraph as well. But we are ignoring it, as the industry has largely grown out of that practice. Continuous Integration, though, was an idea whose time had truly come.
The old way is shown above: multiple teams work in isolation for quite a long time, interacting via meetings (not via code integration), with a final integration towards release.
The new way is to integrate early. It is easy to say and write here, but I have seen many teams struggle with it; to make it work effectively and efficiently, you need smaller components and good SW development practices — TDD, static analysis, code reviews, unambiguous interfaces and many others. I have listed these in one slide under the DevOps heading, for want of a better term — SW Engineering 2.0, anyone?
The Continuous Deployment (and related Continuous Release) part is much harder to do, as it is the proof of the pudding. If you trust your test cases and your ‘green builds’ enough to deploy to live production, and to announce it (release) as well, then CI is not just lip service.
For this, the system should be designed from the ground up to be testable; rely on the API layer to do the majority of testing. It is deeply frustrating and hard to base end-to-end tests on the GUI. I have added more thoughts on testing here.
End to End Testing in MicroServices
First things first. Testing is just a (small) part of SW Quality; It is not the whole. Everyone knows this. I know it…
But apart from that, the engineering culture should be in place to practise this in reality. Traditional program management culture is deeply rooted in decades-old enterprise culture, and however much everyone talks about CD, you can still see monthly or quarterly releases. For new products or services, it is important that this is practised early on, so that automated end-to-end tests and monitoring can improve from real-life feedback and get better and better with time, giving more confidence — a positive reinforcing cycle.
Less Code is better than More Code
The vast majority of code and features are useless; less is more here too. You can be paid for two things: to create value and/or to reduce waste. But do not create waste under the pretence of value; sometimes doing nothing will save your company a lot of money. First, you need to be in a Zen state of mind and fight your impulse to implement something. No, I am not advocating laziness — more like think ten times or more and see if it is really necessary. You need to avoid the white-noise busyness that all of us like to fill ourselves with; we need to confront our loneliness, our aimlessness, keep it aside, and then decide. If you have nothing to do — well, go home and rest, maybe under an apple tree. Rest is essential for thought and ideas. Rewriting all the code to fit some fancy framework or fad of the day is an example of waste.
The practice is pervaded by the reassuring illusion that programs are just devices like any others, the only difference admitted being that their manufacture might require a new type of craftsmen, viz. programmers. From there it is only a small step to measuring “programmer productivity” in terms of “number of lines of code produced per month”. This is a very costly measuring unit because it encourages the writing of insipid code, but today I am less interested in how foolish a unit it is from even a pure business point of view. My point today is that, if we wish to count lines of code, we should not regard them as “lines produced” but as “lines spent”: the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger. E W Dijkstra
This sounds contradictory to common sense. It looks like an anti-pattern, since over-engineering is usually understood as doing something simple in a very complex way — like https://github.com/EnterpriseQualityCoding/FizzBuzzEnterpriseEdition. But it is more a matter of naming: over-implementation is different from over-engineering. It is almost like saying one does not have a better name than Murder for something really much closer to Love. If we can forget about the naming, what I am talking about is building above the minimum specification: the Apple design themes (MagSafe, Retina Display, etc.), the Nokia mobile phone’s reliability, Toyota quality and the like — typically the values we associate with the brands. There is a similar value behind SW products, but customers do not experience it directly; it is noticed only in its absence. Think of a day when you are no longer able to search or shop due to the non-availability of Google or Amazon services. Very few know the complex engineering behind these very simple interfaces. These systems are designed to scale — imagine the videos uploaded to YouTube and Google Photos by billions of users and made accessible to billions of others, with the enormity hidden effectively.
You may think these are big corporations with huge budgets, but these were tiny companies that have grown due to the value in their engineering. The idea and its implementation are equally important.
If you were asked to build a passenger car, would you go rogue and design a race car? This is what people think of when they think of over-engineering. In this industry, however, things are different. Requirements are always unclear and under-specified. Product management will say the first version needs to support only 10 users. The architect and developers crank out a system in no time. Users love it, and it becomes popular; but now the system is crashing, as it was not engineered to handle the load. This never happens with buildings, bridges or railway cars; in this industry, the engineering effort is underestimated. You can still provision resources for only ten users, say a server or a few compute instances. But the system should be designed to be horizontally scalable, which means every component in the system should be designed for linear scalability. This is easy to overlook; for example, relational databases look linearly scalable for some time, until they are not. That does not necessarily mean you should not use them, but you should use them where they fit.
This over-design may seem like a lot of upfront work. It is a bit more upfront work, but mostly it lies in choosing the right technology or service, and maybe testing it, rarely in developing from scratch.
I have seen that all successful products grow — in users, devices or data. There is a popular adage, “Keep It Simple, S***id”, and another, “Premature optimisation is the root of all evil”. Well, “as simple as possible, but no simpler” is an adage from a much more respected individual; and regarding optimisation — optimise what is the question. Things that form the base of the system, the backbone, are very, very hard to ‘optimise’ later on. The rest can be optimised even later. As said: make things simple, but no simpler — simpler than that is stupid as well.
DevOps
Everyone means something different when they say DevOps: developers taking on part of the operations role, operations automating away manual activities, etc.
The CI/CD part: First things first — DevOps is not just CI/CD, though automation of testing and automation of deployment are very much part of it.
Also, DevOps is not a team or a team name; it is a process — a way of working. Google’s Site Reliability Engineering is a better way to name teams and people: SREs are developers automating away the manual operations part.
Operations Automation: Almost all SW systems have the bulk of their code from third-party or open-source systems: platform services like Cassandra, Elastic, Prometheus, Kubernetes, Grafana, etc., and even lower down, bare metal, disks, S3, networking. This is an iceberg hidden in plain sight. Operations is not just operating your microservice or your code; it is understanding a bit of the stack that you use. There needs to be a mix of traditional operations people who are good at low-level HW, networking, disks, etc., and developers who can help automate repeated tasks. The easy way out is using cloud services: PaaS, even database-as-a-service, identity management services, etc.
The NoOps story: The idea here is to automate down to virtually zero manual operations. It is almost a pipe dream: even if we could create automation that removes all humans from running a system, there are a lot of bugs and corner cases in each of these subsystems, on which reliable automation is hard to build. If you are not convinced, check the bugs in these evolving subsystems. Or take the case where, despite all the automation, someone deletes or corrupts your precious data, or a few nodes in your system (Cassandra, Ceph, or something else) crash and you need to restore. Sure, companies like Google may have automated much of this; however, very little of it is open-sourced, unlike the frameworks, and the cost of automation increases as the complexity of the solution does. You need humans in development, and you need humans in operations as well — or you pay quite a decent sum for these cloud services. Here is a related article.
DevOps needs Feedback from the Ground
Monitoring: The key to good operations is visibility, and for that, monitoring of the cloud or edge-cloud deployments is needed. You need to monitor your application using application-specific counters. You need to monitor your RPC framework: number of requests, responses, latency, errors, etc. You also need to monitor the compute and memory resources used by your applications.
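The counters themselves are simple; in a real system you would likely use a library such as prometheus_client, but a minimal hand-rolled sketch (all names hypothetical) shows the idea:

```python
from collections import defaultdict

# Minimal sketch of application-level RPC metrics: requests, errors,
# and cumulative latency per endpoint.  A real system would export
# these via a metrics library instead of keeping them in dicts.
class Metrics:
    def __init__(self):
        self.requests = defaultdict(int)
        self.errors = defaultdict(int)
        self.latency_sum = defaultdict(float)

    def observe(self, endpoint, seconds, ok=True):
        self.requests[endpoint] += 1
        self.latency_sum[endpoint] += seconds
        if not ok:
            self.errors[endpoint] += 1

    def avg_latency(self, endpoint):
        return self.latency_sum[endpoint] / self.requests[endpoint]

metrics = Metrics()
metrics.observe("/search", 0.120)
metrics.observe("/search", 0.080, ok=False)
```

From these three raw series you can derive rates, error ratios and latency averages — the bulk of what dashboards and alerts need.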
Running test cases in production — canary testing (via a blue-green release model) — is also a sort of monitoring: feedback on your release. Releasing to part of real traffic is now made easier with technology like Istio (here is a better link about these terms, as they can be confusing).
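The core of canary routing is just a weighted choice per request; Istio expresses the same idea declaratively as route weights, but the decision itself can be sketched like this (the version names and the 10% weight are illustrative assumptions):

```python
import random

# Canary routing in miniature: send ~10% of requests to the new
# ("green") version and the rest to the stable ("blue") one, then
# watch the green version's metrics before widening the weight.
def pick_version(canary_weight=0.10, rng=random.random):
    return "green" if rng() < canary_weight else "blue"

# With a seeded generator the split is reproducible and close to 10%.
rng = random.Random(42)
sample = [pick_version(rng=rng.random) for _ in range(10000)]
green_share = sample.count("green") / len(sample)
```

If the canary's error rate or latency degrades, the weight goes back to zero; if it holds, the weight is raised until green becomes the new blue.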
So, in short, DevOps is not one thing. DevOps is a set of practices and related technology for managing complexity in SW Engineering.
There is a lot to write about technology, and a lot has already been written, so we will write very little here. Here are some of the most overlooked themes.
All abstractions are leaky abstractions
https://www.joelonsoftware.com/2002/11/11/the-law-of-leaky-abstractions/ — Joel Spolsky, Stack Overflow co-founder
Technology is the abstraction through which we deal with the computer, and the first rule to bear in mind is that all abstractions can be leaky: you may need to understand the framework in detail to find and fix the root cause. Let me give an example. Coming from C++, the world of hairy memory leaks, automatic garbage collection in Java was something we were in awe of. But many went in headlong, and unwisely, creating too many short-lived objects, mostly through misuse of strings. Java’s GC was supposed to handle everything, right? No, not at all: the system was soon suffering the side effects of GC pauses, which took some time to figure out. I forget the whole story, but it was a backend system, and the GC pauses initially manifested as a strange functional bug. Soon all the complexity of garbage collection shone through; a simulator had to be written first to prove that it was due to GC, and then a way found to fix the problem. We spent a month tuning the system, understanding the problems of huge heaps and trying out different GC strategies. It became almost as complex as, or more complex than, isolating hard memory leaks in C++. This happens in almost all software systems at different levels: whether you are hit by the GIL in Python, the non-scalability of an SQL database, the split-brain of a master-slave system in high-availability mode, or the complexities of Cassandra — tombstones, problems with materialized views, or whatnot — the abstractions are out there, ready to spring a surprise leak.
Mostly, an abstraction is built with specific use cases in mind, and some use cases are not factored in. When a system sails into this uncharted territory it springs leaks, and things that were abstracted so cleanly start to seep through.
Because of this, it is faster and easier to build things than to really test them thoroughly, operate them and maintain them. I have found that running long load and stability tests — days to weeks — uncovers a lot of such problems; the rest only time reveals.
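A tiny, everyday instance of the same law — picked as an illustration, not from the story above — is decimal arithmetic, which in nearly every language is an abstraction over binary floating point:

```python
import math

# "Decimal" arithmetic is an abstraction over binary floating point,
# and it leaks as soon as the numbers do not fit the representation:
leaks = (0.1 + 0.2) == 0.3   # False -- the abstraction just leaked

# Once you know the layer below, you work with it, not against it:
safe = math.isclose(0.1 + 0.2, 0.3)   # True
```

The GC pauses, the GIL, the SQL planner's choices: the same pattern, just with far more expensive debugging sessions.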
Unambiguous Interfaces over Ambiguous Ones
Most modern systems are built by teams that are globally distributed. Communication between people is very imperfect, and hence this is one of the worst ways of structuring teams. Nevertheless, to reduce cost and hedge risks, current enterprises have many distributed sites. Usually, each site works on its own components, and the system is a set of services that interface with each other, typically via web services, usually REST.
Here is the thing: REST was never meant to be a programmable interface; it is a simple, general interface for web applications to speak to a web server over HTTP (GET, POST). Before REST we had WSDL files in SOAP to specify interfaces more strongly; but SOAP/WSDL was complex, versioning was problematic, and it was generally a mess that everyone was happy to put behind them and embrace REST for everything. I had started with MS COM, then CORBA and SOAP, and was as happy as anyone to use REST and lightweight JSON instead of XML payloads. However, when working with multi-component systems created by distributed teams, the vagueness of the REST 'interface' becomes a major problem. Agreeing on REST interfaces involves people interactions and many assumptions; since there is no versioning except by convention, and no static typing that a compiler can check, everything runs in best-effort mode, and it breaks down spectacularly as the number of components, and of distributed teams creating those components, grows.
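To make the contrast concrete, here is a small, hypothetical sketch in Java (names like OrderRequest are invented for illustration) of how a loosely agreed JSON payload fails silently where a typed contract cannot:

```java
import java.util.Map;

public class TypedContract {
    // What a loosely agreed REST body degenerates into on the caller side:
    // a typo-prone string key and a silent default when it is absent.
    static int quantityFromJson(Map<String, Object> body) {
        Object q = body.get("quantity");
        return q == null ? 0 : (Integer) q;
    }

    // The statically typed equivalent: the field's name and type are
    // checked by the compiler, for every caller, on every build.
    record OrderRequest(String sku, int quantity) {}

    static int quantityFromTyped(OrderRequest req) {
        return req.quantity(); // cannot misspell, cannot be absent
    }

    public static void main(String[] args) {
        // The sender misspelled the key; nothing fails, we just get 0.
        int loose = quantityFromJson(Map.of("quanity", 3));
        int typed = quantityFromTyped(new OrderRequest("ABC-1", 3));
        System.out.println(loose + " vs " + typed); // 0 vs 3
    }
}
```

The loose version ships the bug to production; the typed version never compiles with a misspelled field. (Records require Java 16 or later.)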
In short, use a statically typed, versioned interface that is easy to agree on, unambiguous and easy to test with a machine. I won't repeat here why I think gRPC is the best thing since sliced bread; that is covered here -
REST is not the Best for Micro-Services: GRPC and Docker make a compelling case
Open Closed Principle
The first principles should always be followed. Technology can help with type checking and versioning, but programmers must take care not to break interface contracts: do not modify interfaces, only extend them, and ensure backward compatibility. Similarly, even when an interface is technically unmodified, if its implementation changes that is a semantic incompatibility: yesterday you accepted any string, today you accept only strings in email format, and again things will break. When the number of components and teams increases, it is almost impossible to think through and manually orchestrate a breaking change across all of them. Better to avoid such changes; the Open-Closed Principle is a golden one and applies here to the interface as well as to the implementation. There are good tools out there to ensure 'semantic typing', if I can call it that.
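As a small, hypothetical illustration of extending rather than modifying (the email example above), sketched in Java: the V1 behaviour is frozen, and the stricter rule is added alongside it instead of tightening V1 in place, which would be a semantic break for old callers.

```java
import java.util.regex.Pattern;

public class ContactApi {
    // V1 contract: any non-null string is accepted. It must stay that way,
    // because existing callers depend on this behaviour.
    static boolean acceptContactV1(String value) {
        return value != null;
    }

    // V2 extends the API with the stricter email rule instead of
    // changing V1's semantics underneath its callers.
    private static final Pattern EMAIL = Pattern.compile("[^@\\s]+@[^@\\s]+");
    static boolean acceptContactV2(String value) {
        return value != null && EMAIL.matcher(value).matches();
    }

    public static void main(String[] args) {
        // Old callers keep working; new callers opt into the new semantics.
        System.out.println(acceptContactV1("not-an-email")); // true
        System.out.println(acceptContactV2("not-an-email")); // false
        System.out.println(acceptContactV2("a@b.com"));      // true
    }
}
```

The same versioning idea applies to service interfaces: add a new method or field, never repurpose an existing one.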
Use a strongly, statically typed language for Production
There are some arguments that cannot be won; static vs dynamic typing is one of them. I remember a story in this context.
In a village, there was a dispute between two people over something. Each believed in his own version and dug deeper and deeper into the argument to prove points and counterpoints; so deep that each thought an impartial third party could easily see the proofs and arguments and identify his version as the truth. So they consulted a hermit who was widely revered. The first person went into the hermit's house and spent an hour laying out all his arguments and proofs. The hermit listened patiently and finally said, "You are right". He went out happy. The second person went in and spent another hour pouring out all his arguments. The hermit listened patiently and finally said, "You are right". He too went out happy. The hermit's wife had been listening to both arguments from the other room. Just as the second person left, she came out hastily: "How can both of them be right? Only one can be right!". The hermit listened patiently and said, "You are right".
I am not sure if the import of the story comes across; but having used both in production for some years, I can tell you that if I were to write production code, I would choose a statically typed language any day. And when I say "me", I mean the thousands like me: mortal programmers who make mistakes on every line they write and are grateful for the compiler and the linter; I exclude all God Programmers from this. Knowing humanity, you can be pretty sure there will be people who can write, say, an OS in assembly, and ask for more.
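For what it is worth, here is a trivial sketch of the kind of everyday slip a compiler catches for us mortals (names invented for illustration); the commented-out call is the one a statically typed language rejects at build time, while a dynamic language would only surface it at runtime, if at all.

```java
public class CompilerSavesMe {
    // Widen to long before multiplying, so large carts do not overflow int.
    static long totalCents(int priceCents, int quantity) {
        return (long) priceCents * quantity;
    }

    public static void main(String[] args) {
        System.out.println(totalCents(250, 4)); // 1000

        // totalCents("250", 4);
        // ^ rejected at compile time: String is not int. In a dynamic
        //   language this slips through until the code path is executed.
    }
}
```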
MicroServices — Smaller Decoupled Services
I guess it would not be right to close without dwelling on the latest, greatest architectural style of the moment.
Every few years some new trend comes along. I started with MS COM (Component Object Model), and when I could register a new version of a COM DLL right on a live production system and watch the flow of code through the new DLL, I was as excited as the next person. COM saved us from DLL Hell; then came Java, which gave us machine independence (the VM was the most hyped part, along with automatic GC). Then MS ditched COM and DCOM and went with .NET, a VM of their own. All the while OMG CORBA was ambling along, and the COM guys started using it, until Web Services and SOAP came along with a huge marketing push from vendors. So we buried CORBA and did SOAP, WSDLs and XML until JSON came along and REST shone a lighter path. Somewhere along the way there was a great hue and cry about Enterprise Java Beans (EJB), and once we started using it we knew better than to touch it again. Those were the heydays of IBM WebSphere and JBoss (later bought by Red Hat). The ESB (enterprise service bus) scene was heating up, and many ditched EJBs and went for that. Finally REST and services overpowered most of it, till someone woke up and realised: hey, REST was never meant to be a programmatic interface; it was an architectural style for the World Wide Web, where a simple and general interface was the need. By this time many had been burned by the ambiguity of REST interfaces. Here is the strange thing about technology: everything is nice and sweet till you live through its bad side; if you have not, you can keep arguing forever and never understand, until you have gone all in and burned yourself. Google, which runs one of the largest microservice ecosystems in the world, released gRPC into the open with little fanfare or marketing; they knew better than to use REST. And so we now have gRPC and (micro)services. Container technology, specifically Docker with its Dockerfile concept and immutable layers, gave us a leg up to something better than VMs: immutable infrastructure.
Soon managing Docker containers became a task in itself, and Kubernetes filled that void; a bit complex, as it is very generic, but it stood out as the best of the lot.
In programming languages, Microsoft understood the problems of weak type checking in JavaScript and released TypeScript. Enterprise developers started feeling that most object-oriented paradigms are just ceremony, and Java was getting too verbose and complex with Oracle getting involved. Golang started getting adopted as a better Java alternative, better than Scala too. The number of technologies and frameworks that have come and gone, or evolved and morphed, in the last twenty years is huge; I am not even scratching the surface, but I wanted to give a feel for how fast things keep changing in this industry.
With this context, and keeping Serverless in mind, let us come back to microservices. The first question to ask is why, not how. Why microservices in the first place?
Why not a tiered architecture: Application Server/DB tier, Web tier? Why are microservices widely adopted or gaining popularity in all the high-profit companies: Amazon, Netflix, Google? Will we be more profitable if we adopt this? These are the right questions. Before we answer them, we need to realise that the problems you don't know you have are usually your biggest problems.
Some architects think their system is tiered, service-oriented, message-based, superbly designed, cloud-based and so on. Others may think the reverse: my system is not containerized, not cloud-based, not written in Go or Erlang or whatever. Neither architect nor developer is asking whether there is a problem somewhere, a bigger problem.
What could be the Problem?
Even when your system architecture is beautiful, theoretically right and academically correct, if you take a year to release a small feature to your end customer, is it any good? Mostly the structure of a system architecture closely resembles the organization structure: Conway's law.
"Organizations which design systems are constrained to produce designs which are copies of the communication structures of these organizations" (Conway's law, from "How Do Committees Invent", http://www.melconway.com/research/committees.html)
This is a problem. Humans may love each other, but there is rarely much love lost between teams. You want to limit ad hoc responsibility-sharing between multiple teams; you don't want teams throwing half-finished work at each other as part of the collaboration needed to take a feature to release.
Amazon realized they had a problem around 2001.
“If you go back to 2001,” stated Amazon AWS senior manager for product management Rob Brigham, “the Amazon.com retail website was a large architectural monolith.”
“Now, don’t get me wrong. It was architected in multiple tiers, and those tiers had many components in them,” Brigham continued. “But they’re all very tightly coupled together, where they behaved like one big monolith… that monolith is going to add overhead into your process, and that software development lifecycle is going to begin to slow down.”
“We measured the amount of time it took a code change to make its way through that deployment lifecycle, across a number of teams. When we added up that data, looked at the results and saw the average time it took, we were frankly embarrassed. It was on the order of weeks.”
“For a company like Amazon that prides itself on efficiency, for a company that uses robots inside of our fulfilment centres to move around physical goods, a company that wants to deploy packages to your doorstep using drones, you can imagine how crazy it was,” he said, “that we were using humans to pass around these virtual bits in our software delivery process.”
Amazon turned to SOA... Jeff Bezos style
Here is a post from the famous ex-Amazon, ex-Google engineer Steve Yegge that got published accidentally. He recalls that one day, sometime back around 2002 (give or take a year), Jeff Bezos issued a mandate:
1) All teams will henceforth expose their data and functionality through service interfaces.
2) Teams must communicate with each other through these interfaces.
3) There will be no other form of inter-process communication allowed: no direct linking, no direct reads of another team’s data store, no shared-memory model, no back-doors whatsoever. The only communication allowed is via service interface calls over the network.
4) It doesn’t matter what technology they use.
5) All service interfaces, without exception, must be designed from the ground up to be externalizable. That is to say, the team must plan and design to be able to expose the interface to developers in the outside world. No exceptions.
The mandate closed with:
6) Anyone who doesn’t do this will be fired.
Service-oriented architecture promised modular components, each providing a web API to interact with it. A lot of other companies and products embraced SOA as well; it was heavily marketed: SOA, Web Services and the like. But many missed the crucial part of why Amazon did what they did.
Amazon adopted SOA and worked on Continuous Delivery* so that they could use SOA to make feature releases faster and faster. Remember: system architecture is not the end goal; system engineering for the business need is. (*http://www.allthingsdistributed.com/2014/11/apollo-amazon-deployment-engine.html)
The reason, the why of microservices, is faster and faster delivery; that is, once a feature is done, no delays due to team structure, system structure, lack of test automation or any other reason. For continuous delivery, a service-oriented architecture is needed, and microservices are the current evolution, since we don't want SOA services themselves becoming too big and monolithic.
Microservices as a term was coined around 2011, but it was a progression from SOA. Martin Fowler, the famous ThoughtWorks consultant, has documented many aspects of this. There is no one definition.
Here is one that folds the why into the what:
A microservice architecture is an enabling technology for continuous product delivery, using immutable, containerized software modules that adhere to typed, versioned interfaces and a lightweight inter-module communication mechanism. Typically, container management, centralized log and metrics collection and monitoring, a high degree of automated testing and a continuous deployment pipeline are needed to create a working microservice-based system.
Note that there are some hidden problems with microservices. The first is finding where a problem actually is. You need tracing in the logs, with a trace-id propagated through all callees, and you need good timeouts and deadlines. I do not like to recommend any specific technology here, but getting all this right without gRPC is pretty hard now. Using a service mesh like Istio, along with the tracing helper libraries, also helps in gaining visibility into what is happening in your microservices. This is essential, or you are setting yourself up for trouble.
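Leaving specific technologies aside, the two ingredients named above, a propagated trace-id and a deadline honored by callees, can be sketched by hand in a few lines of Java. All names here are illustrative; a real system would carry these via gRPC metadata and deadlines, or an OpenTelemetry context.

```java
import java.util.UUID;

public class Tracing {
    // The context every service method threads through to its callees.
    record CallContext(String traceId, long deadlineMillis) {
        boolean expired() { return System.currentTimeMillis() > deadlineMillis; }
    }

    static String handleRequest() {
        CallContext ctx = new CallContext(
                UUID.randomUUID().toString(),
                System.currentTimeMillis() + 200); // 200 ms total budget
        return serviceA(ctx);
    }

    static String serviceA(CallContext ctx) {
        log(ctx, "serviceA start");
        return serviceB(ctx); // the same trace-id flows downstream
    }

    static String serviceB(CallContext ctx) {
        if (ctx.expired()) { // honor the deadline instead of piling on work
            log(ctx, "serviceB aborted: deadline exceeded");
            return "timeout";
        }
        log(ctx, "serviceB done");
        return "ok";
    }

    // Every log line carries the trace-id, so one request can be followed
    // across all the services it touched.
    static void log(CallContext ctx, String msg) {
        System.out.println("[trace=" + ctx.traceId() + "] " + msg);
    }

    public static void main(String[] args) {
        System.out.println(handleRequest());
    }
}
```

Grepping the aggregated logs for one trace-id then reconstructs the whole call path of a single request, which is the visibility a service mesh gives you out of the box.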
The other part is more intangible: who owns a service? Soon there will be more microservices than developers, and the older ones get forgotten, till a bug or an update calls for ownership. As long as one team is mapped to each service for maintenance, this is fine.
There is a lot more to write about good technology: horizontal scalability and the importance of linear scalability, masterless high availability (master-slave is always problematic), machine learning and deep learning technologies, and so on. But this article is an attempt to list out the basics. Do drop in a comment if you have suggestions.