Hi Ronen, interesting read, well done.
Sebastian Coetzee

Response to questions on Breaking the Monolith

Thank you so much for your kind words and insightful questions, Sebastian Coetzee.

You say that you gave the credit risk department an interface to update their credit scoring engine. How much freedom do you give them to change things?

Within the context of the rule engine, Credit Risk has the freedom to change rule ordering, content and associated output, using a web UI, without any involvement from Engineering. Each rule comprises input parameters, operators and operands, combined with ANDs and ORs, plus an associated output value, which can be a complex datatype. Engineering would need to be engaged only, for example, to provide additional input parameters to the rule engine, or to apply the rule engine to a different context.
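To make the shape of such a rule concrete, here is a minimal Ruby sketch of a rule made of conditions combined with AND/OR and mapped to an output value, with rules evaluated in the configured order. This is purely illustrative; all names are hypothetical and this is not our production implementation.

```ruby
# Hypothetical sketch of a credit-scoring rule: conditions over input
# parameters, combined with AND/OR, mapping to an output value.
Condition = Struct.new(:parameter, :operator, :operand) do
  # e.g. Condition.new(:monthly_income, :>=, 10_000)
  def satisfied_by?(inputs)
    inputs.fetch(parameter).public_send(operator, operand)
  end
end

class Rule
  # combinator is :and or :or; output can be a complex value (here a Hash)
  def initialize(conditions:, combinator: :and, output:)
    @conditions, @combinator, @output = conditions, combinator, output
  end

  def match(inputs)
    matched =
      if @combinator == :and
        @conditions.all? { |c| c.satisfied_by?(inputs) }
      else
        @conditions.any? { |c| c.satisfied_by?(inputs) }
      end
    matched ? @output : nil
  end
end

# Rules are evaluated in the order Credit Risk configured them;
# the first matching rule determines the output.
def evaluate(rules, inputs)
  rules.each do |rule|
    result = rule.match(inputs)
    return result if result
  end
  nil
end

RULES = [
  Rule.new(
    conditions: [Condition.new(:monthly_income, :>=, 10_000),
                 Condition.new(:age, :>=, 21)],
    combinator: :and,
    output: { decision: :approve, limit: 5_000 }
  )
]

evaluate(RULES, monthly_income: 12_000, age: 30)
# => { decision: :approve, limit: 5_000 }
```

Keeping rules as plain data like this is what allows a web UI to edit ordering, conditions and outputs without a code deploy.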

Do their changes go through a testing process first or are they immediately live?

This is an excellent point. I would like to see us implement a staging environment for the rule engine with the ability to run simulations, and a controlled way of deploying these rule sets into production. However, these requirements would need to originate from and be driven by the business.
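The simulation step I have in mind could be as simple as replaying historical inputs through both the live and the candidate rule sets and surfacing the differences. A hypothetical sketch, with illustrative names and rule sets modelled as callables:

```ruby
# Hypothetical rule sets, modelled as callables for the sketch.
LIVE      = ->(inputs) { inputs[:income] >= 10_000 ? :approve : :decline }
CANDIDATE = ->(inputs) { inputs[:income] >= 12_000 ? :approve : :decline }

# Replay historical inputs through both rule sets and collect the
# cases where the candidate would change the outcome.
def simulate(live, candidate, historical_inputs)
  historical_inputs.each_with_object([]) do |inputs, diffs|
    a = live.call(inputs)
    b = candidate.call(inputs)
    diffs << { inputs: inputs, live: a, candidate: b } if a != b
  end
end

simulate(LIVE, CANDIDATE, [{ income: 11_000 }, { income: 13_000 }])
# => [{ inputs: { income: 11_000 }, live: :approve, candidate: :decline }]
```

A report like this would let Credit Risk see the business impact of a change before promoting the candidate rule set to production.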

I assume you tried implementing something a bit more performant and rigid like Scala or Java and didn’t get much traction in the team? Do you think the effort involved in skilling up in a new stack outweighs the gains? Why did you want to move away from Rails?

Yes, the new stack proposed was Scala-based, because of concerns around the long-term scaling characteristics of Ruby, unease around Ruby’s lack of static type safety, and general enthusiasm for Scala. Ultimately, delivering services on a homogeneous stack facilitated the rapid satisfaction of business requirements, because Engineers were already familiar with the relevant platforms and frameworks. Satisfying business requirements in a timely fashion will tend to win out.

If you have all the user data available, why is credit scoring done in batches? Is it not possible to score the applicant on the fly if this data is available?

Scoring can be done either on-demand or in batches. Both modes of operation are supported, and we use both.

Would the batch nature of this mean that when credit scoring rules are tweaked by the credit risk department the entire database of millions of customers has to be re-scored?

Another excellent point. Currently, the system recalculates scores on demand, but you are correct: following an update to the scoring rules by Credit Risk, the entire customer base would need to be re-scored (for example, prior to a marketing blast). We have the facility to easily initiate a batch re-scoring; it previously ran on a daily schedule.
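The two modes can share the same scoring function, with the batch path simply iterating the customer base in slices. A hypothetical Ruby sketch, with an illustrative stand-in for the real scoring logic:

```ruby
# Hypothetical on-demand entry point; the scoring logic here is a
# stand-in, not our actual rules.
def score_for(customer)
  score = 600
  score += 50  if customer[:on_time_payments] > 12
  score -= 100 if customer[:defaults] > 0
  score
end

# Batch re-score: walk the whole customer base in slices so a rule
# change can be propagated to every record without loading it all
# into memory at once.
def batch_rescore(customers, slice_size: 1_000)
  customers.each_slice(slice_size).with_object({}) do |slice, scores|
    slice.each { |c| scores[c[:id]] = score_for(c) }
  end
end

batch_rescore([
  { id: 1, on_time_payments: 20, defaults: 0 },
  { id: 2, on_time_payments: 3,  defaults: 1 }
])
# => { 1 => 650, 2 => 500 }
```

Because both paths call the same `score_for`, an on-demand score and a batch score computed from the same data always agree.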

How difficult is it to make the shift to using asynchronous HTTP calls between microservices instead of making method calls directly in your monolith?

It’s difficult. For example, there is a large core of legacy code in the monolithic application that we have long wanted to migrate to a micro-service. However, the rapid rate of change to that code, and the technical debt from elsewhere in the application that has become entangled with it, have made the migration daunting. We are making headway by first refactoring that code, supported by tests, into a discrete component with a sensible interface and internal architecture. The majority of the work resides in this refactoring: not in migrating to the micro-service as such, but in first revisiting the legacy codebase so that the target component is in a state suitable for extraction. It’s much easier to satisfy green-field business requirements with micro-services than to extract micro-services out of a legacy application. I can expand on the extraction itself, if you like. I can also expand on the distinction between synchronous and asynchronous invocations, and why I consider this a solved problem.
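The shape of that refactoring can be sketched as introducing a seam: callers depend on a plain-Ruby interface, so the in-process implementation can later be swapped for an HTTP client without touching the call sites. A hypothetical sketch (class names, URL and score values are all illustrative):

```ruby
require "net/http"
require "json"
require "uri"

# In-process implementation, extracted from the legacy code and
# placed behind a seam. The calculation here is a stand-in.
class LocalScoringService
  def score(customer_id)
    42 # stand-in for the legacy calculation
  end
end

# Drop-in replacement that calls the extracted micro-service over
# HTTP; the URL scheme is illustrative.
class RemoteScoringService
  def initialize(base_url)
    @base_url = base_url
  end

  def score(customer_id)
    uri = URI("#{@base_url}/scores/#{customer_id}")
    JSON.parse(Net::HTTP.get(uri)).fetch("score")
  end
end

# Call sites are unchanged whichever implementation is injected.
def decision_for(customer_id, scoring_service)
  scoring_service.score(customer_id) >= 40 ? :approve : :decline
end

decision_for(1, LocalScoringService.new)
# => :approve
```

Once the legacy logic sits behind `LocalScoringService`, the actual migration to a micro-service is the comparatively small step of deploying that logic elsewhere and injecting `RemoteScoringService` instead.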

Does it greatly increase development time or do you find that the separation of concerns allows the teams to spend less time in “analysis paralysis”?

So far, I believe it has reduced development time and volume of operations, because the components running as micro-services tend to ‘run cool’ and tick along successfully by themselves. When they do run into issues, they are usually easier to troubleshoot than issues with the legacy application, simply because the scope of things that can possibly go wrong is reduced — it’s much easier to find a smoking gun. Virtually all development work and ongoing operations still target the monolithic legacy application. In the cases where a micro-service is implicated, the work is usually simplified because the micro-service can be holistically grokked, tested and deployed without much ambiguity or fear of side effects.