[Steven Wilson “The Raven that used to sing—and other stories”]
Mark Jason Dominus wrote about “Moonpig”, which is his and Rik Signes work on building a billing system in Perl. He wrote up his story over on his blog. At Issuu, Francesco Zanitti, Anders Fugmann, and I wrote a system which has some similarities to Moonpig. This is a hash through that system. Currently, we mostly handle post-paid services.
Billing mostly sucks. It is one of those areas of software where the realities of the world clashes with your nice logically structured code base. This clash often produces problematic complications in systems for one reason or the other. Furthermore, to rub salt into the wound, most billing solutions out there sucks even more. So we decided, like Mark and Rik, to write our own.
In hindsight, there are things we should probably have done differently in the structure of the system. And there are things we should perhaps have treated in other ways. But the system is in production and runs. It rarely gets updates.
When we wrote the system, two papers inspired us. “Composing Contracts” by Peyton Jones, Seward and Eber describes a small contract language for financial transactions. And this language is compiled with several compilers to produce different interpretations of the same syntax.
The other source of inspiration was the 3gERP project at DIKU lead by Fritz Henglein. Of interest is Tom Hvitved’s Phd dissertation, which describes an ERP system based on contracts and event logs. Roughly the idea is that events in the system is sent to a persistent log which is never deleted. Code in contracts read events from the log and proceed to execute. A central aspect is the concept of replay where a contracts state can be replayed from the start of time. This provides Point-in-time-recovery (PITR) of any contract at any point, as well as an audit log. Current contract state is kept in-memory and is never persisted. Only the events that will lead to that state.
Our system—and the 3gERP model—consist of contract processes and agent processes. A contract poses obligations which are read by agents. Once agents fulfill the obligation they send back transactions to the event log—and transactions are subsequently picked up by the contracts. A powerful concept in the model is that of blame. If we can’t proceed in a contract due to a time constraint being hit, we can see which party had the obligation of an action. This means we can identify if it is the customer or the company which made the error, and handle it accordingly.
Since our system is written in Erlang, agents are separate (Erlang-)process groups. This allows our system to handle events concurrently, and allows multiple subsystems to proceed. We can also replace standard contracts and agents with mock variants for testing.
Currently we handle all contracts in a single Erlang-process, but this is clearly a design mistake. We could get better isolation properties by splitting each contract into a process of its own and then let contracts run in a truly concurrent fashion.
Another mistake is that we route obligations and transactions over buses inside the system. But it would probably have been better to make contracts to direct-calls through a routing/proxy layer to a target process. The current bus-model is not very Erlang-idiomatic.
The key observations about billing
Here is the primary key observation about billing:
“Profit scales proportionally with the billing system load”.
Modern computers are incredibly beefy. We can run a gigantic billing system in memory on a 7.5 or 15 Gigabyte instance on Amazon EC2. Even if we outscale it and need a 244 Gigabyte machine at almost $7 per hour, chances are our profits are such that this amount of money is peanuts.
Hence, worrying about system performance is—for most companies—going to be premature optimization. Much more important is correctness, durability and resilience. Thus, the focus should be on that and not lolspeed. This is a welcome change of pace.
This key principle is an efficient driving force behind design decisions. You can opt for the simple and verifiable algorithms and data structures over the fast ones. If you ever need to go back and tune the system for scale, you will be making so high a profit it will be a problem of luxury.
Some napkin math: A typical customer account runs a single contract. A contract is around 2 Kilobytes of memory in Erlang. I have 244 Gigabytes of memory. We will run out around 128 million paying customers. Assume profits of $1 per customer and, … this is not a problem.
The second key observation is this:
We can arrange the system such that we take all the blame for errors.
That is, whenever we are in doubt, we give the benefit to the customer. Mark wrote about this when he decided to handle rounding errors by giving customers the fraction which cannot be divided.
But there are many places where this principle can be used to simplify the code base. If a given feature costs $30,000 to implement but only yields you $50 a year, then the effort of implementation is clearly not worth it.
This principle gives us more freedom and flexibility in the implementation because we can handle certain problematic areas by just ignoring them while measuring their impact.
The world is not perfect and faults will happen
Our system is systematically built to acknowledge certain subsystems will fail. And the approach is to regard that we are always in a state where the subsystem failed. More on that later.
We store all data in a flat file on disk. Data are stored as Erlang terms in a pseudo-human-readable format (think XML, S-expressions or JSON). This is deliberate. It is much easier to read and verify.
The current state is handled by a built-in database in Erlang called Mnesia. This database stores an in-memory current view. Provides a transactional store loced to the on-disk log and so on. It provides a cache of what is in the flat file on disk. We can always throw away the database and replay the log from the dawn of time (which is somewhere around Nov. 2011 for this system). In principle Mnesia can also bootstrap the system faster since we can skip over large parts of the log, but we have not had the reason to implement that optimization yet. A reboot just replays from the start.
The Mnesia part also explains how we can do reporting. We have a database, almost Relational in nature, which we can query to extract reports. The PITR-property of the event log also lets us answer questions back in time. But for simplicity, we just store all data persistently in Mnesia in classic OLTP-manner.
Other storage options
One option we have thought of is to keep a flat file per contract and one global file. The advantage is we can then boot contracts on demand when we need them. The disadvantage is we loose an important property of the current system: linear time ordering.
The current system forces an ordering on all events and only forwards events in that order. This simplifies almost all of the code base. Distribution afficianados might claim that this will be a problem for performance. But due to the primary key observation, we can ignore it.
The other option is to introduce a database to run the event log. Postgres is an obvious candidate. We have not given this much thought, but it may be an option and simplify some parts of the system.
Time in our system is handled roughly like time is handled in Moonpig. Time is injected or pushed into the system. It is never pulled by the system. Whenever an event happens, there is a time parameter which is sent into the system and tells us what time it is. Functions inside the system pass on the “current time” and we never call erlang:now() nor os:timestamp() inside the code base.
The consequence is that testing is now suddenly possible. We can quickly play out scenarios where time is forwarded a couple of years. The test system controls the time fully, so it is easy to play out what-would-have-happened scenarios.
When running normally, we inject time at the boundary of the system when some event happens. In effect, other users do not have to cope with the fact that time is injected into the model. But from a testing perspective, we can just pick the variant of the call where time can be overridden.
Current time & contract time
A distinction from Moonpig is how we handle time in contracts. In Moonpig, time is handled by heartbeats and it is the responsibility of subsystems to ignore heartbeats which were duplicated etc.
In our system, contracts have an internal time which we can call C. If T denotes current time, we always have the property that C < T. That is, the contracts time is always behind the current time by some small epsilon value. The scheme where the contract always lags behind real time allows us to handle heartbeats in a slightly different manner. When an event is injected into the contract, a timestamp T is also sent. The contract now forwards itself to the next point in time where something interesting happens on the contract. This makes sure duplicate injections of time and events are ignored by the contract.
This model also describes how we handle problems where the system fails. We have backup procedures. We can just restart the system a couple of days later, and it will forward time and do things correctly. If anything important should have happened on the days in between, they will be handled at this point.
Note the implication: our system is always in a catch-up mode. It always thinks the state of affairs is that it is behind and need to do something to catch up to the current situation.
We never got around to do QuickCheck models for contracts, but there is an interesting property of contracts which is worth mentioning: A contracts state is derivable from a stream of transactions. Such a contract should be indifferent to heartbeat-events and other events types that happens in between the transactions. This can be probalisitically verified by an Erlang QuickCheck model. But we did not get to do that.
A very important aspect of our system is Idempotent “best effort” delivery. When contracts want to have something done, they send out an obligation. These are picked up by agents and then handled. Obligations may get lost. They are periodically restated by the contract. If an Agent crashes, it doesn’t matter that it did not handle all the obligations. We will catch up eventually.
Agents track which obligations they have already fulfilled. If so, they idempotently produce the same answer as before as a transaction which is then sent back to the contract. Thus, if our event log system crashes, the agent can just send in the transaction again.
Charging works on this system as well. A charge contains a unique reference and it handles transactions idempotently. So it doesn’t matter if the charging system sits in the other end of the world and we lose the network connection. We will just retry the charge with the same reference. Since our end handles the unique reference persistenly on disk, we can’t generate another reference for the charge. And the other end can safely store a transaction in their database.
Many areas can be handled with a fire-and-forget type of message. We only try to send mails once. If the underlying system accepts the mail, we idempotently mark it as “done and sent”. This minimizes annoyances on customers, should we have to try to resend mails.
Note that each agent is usually very simple. Often they are less than 40 lines of code. And they can be checked individually, without the rest of the system.
We only have to worry about progress, but that can be measured. Our system extensively uses the folsom system in Erlang to track counters and gauges so we can see what is happening inside the node. In fact, we track almost everything with probes in the system. Full instrumentation is something we strongly believe in for every new system written.
Like in Moonpig, we use immutable constructions all over the place. We never throw away data, but record it into the global event log for future replayability. This is also what we use when something goes wrong and we need to figure out what went wrong.
Immutability is also very powerful when one considers debugging. Since we have every transition recorded and full PITR-support, we can essentially always rerun the business rules of a contract and see what went on inside it.
Replayability has an interesting impact. When we upgrade our code in non-compatible ways, we also have to mention that upgrade in the event log. So old versions of the code still lives on, but is only used up till a point. Then it is switched with new versions of the code base. Essentially the system switches between different contract versions over time.
Floats / Time zones
No floats! Only integers. The system stores everything in cents internally, but doesn’t do financial calculations. Our plan—should we need to do so—was to store everything in picodollars and calculate exchange rates for other currencies. Much like all time is in UTC and are offset from there at the boundary.
However, we do note that our billing provider expects floats, we have a boundary conversion going on… The mind baffles at times at the choices made in software systems.
What we skipped on
We did not implement a DSL for writing contracts. Rather, we wrote the few contracts we needed in (purely) functional Erlang code.
We don’t have a lot of code in place to manage blame. This is due to the contract simplicity currently in place. We don’t really need this.
We could probably improve the model if we had to do it over. I believe this is the way to do billing systems for the vast majority of companies out there. The code is neat and modularized into contracts, agents and persisters of data. Each can be implemented with a natural backing into Erlang processes, proxies for foreign subsystems and RDBMs systems and so on. I also note that a system like Datomic would be near-perfect to store the transaction log. And would have a nice scalability curve to boot.
The choice of Erlang is rather nice for a system like this, where contracts, agents and so on can be modeled as processes. Other good languages could be OCaml—for its expressive type system, or Go—for the goroutines.