Over a year ago, in 2020, the team at Infinite Lambda was tasked with creating a digital core banking system for one of our clients. The system had to fit the budget and business needs of enterprises of various sizes, from small-scale start-ups with a few customers to large banks with millions of customers and transactions.
Being the ambitious team that we are, we set out the following characteristics and principles:
- Traceable: it was mandatory that our solution could justify each balance in the system by producing every event that led to it. After all, each balance is the net result of all transactions on the account.
- Horizontally scalable: the architecture must allow for elastic scale out to meet demand and scale in to preserve cost.
- Optimised: before relying on horizontal scaling to meet demand, we had to ensure we had an architecture in place that already processed events in the most efficient way. More efficient architecture and algorithms can achieve 100 times greater scaling! In some ways, this is more important than horizontal scalability.
- Highly available: it goes without saying, but if you can’t use your card to buy yourself lunch (remember when we did that standing in a queue? 🤷) then you will lose trust in your digital bank account very quickly.
- Automated: the secret to going fast is automation. We don’t just have automation; it is the only way of doing things: from CI and CD to client contract validation.
- Maintainable: from diagnosing a problem, to releasing a fix and getting feedback all in the same day, maintainability is critical to success.
- Idempotent: for our solution to be truly cloud native, it is not only transactions and payments that need to be predictable but also workloads. Idempotency here means that processing the same input again produces the same output, no matter how many times it runs.
Why Event Sourcing
Keeping all of this in mind, we agreed on an architectural pattern that could accommodate every one of these principles and much more: event sourcing.
The gist of event sourcing is that each change is tracked via an event, and the series of all events represents the single source of truth. These events are then consumed by different processors, each of which creates its own projection of the truth.
If you think about it, a bank account ledger is no more than a list of all transactions that have ever happened. Does it sound familiar? Yes, event sourcing is a way of implementing a banking ledger, so we took this idea and ran with it.
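To make the idea concrete, here is a minimal sketch of replaying a ledger into a balance projection. The `Transaction` shape and account IDs are made up for illustration; the point is only that the balance is never stored as primary data but derived by folding the event stream.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transaction:
    account_id: str
    amount_minor: int  # pence/cents; negative for spends

def project_balance(events, account_id):
    """Fold the full event stream into a single balance projection."""
    return sum(e.amount_minor for e in events if e.account_id == account_id)

ledger = [
    Transaction("acc-1", 10_000),   # deposit of 100.00
    Transaction("acc-1", -2_500),   # card spend of 25.00
    Transaction("acc-2", 5_000),
]
print(project_balance(ledger, "acc-1"))  # 7500
```

Because the events themselves are the source of truth, any other projection (statements, reporting views, fraud checks) can be rebuilt from the same list at any time.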
The Complete Solution
Next, we needed to figure out the event storage, i.e. how to persist all of these events and make them available to all components of the system quickly, reliably and reproducibly. Kafka was a natural choice in this scenario because it:
- Has some properties of a Pub/Sub system;
- Can persist published messages for a very long time;
- Is highly available, which makes it a great fit for the cloud;
- Supports advanced concurrency in the form of partitioning;
- Has a guaranteed order (within a partition).
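The last two properties work together: order is guaranteed per partition, so routing every event for a given key to the same partition gives you per-key ordering while still scaling across partitions. Kafka's default partitioner hashes the message key with murmur2; the sketch below uses SHA-256 as a stand-in, and the partition count is a made-up example.

```python
import hashlib

NUM_PARTITIONS = 12  # hypothetical partition count for the topic

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # Kafka's default partitioner uses murmur2 on the key;
    # any stable hash illustrates the same routing property.
    digest = hashlib.sha256(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % num_partitions

# Every event keyed by the same account lands on the same partition,
# so Kafka's per-partition ordering guarantee applies to that account.
assert partition_for("acc-42") == partition_for("acc-42")
```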
Having decided on event sourcing and Kafka, we had all the pieces and “just” needed to figure out how to put things together.
A critical part of creating a digital banking system is the ability to approve transactions. For example, when you use your bank card, the merchant needs to ensure that you have enough funds in your account, so they make an authorisation request and await an approval. This is at the core of our core banking system.
We created two topics on Kafka: one to represent the authorisation request and one to contain the decision. Consuming from the request topic and producing onto the result topic is a highly optimised event processor which we call the Transaction Approver. It is designed to be the sole owner of each and every balance in the system: it processes all events from the request topic sequentially, so it can categorically and single-handedly decide whether to approve a spend transaction. To make the Transaction Approver horizontally scalable, we partitioned the request topic by account holder.
The Cold Start Problem
Processing every event from the beginning of time on every single start-up of the Transaction Approver is not desirable in a cloud environment, where each individual instance can go up and down at any point in time. This is also referred to as the cold start problem. We carefully considered the issue and solved it by persisting the state of the Transaction Approver to two different media: an in-memory cache to keep it fast, and disk to back up the in-memory cache.
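A minimal version of that snapshot-and-restore idea is sketched below. The file name, JSON format and atomic-write approach are illustrative assumptions; the essential part is storing the last applied offset alongside the state, so a restarted instance replays only the events after the snapshot instead of the whole topic.

```python
import json
import os
import tempfile

STATE_PATH = "approver-state.json"  # hypothetical snapshot location

def save_snapshot(path: str, offset: int, balances: dict) -> None:
    """Atomically persist the in-memory state with the last applied offset."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump({"offset": offset, "balances": balances}, f)
    os.replace(tmp, path)  # atomic rename: readers never see a half-written file

def load_snapshot(path: str):
    """Restore state on start-up; fall back to replaying from the beginning."""
    if not os.path.exists(path):
        return -1, {}
    with open(path) as f:
        state = json.load(f)
    return state["offset"], state["balances"]

save_snapshot(STATE_PATH, 1041, {"acc-1": 7_500})
offset, balances = load_snapshot(STATE_PATH)
# A restarted instance resumes consuming from offset + 1, not from offset 0.
print(offset, balances)  # 1041 {'acc-1': 7500}
os.remove(STATE_PATH)
```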
Here it is in action:
Each of the consumers of the downstream events produced on the ledger topic can create its own projections. For example, the account service, which provides account balances for customers to view, simply stores the most up-to-date balance for each account.
The Card Transaction Service, on the other hand, ensures that a prompt response is returned to the card processor.
Yet another use case is creating a temporal view of each account balance at any point in time to accommodate various reporting and reconciliation needs.
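This temporal view falls out of event sourcing almost for free: to know a balance as of some date, you simply stop the replay at that date. The event tuples and dates below are invented examples.

```python
from datetime import datetime

# (timestamp, account_id, signed amount in minor units) - hypothetical event shape
ledger = [
    (datetime(2021, 1, 5), "acc-1", 10_000),
    (datetime(2021, 2, 1), "acc-1", -2_500),
    (datetime(2021, 3, 9), "acc-1", -1_000),
]

def balance_at(events, account_id, as_of):
    """Replay only the events up to `as_of` to reconstruct a historical balance."""
    return sum(amount for ts, acc, amount in events
               if acc == account_id and ts <= as_of)

print(balance_at(ledger, "acc-1", datetime(2021, 2, 15)))  # 7500
```

A reconciliation job can use exactly this query to answer "what did we think this account held at end of month?" without any extra bookkeeping.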
Efficient Event Storage
The curious among you might be wondering how to keep all events on Kafka. Some of you will have used text serialisation, which tends to use a lot of bytes; most of the time, that does not matter. When you are planning to store millions of events for multiple years, however, the serialisation protocol does matter. Therefore, we chose Google Protocol Buffers for its many advantages:
- Binary optimised format, meaning messages are extremely small;
- Spec-first approach with code generation for a wide array of programming languages;
- Open/Closed principle: the inner workings of Protocol Buffers mean that each message we create today is open for extension in the future but closed for modification, which is essential to event sourcing;
- And finally, to flex a little: with Protocol Buffers and cleverly designed messages for our ledger in place, we calculated that a typical bank with 5M active customers will not need more than 5GB of disk per broker to store 7 years' worth of transactions.
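To see where those savings come from, compare the bytes on the wire. Protocol Buffers encodes integers as base-128 varints behind a one-byte field tag; the sketch below implements that well-known varint scheme by hand (the field number and the JSON counterpart are invented for the comparison, not our ledger schema).

```python
import json

def encode_varint(value: int) -> bytes:
    """Protocol Buffers' base-128 varint encoding for unsigned integers."""
    out = bytearray()
    while True:
        byte = value & 0x7F       # low 7 bits of the value
        value >>= 7
        if value:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

amount = 2_500
as_json = json.dumps({"amount_minor": amount}).encode()
as_proto = b"\x08" + encode_varint(amount)  # tag: field 1, wire type 0 (varint)
print(len(as_json), len(as_proto))  # 22 3
```

Three bytes versus twenty-two for a single field; multiplied across millions of events and seven years of retention, that gap is the difference between gigabytes and terabytes.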
One thing we have not discussed here, but which you need to be aware of if you are considering implementing a similar system yourself, is incremental rollouts of services that consume from Kafka. A typical incremental rollout may trigger multiple Kafka consumer rebalancing operations, which you may or may not want.
In addition, as I said earlier, sticking to the Open/Closed design principle when it comes to the design of the messages is key for event sourcing systems. This is because all versions of the messages are persisted and can be replayed at any point in time. Maintaining backwards compatibility in your message versions is important for two reasons: first, it keeps the processor code free of version-specific if statements; second, it ensures that when new versions of your apps run alongside older ones, you will not fail to process critical event data.
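Proto3's defaulting behaviour is what makes this work: a field added later simply reads as its zero value on old messages, so one code path handles every version. The dict-based sketch below imitates that with a hypothetical `fee_minor` field that did not exist in the first message version.

```python
def apply_transaction(event: dict) -> int:
    # Older events lack the hypothetical "fee_minor" field; proto3-style
    # defaulting lets us read 0 instead of branching on message version.
    return event["amount_minor"] + event.get("fee_minor", 0)

v1_event = {"amount_minor": 2_500}                   # produced before fees existed
v2_event = {"amount_minor": 2_500, "fee_minor": 50}  # produced by the newer schema

print(apply_transaction(v1_event), apply_transaction(v2_event))  # 2500 2550
```

The same processor binary replays seven-year-old events and yesterday's events through one code path, which is exactly the property the Open/Closed principle buys you here.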
We are now in production 🎉 and our system has been performing up to our expectations. However, we identified some improvements that we are going to work on in the future, such as:
- Making use of Schema Registry to make schema evolution easier;
- Finding the equilibrium between the number of consumers and multi-threaded processing to achieve the greatest performance;
- And finally, improving the state serialisation in the Transaction Approver to further boost performance.
Thank you for making it all the way here. Head to the blog for more technical insights that will make your engineering life easier.