From Event Storming to Architecture
Welcome to the fourth post in a series documenting the evolution of a learning side-project of mine: analysing the variation of a cryptocurrency’s price in relation to events about it, such as a milestone release.
- Post 1 introduced the ultimate goal and its motivation.
- Post 2 went over the quality attributes to focus on, considering the data nature of this project.
- Post 3 described the Event Storming process I went through.
This post gets us “closer to the metal” and describes the choices I have made in order to turn this project into reality.
After multiple iterations, the Event Storming sessions settled on four different aggregates, which I mapped to two independent services. One collects and cleanses data, while the other trains and applies prediction models.
In earlier model iterations, I was planning to have 4 different services: the coin observation was separated from the crypto event service and the ML training was separated from the prediction. Ultimately, that was overly complex and only two services are present now. This simplification is a result of iterating over the model and event-storming it. Let’s go over some of the choices I made.
Crypto sources. The goal is to correlate crypto events with the price variation of the related cryptocurrencies. Fortunately, all the information I need is available for free (at least at the time of writing). CoinMarketCap is the leading website for parameters such as volume and price and offers a free API. CoinMarketCal gathers a series of crypto events, classified by categories such as “Community Event”, “Fork/Swap”, “Release” and so forth. This is great for my purposes (admittedly, this is the website that first inspired me), but at the same time it creates a heavy dependency on this particular website. Should it stop working, I will need to find another data source.
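To make “variation in relation to an event” a bit more concrete, here is a minimal sketch of one possible definition: the relative price change between the last measurement at or before the event and the last measurement inside a time window after it. The names `CoinMeasurement`, `CryptoEvent` and `relativeVariation` are hypothetical, and the window-based definition is an assumption for illustration rather than the project's final model.

```scala
import java.time.{Duration, Instant}

// Hypothetical domain types, made up for this sketch
final case class CoinMeasurement(symbol: String, at: Instant, priceUsd: BigDecimal)
final case class CryptoEvent(symbol: String, category: String, at: Instant)

object Variation {

  /** Relative price change from the last measurement at or before the event
    * to the last measurement within `window` after it.
    * Returns None when either side has no measurement or the baseline price is zero.
    */
  def relativeVariation(
      event: CryptoEvent,
      measurements: Seq[CoinMeasurement],
      window: Duration
  ): Option[BigDecimal] = {
    // Keep only the coin the event refers to, in chronological order
    val sameCoin = measurements
      .filter(_.symbol == event.symbol)
      .sortBy(_.at.toEpochMilli)
    val windowEnd = event.at.plus(window)

    for {
      before <- sameCoin.filter(m => !m.at.isAfter(event.at)).lastOption
      after <- sameCoin
        .filter(m => m.at.isAfter(event.at) && !m.at.isAfter(windowEnd))
        .lastOption
      if before.priceUsd.signum != 0 // avoid dividing by zero
    } yield (after.priceUsd - before.priceUsd) / before.priceUsd
  }
}
```

With a price of 100 one hour before a release and 110 one hour after it, a 24-hour window yields a variation of 0.1, while a 10-minute window yields `None` because no measurement falls inside it.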
Programming language. My favourite language is Scala, and it is the one I’m fastest with. Moreover, it gives you superpowers when combined with Akka, which provides all the abstractions needed to implement reactive systems: products that stay responsive under any condition of load or failure, while keeping complexity at bay. One of the most powerful Akka modules is Akka Streams, which offers high-level abstractions to work with a stream of items. We’ll take a closer look at it.
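As a small taste of what “a stream of items” looks like in Akka Streams, the sketch below pushes a few raw price readings through a reusable cleansing stage and collects the result. It assumes the `akka-stream` dependency is on the classpath and Akka 2.6 or later, where an implicit `ActorSystem` is enough to run a stream (older versions need an explicit `ActorMaterializer`). `RawPrice`, `CleanPrice` and the cleansing rule are hypothetical and only there for illustration.

```scala
import akka.NotUsed
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Flow, Sink, Source}

import scala.concurrent.Await
import scala.concurrent.duration._

object CleansingSketch {

  // Hypothetical shapes of a reading before and after cleansing
  final case class RawPrice(symbol: String, priceUsd: Option[Double])
  final case class CleanPrice(symbol: String, priceUsd: Double)

  // A reusable stage: drop readings without a positive price, normalise the symbol
  val cleanse: Flow[RawPrice, CleanPrice, NotUsed] =
    Flow[RawPrice].collect {
      case RawPrice(symbol, Some(price)) if price > 0 =>
        CleanPrice(symbol.trim.toUpperCase, price)
    }

  // Runs the stream to completion and returns the cleansed readings.
  // Blocking with Await is for demonstration only.
  def run(raw: List[RawPrice]): Seq[CleanPrice] = {
    // Assumption: Akka 2.6+, where the implicit system also provides the materializer
    implicit val system: ActorSystem = ActorSystem("cleansing-sketch")
    try Await.result(Source(raw).via(cleanse).runWith(Sink.seq), 3.seconds)
    finally system.terminate()
  }
}
```

The appeal is that `cleanse` is a plain value: it can be tested on its own and later plugged between a real source (an API poller) and a real sink (a queue or a database) without changes.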
Cloud repository. Every codebase needs to have a nest. If you want to host it in the cloud, rather than on your own server, the three main choices nowadays are GitHub, Bitbucket or GitLab.
GitHub is the most popular one, especially in the open source world. Unlike many developers out there, I wasn’t bothered by the Microsoft acquisition (actually, well done Microsoft!), but in the end I chose to host this project on Bitbucket as I’m especially curious to try out their pipeline feature. I have never been too comfortable with the entire CI/CD world, and the promise of an easier way is very attractive.
CI/CD pipeline. As just mentioned, I am going to explore Bitbucket and its pipelining features. I have used Jenkins in the past, but never really loved it. The Chuck Norris plugin made it slightly more fun to work with.
Monitoring. Unless you want to go for Lightbend’s commercial suite, the de facto standard for monitoring reactive systems is Kamon. While it targets any generic JVM software, a couple of side modules make it aware of all the asynchronous contexts to be found in Akka. It couples nicely with other open solutions like Prometheus and Grafana, which is what I intend to use.
Deployment. The choice here won’t be surprising. Docker and Kubernetes are the winning team. I don’t strictly need any of the core features that made both famous — deploy anywhere, auto-scaling and so forth. A classic VM setup might be cheaper, but I am comfortable with the containers & orchestration setup and I even talk about it at conferences.
Cloud service. I have experience with AWS, but somehow feel more at home with Google. The AWS documentation seems verbose and it always feels like I’m looking for a needle in a haystack. The infrastructure I use for my talks’ demos is on Google, so the more I practice the better.
Inter-service communication. In earlier iterations I had chosen Kafka as the bridge between the decoupled services. In reality, I am not going for a strictly event-sourced persisted model. Kafka would have been great if I needed log replays, but in this case a simple queue is enough and adds less complexity. Managed Kafka instances are also expensive, even for the low rate of messages that I expect, which is another reason to look elsewhere. Considering that my services will run on Google, Pub/Sub is the most natural choice.
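Since the transport has already changed once on paper, one way to keep that decision cheap is to hide it behind a tiny interface. The sketch below is an assumption about how this could look, not a description of the actual code: `MessageBus`, `InMemoryBus` and `MeasurementCollected` are hypothetical names, and the in-memory implementation is only a stand-in for tests. A Pub/Sub-backed implementation would wrap the Google client library behind the same trait.

```scala
import scala.collection.mutable

// Hypothetical message exchanged between the two services
final case class MeasurementCollected(symbol: String, priceUsd: BigDecimal)

// The only thing the services would know about the transport
trait MessageBus[A] {
  def publish(message: A): Unit
  def subscribe(handler: A => Unit): Unit
}

// In-memory stand-in: synchronous delivery to the handlers registered so far,
// with no acknowledgements, retries or persistence
final class InMemoryBus[A] extends MessageBus[A] {
  private val handlers = mutable.ListBuffer.empty[A => Unit]

  def publish(message: A): Unit = handlers.foreach(handler => handler(message))

  def subscribe(handler: A => Unit): Unit = {
    handlers += handler
    ()
  }
}
```

The collecting service would depend on `MessageBus[MeasurementCollected]` for publishing and the prediction service for subscribing, so swapping the queue touches a single implementation class.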
Persistence layer. From what I can see at the moment, I need to persist crypto events, coin measurements and the trained models. My cloud choice for simple entity storage is Google Cloud Datastore, cheap and reliable. As for the coin measurements, I studied the differences between all of Google’s storage options, and BigQuery seems to fit my case best. There is a Java SDK that will hopefully fit just fine. I am not sure how to best store trained models just yet, which is why the architectural diagram shows a question mark.
Machine Learning. This is where I start to feel in the exploring zone. I have had experience with Spark, plus I was sitting next to Holden Karau at a speakers’ dinner recently, so naturally I went and checked out Spark ML. Aaaand, I dunno. I stumbled across another library, Smile, and I think I will experiment with that one first. At least it won’t give me headaches with Scala 2.12, which is still a lingering topic in the Spark world.
The sum of these choices smells a lot like vendor lock-in, but in all honesty I feel this is kind of inevitable. I know already that changes will be needed once development starts. So follow along for the first implementation steps!