Previously, I spent my Thanksgiving long weekend building a real-time data pipeline. In case you missed it, here is the link.
In this part 2, I will focus on how to scale the pipeline, how to extend its capabilities, and how to make it production-ready.
First get it done, then improve it, and finally, let's push the limits.
In part 1, the pipeline loaded mock data; in the real world, however, we have to prepare the pipeline for upstream data coming in, which means defining an integration interface.
Let's start by restructuring the code.
This pipeline has four main functionalities:
- The upstream system sends data in via an API.
- The Kafka producer publishes the data.
- The Kafka consumer reads the data.
- The data is saved to the database.
Logically speaking, we could combine these functions into one single application; however, for scalability we want to separate them into two individual applications:
API + Kafka producer, and Kafka consumer + database insert.
New code structure looks like this:
We try to follow Go application best practices for the project structure.
- Folder api holds the API application and the Kafka producer.
- Folder app holds the Kafka consumer and the database insert.
- Folder cmd holds all the executable binaries.
- Folder deployment holds all the Docker deployment scripts.
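To make the layout above concrete, here is a sketch of what the tree could look like. The exact sub-folder names (e.g. `cmd/api`, `cmd/consumer`) are my assumptions for illustration, not the project's actual names.

```
.
├── api/          # HTTP handlers + Kafka producer
├── app/          # Kafka consumer + database insert
├── cmd/
│   ├── api/      # main package for the API binary
│   └── consumer/ # main package for the consumer binary
└── deployment/   # Dockerfiles and deployment scripts
```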
Our end target: one build produces two binaries, deployed to two different container instances (on AWS, GCP, Azure, etc.).
The beauty of separating these two applications is that each has its own focused function, which makes operation, maintenance, and deployment easier (simulating a microservice architecture).
Technically, API stands for Application Programming Interface. From the business perspective, it is the layer that scales up and communicates with different systems to turn IT/data into business benefits. So if data is the 21st century's oil and the pipeline is the data pipeline, then the API is the faucet.
When we talk about building web-based APIs, we typically think of RESTful APIs with JSON as the standard for interchanging data between applications. This approach works well, and mobile client applications can quickly consume these JSON-based RESTful APIs.
My favorite RESTful API code structure in Go looks like this.
RESTful API code structure
One of my favorite programming quotes is "Think twice, code once." In most cases this is difficult, especially when you are in project mode. However, this is one of my personal projects, so I have the luxury to think twice, and these personal projects give me more experience for when I face a real business-context project.
For this project, imagine the upstream sending data via HTTP POST in a constant stream; the server will probably run into performance issues, which leads to a common question: how to handle millions of requests per second. There is a good article about this.
Go makes it easy to implement service workers, mostly because goroutines are entirely different from OS threads:
The Go runtime multiplexes a potentially large number of goroutines onto a smaller number of OS threads and goroutines blocked on I/O are handled efficiently using epoll or similar facilities. Goroutines have tiny stacks that grow as needed, so it is practical to have hundreds of thousands of goroutines in your program. This allows the programmer to use concurrency to structure their program without being overly concerned with thread overhead.
In other words, all incoming HTTP POST requests are put into a queue, and the workers dispatch them to the Kafka producer.
Let's unit test the API app. As usual, my focus is still on reaching 80%+ unit test coverage.
The most fun part is load testing, because we can find out whether our application can handle stress. People will often say: don't worry, we have the cloud, we can add more instances with just a click, don't worry about the load. That is true; however, the trade-off is the cost, which I think most cloud providers won't document. :)
The cloud is good: it helps us build, ship to production faster, and scale easily. But please keep in mind that what runs in the cloud is still your CODE. Good code will make scaling and maintenance easier, while bad or un-optimized code will cost you a fortune. If you don't believe it, please try.
For our use case, the load test is the most critical, because we need to handle a constant stream of HTTP POST requests. We are using JMeter as our load-testing tool. My current MacBook Pro has a 2.8 GHz Intel Core i7 and 16 GB of memory.
Assume our requirement is to handle 500 requests per second. We double that number and make the test case 1,000 requests per second.
We can see CPU usage averaging 55%, and memory usage holding steady at around 60%. Looks good!
By now, I think this app is ready for production. However, for a real-time pipeline API, RESTful is not the only option; we could also consider UDP (User Datagram Protocol) or gRPC. I will take some time to write those interfaces for this demo application. Stay tuned.