I founded a new tech startup called Signal Analytics with an old university friend, Fedor Dzjuba of Linnworks. We are building a modern, cloud-based version of OLAP cubes (multi-dimensional data storage and retrieval) by building our own database system.
I am taking the lead on the technical side and, since I am most comfortable with C++, I decided to build our OLAP engine with it. I did originally build a prototype in Rust, but it was too high-risk (I should write another post to explain this decision).
A lot of my peers think it is bizarre that I am building a cloud service with C++ and not with a dynamic language, such as Ruby or Python, that offers the productivity to ship quickly.
I started to question my own judgement in using C++, so I decided to research whether it is a good idea or not.
C++ is not a dynamic language, but modern C++ (C++11/14) does have type inference. There are a lot of misconceptions that writing C++ means coding with raw pointers, typing long-winded namespaces and types, and managing memory manually. A key feature for feeling more productive in C++ is the auto keyword: the compiler infers the type of the variable, so you do not have to spell out long-winded namespaces and classes.
Manual memory management is the most common misconception about C++. Since C++11, the recommendation is to use std::shared_ptr or std::unique_ptr for automatic memory management. The reference counting in std::shared_ptr carries a small computational cost, but it is minuscule and the safety outweighs it.
The last part of being productive is having libraries to build a service/product rapidly. Python, Ruby and others have great libraries to take care of the common infrastructure. In my opinion, the current C++ standard library is severely lacking in basic functionality and certain APIs have poor performance (for example, reading files with iostreams). Facebook has open-sourced high-quality libraries that have helped us quickly ship alphas of our OLAP cloud service:
Folly is a great general-purpose C++ library with lots of high-performance classes. I use its fbvector and fbstring throughout our engine because they offer better performance than std::vector and std::string respectively. We also use a lot of its futures and atomic lock-free data structures.
Facebook made a really smart move with the growth policy of their dynamically sized containers by not doubling the capacity on each reallocation. It is easy to prove mathematically why a factor of 2 is bad: the blocks freed so far always add up to less than the next block requested, so that memory can never be reused. Their containers grow by 1.5x instead of 2x to improve performance.
On a side note, reading the Folly code has also made me a better C++ developer, so I strongly recommend reading it.
Proxygen is an asynchronous HTTP server that is also developed by Facebook. We use Proxygen as our HTTP server that inserts and retrieves data as JSON to and from our OLAP engine. It allowed us to create a high-performance HTTP server calling our engine in just 1 day. I decided to benchmark it against a Python Tornado server and got the following results for testing with 200 HTTP connections on an EC2 instance:
C++/Proxygen = 1,990,130 requests per second
Python/Tornado = 41,329 requests per second
Its API is more low-level and you will have to write your own HTTP routing but this is a trivial task. Here’s what our HTTP body handler roughly looks like:
Our OLAP engine is essentially a distributed database used to store and query multi-dimensional data. The engine uses Wangle as the foundation of an application server. All the logic is layered into Wangle handlers that are chained together to form a pipeline. The engine communicates with our Proxygen HTTP server to serve data queries, and the nodes communicate with each other.
The engine runs on a grid of servers that share the same (symmetric) binary executable, so there is no master/slave paradigm. Each server is a node that acts as both a master and a slave, and the nodes use a custom binary data protocol to pass data/messages to each other.
The only thing missing for our needs is fibers for cooperative scheduling of storage and querying tasks within the engine. The Folly/Wangle developers have an experimental version at the moment, but it is not production-ready yet.
2. Hardware/Labor Costs
I estimated that one C++ server is roughly equivalent to 40 load-balanced Python servers in raw computational power, based on our HTTP benchmark. Using C++ can therefore squeeze all the computational juice out of the underlying hardware and cut server costs to roughly 1/40 of the Python equivalent. I guess we could have written it in Python to start with but, economically, it would have been a waste of labor cost and time because, at some stage, we would have had to scrap it for a C++ version to get the performance we need. The Python code would have no economic value once scrapped.
To summarize, C++ may not be the most popular choice for a startup, but I believe modern C++ can be a viable choice, giving you near-C performance with high-level abstractions. I am worried about build times once the code base grows significantly, but hopefully modules, which have been proposed for C++17, will alleviate this.
I hope this post inspires others to look into C++ for their ventures.