One of the books I started reading during my sabbatical was Domain Driven Design: Tackling Complexity in the Heart of Software, by Eric Evans. I wanted to start learning more about software architecture and higher-level design principles. The book is full of useful, actionable insights and I highly recommend it (though it is quite long). The main goal of the book is to help the reader understand how to build the right models and abstractions for their given business domain, and use succinct but precise language in describing those models and their business rules / interactions. Reading this book while learning GraphQL at the same time caused me to think about API design in a totally new way.
GraphQL + ArangoDB = ❤
Many people view GraphQL as a replacement for REST, due to API discoverability, data fetching efficiency, easy deprecation, etc. While all of those benefits are real and valuable, I think the true value of GraphQL is in its utility as a design tool.
GraphQL is not actually a graph database query language, which confuses some people at first. Importantly, it is agnostic as to how you actually store your data behind the scenes. A single GraphQL query can pull data from many sources, and it can even be adopted incrementally by wrapping an old REST API. GraphQL allows a developer to model their domain as a combination of Types, Queries, and Mutations (along with Subscriptions, which I will not cover).
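As a sketch, a social-networking domain like ours might be modeled in GraphQL's schema language roughly like this (the type and field names here are illustrative, not our actual schema):

```graphql
# Types model the entities of the domain
type User {
  id: ID!
  name: String!
  friends: [User!]!
}

# Queries model the read-only use cases
type Query {
  user(id: ID!): User
  recommendedFriends(userId: ID!): [User!]!
}

# Mutations model the state-changing use cases
type Mutation {
  addFriend(userId: ID!, friendId: ID!): User
}
```

Notice that nothing in the schema says anything about storage — the same schema could be served from a graph database, DynamoDB, or a legacy REST API.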
Using a custom database instead of DynamoDB is a serious technical decision that must be justified by the business goals. For a social-networking recommendation service, a graph database is a perfect fit. If your business has similar requirements, using a graph database can give you a huge competitive advantage by allowing you to offer features that are nearly impossible with a relational or standard NoSQL database.
I try to follow the SOLID principles as closely as possible when writing code. Therefore I wrapped the ArangoDB driver behind a more general GraphDatabase interface that the GraphQL API can use. This also allows the user to supply a JSON configuration to initialize the graph database topology, collections, indices, etc. The developer ideally should not even know they are using ArangoDB, and in the future, if needed, the storage engine can be swapped out for some other graph database with no effect on the API code. This package will be open sourced as an npm module when I find time to write the tests and clean it up (or someone steps up to help me).
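A minimal sketch of what that wrapper boundary might look like — the method names here are illustrative, not the real module's API, and the in-memory implementation just demonstrates that the GraphQL layer never touches ArangoDB directly:

```javascript
// Generic interface the GraphQL resolvers depend on. Any backing
// store (ArangoDB today, something else tomorrow) satisfies this contract.
class GraphDatabase {
  async getVertex(collection, key) { throw new Error('not implemented'); }
  async addVertex(collection, doc) { throw new Error('not implemented'); }
  async addEdge(collection, from, to) { throw new Error('not implemented'); }
}

// In-memory implementation, useful for tests and as a stand-in.
class InMemoryGraphDatabase extends GraphDatabase {
  constructor() {
    super();
    this.vertices = new Map(); // "collection/key" -> document
    this.edges = [];           // { collection, from, to }
  }
  async getVertex(collection, key) {
    return this.vertices.get(`${collection}/${key}`) || null;
  }
  async addVertex(collection, doc) {
    this.vertices.set(`${collection}/${doc._key}`, doc);
    return doc;
  }
  async addEdge(collection, from, to) {
    const edge = { collection, from, to };
    this.edges.push(edge);
    return edge;
  }
}
```

An ArangoDB-backed implementation would extend the same class and delegate to the arangojs driver, so swapping storage engines touches only this one module.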
Another great benefit of adopting GraphQL is the tooling ecosystem. Once you define your entire API as a statically typed graph, you can do some interesting things with the introspection query. I added 2 routes to help with developer on-boarding at /graphql/playground and /graphql/voyager. The first lets you interact with the API to understand what it is capable of (similar to Postman). The second is an interactive graph visualization of the entire API! See below for screenshots.
It's important when writing software to follow scalable patterns, to avoid technical debt and increase the readability and maintainability of the code. I am a big fan of the Onion Architecture described here.
For example, our backend API follows this pattern in the following way:
Entities > GraphQL Types
Use Cases > GraphQL Queries & Mutations
Controllers > GraphDB Interface
External Interfaces > ArangoDB driver module
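To make the mapping concrete, here is a sketch of how a resolver at the use-case layer stays ignorant of the storage engine by talking only to an injected graph-database interface (all names here are illustrative, not our actual code):

```javascript
// Use-case layer: GraphQL resolvers receive a `db` through the
// request context. Nothing here knows that ArangoDB sits underneath.
const resolvers = {
  Query: {
    user: (_parent, { id }, { db }) => db.getVertex('users', id),
  },
  Mutation: {
    addFriend: async (_parent, { userId, friendId }, { db }) => {
      // Record the relationship as an edge, then return the updated user.
      await db.addEdge('friends', `users/${userId}`, `users/${friendId}`);
      return db.getVertex('users', userId);
    },
  },
};

module.exports = resolvers;
```

Because the dependency points inward (resolvers depend on an interface, not a driver), unit tests can pass in a fake `db` and never start a database.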
The only component that might be an issue for scaling is the database, as it is running on a single EC2 instance. The other components (S3, CloudFront, Lambda) have scaling built in. The good news is that ArangoDB uses memory-mapped files in its implementation, meaning as long as your database fits into system memory, you will get blazing performance similar to in-memory stores like Redis. Even on a t2.micro instance with 1 CPU and 1 GB of memory, the API was able to easily handle 10k concurrent requests. I used this tool to do the benchmarking: https://github.com/Nordstrom/serverless-artillery
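serverless-artillery consumes a standard Artillery script; a minimal one for this kind of load test might look like the following (the target URL and phase numbers are placeholders, not my actual benchmark settings):

```yaml
# script.yml — a standard Artillery load-test script
config:
  target: "https://example.execute-api.us-east-1.amazonaws.com/staging"
  phases:
    - duration: 60      # run for 60 seconds
      arrivalRate: 100  # spawn 100 new virtual users per second
scenarios:
  - flow:
      - post:
          url: "/graphql"
          json:
            query: "{ __typename }"  # cheapest possible GraphQL query
```

The tool deploys this as a Lambda-based load generator, so the traffic itself scales horizontally instead of being bottlenecked by your laptop.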
ArangoDB has great clustering support, which should be used if the app begins to gain traction. This is less for performance and more for resiliency: the EC2 instance could eventually fail, at which point the secondary DB server could step right up.
Graph databases are inherently difficult to shard, as there needs to be some knowledge of the business domain to do the sharding efficiently. Otherwise your requests will slow to a crawl due to the network hops between servers as the graph is traversed. Luckily ArangoDB has a solution to this problem called "Smart Graphs", though it is for enterprise customers only. The good news is that AWS has EC2 instances with up to 512 GB of system memory and 72 vCPUs, so there's tons of room for vertical scaling if your app starts to gain serious traction.
I would like to eventually merge my serverless-artillery branch, as well as create a separate test suite for load testing. As developers, we should know the maximum load our deployed apps can handle, and we should have data to prove it. This is not needed for the initial release, however; I just wanted to get a rough idea of the current performance.
Serverless + Webpack
Our serverless.yml file declares all the functions that our backend API needs to accomplish its goals. Many people split up their API into many small lambda functions, which has its merits. I personally like the “Lambda monolith” approach, as I do not want the infrastructure to affect the way I write code. I just want to write my express app, which handles all HTTP requests and routes, and ship that in a lambda. For HUGE express apps this does not make sense, but again this stack is tailored for the indie hacker / small startups. You can always split up the express app by routes later into separate lambda functions, but I think that is a premature optimization.
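A sketch of what the "Lambda monolith" looks like in serverless.yml: one function, with a catch-all HTTP event so every route is forwarded to the Express app (the service name, handler path, and runtime below are illustrative):

```yaml
service: my-backend

provider:
  name: aws
  runtime: nodejs8.10        # whichever Node runtime the project targets
  stage: ${opt:stage, 'staging'}

functions:
  api:
    handler: src/lambda.handler   # wraps the Express app for Lambda
    events:
      - http: ANY /
      - http: 'ANY {proxy+}'      # catch-all: every route hits the same function
```

Splitting into micro-functions later is mostly a matter of moving routes out of this one `api` entry, which is why I'm comfortable deferring that decision.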
I am also a fan of git mono-repos, even for large projects. So splitting up the backend into all these little lambda micro-services doesn't make sense to me. Perhaps for large distributed enterprise teams this is a good idea. Each lambda function should definitely be its own npm module, however, to enforce clean boundaries. Thanks to yarn workspaces, this is easy.
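With yarn workspaces, the mono-repo root just declares where the packages live; each lambda then gets its own package.json under one of those globs (the directory names here are illustrative):

```json
{
  "private": true,
  "workspaces": [
    "packages/*",
    "services/*"
  ]
}
```

yarn hoists shared dependencies to the root node_modules and symlinks the local packages together, so cross-package imports work without publishing anything.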
We use webpack to transpile our backend node.js code with babel, as well as bundle all the other necessary assets (email .html templates, .graphql files, .AQL queries, etc.). We can also ship standalone binaries with our service, which we are doing for our database backup and restore service. You usually want to compile the binary yourself on an AWS EC2 instance running Amazon Linux, but knowing that Amazon Linux is based on CentOS 7, you might be able to get away with just shipping the CentOS 7 binaries. For the arangodump and arangorestore binaries, this seems to have worked.
What's more, we can actually generate Flow types from our GraphQL schema! I cannot stress enough how awesome this is. This not only makes implementing the API easier, but creates an even tighter coupling between the backend and frontend, as our React app can import the types. I am even thinking about just using Flow on the frontend and ditching the prop-types package altogether.
Our deployment is pretty simple, as we just replace the lambdas all at once. In the future we can allow for canary deployments, which provide even more protection against mistakes and outages. For small projects with little traffic, this is another premature optimization. The good news is the Serverless framework will most likely have a plugin to support this in the future, as AWS now natively supports it in API Gateway.
Follow the steps in the README to deploy both the staging and production backend APIs. We will need these APIs for the React frontend which we will now deploy in Part 4.
Here is another excellent blog post about GraphQL + React using the apollo client.
To see what a real production app looks like using these technologies, check out https://github.com/withspectrum/spectrum
Read this amazing blog post if you want to really understand graph theory.