Systems we built and decisions we made

Revant
Math Camp Engineering
17 min read · Jul 14, 2016

Co-authored by: Revant, Brandon Kase, Benjamin Garrett, Christina Lee, and Alison Kohl

Acquisitions are a great opportunity for new beginnings, but they also offer a rare chance for reflection. In wrapping up our operations at Math Camp, the engineering team looked back, project by project, at the problems and solutions of each iteration of our engineering pursuits. While we spent a lot of time on day-to-day maintenance, we also had our share of novel breakthroughs, scrappy patches, and uses of new technology that few other companies have had the chance to test in the wild. In considering our legacy, we wanted to not only open source the code for some of these projects, but also walk through the decision-making that led us to each solution. Many of our decisions were made to optimize for quick iteration and maintainability, and as you read through this post, you'll find technologies marked as Chosen or Built to explain the thought behind each. We hope that these solutions, and the methods we used to arrive at them, may be helpful in future engineering pursuits.

Best,

The Math Camp Team

We have broken down the post into the following sections:
Client: Languages & Build, Data flow, Drawing data, Fetching data, Sending data, and Storing data
Server: Serving architecture, Language & Repo, Auth, Experimentation, Notifications, Storage, and Processing data

We also have a list of the libraries we are open-sourcing here in case you just want to look at the code.

Client

Languages & Build

  • Swift (Chosen): We started using Swift in July 2014, a month after it was announced. We really enjoyed using Swift, but there were definitely some difficulties along the way. Getting good at writing Swift took about six months, dealing with Swift language updates was difficult, and prior to Swift 2, compile times were about 10x slower than with Objective-C. Additionally, we frequently ran into incorrect code that successfully compiled, correct code that did not compile, and even correct code that segfaulted the compiler. However, we did make good use of the modern features of the language: protocol-oriented programming for code reuse, sum types (including the built-in Optional), and zero (uninhabited) types, to name a few.
    TL;DR: Working with a new language had its inherent challenges and cost us time in refactors for version changes and slow compilations, but working in a modern language made us better and faster developers and allowed us to write clean and efficient code not otherwise possible in Obj-C. We were very happy with this decision and would make the same choice again today.
  • Kotlin on Android (Chosen): Kotlin is a terse, Swift-like, functional programming language on the JVM platform with high Java interoperability. Kotlin is unique in that its standard library and runtime are extremely small in bytes, yet very powerful. Most of it is eliminated at compile time (in contrast to Scala), so it can feasibly be used in memory-constrained environments like Android. We wrote a ton of dependency-free extensions and operators in a library we call Cheatsheet.
    TL;DR: We’re really glad we made this decision, but build times suffered early on. This is being fixed in Kotlin going forward, so it may no longer apply.
  • Buck with Kotlin (Chosen): When we were writing a lot of Kotlin, Gradle builds were very slow (although this may have improved since then). Hacking Kotlin into Buck gave us a 5x speedup in builds.
    TL;DR: Build times vastly improved, but setup took several days and a lot of intricate tooling.
  • TypeScript (Chosen): When writing our React Native app, we wanted some type safety to statically catch errors. We chose TypeScript, which compiles to plain JavaScript but allows some type errors to be surfaced in the IDE. Early versions of TypeScript were moderately useful but left a lot to be desired. As of the most recent release at the time (1.8), the language took a huge step forward and surfaced much better type-mismatch errors.
    TL;DR: TypeScript helped us write correct code and the language continues to improve, but we wish we had spent more time investigating Flow before choosing TypeScript.

Data flow

We used streams, futures, and observables instead of async tasks, callbacks and async queue dispatches. This post explains in detail our thought process behind these decisions.

  • Streams (RxJava/RxSwift) (Chosen): From the README: “[Rx] extends the observer pattern to support sequences of data/events and adds operators that allow you to compose sequences together declaratively while abstracting away concerns about things like low-level threading, synchronization, thread-safety and concurrent data structures.”
    TL;DR: We were overall very happy with this choice. However, the ramp-up is longer than with first-party async/concurrency primitives, and if you don’t understand what’s going on under the hood, you can get bitten by unexpected behavior.
  • One way data flow: In both our iOS and Android apps, we introduced UI architectures based on unidirectional cyclic data flow (a minimal sketch follows this list). The reasons for this were numerous, but key among them was the desire to easily incorporate data from local state, user input, and server updates without complicated merge logic. Designing our system with these Cycle.js-inspired components also allowed us to offload complicated logic from activities and view controllers into more easily testable and composable components. Both the iOS and Android implementations of unidirectional data flow relied extensively on Rx, which meant the learning curve was high for any team members not previously familiar with Observables. Also see: Cyklic and Dogtag
    TL;DR: Adopting unidirectional UI architectures made for code that was easily modifiable and easy to reason about and test, but this approach is likely not feasible at scale unless engineers have significant time to become comfortable with drastically different paradigms.
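
To make the stream-based, unidirectional approach concrete, here is a minimal Kotlin/RxJava sketch in the spirit of our Cycle.js-inspired components. The names (`Intent`, `State`, `model`) are illustrative, not the Cyklic or Dogtag APIs: user interactions become values on a stream, a pure reducer folds them into immutable state, and the view only subscribes to that state.

```kotlin
import io.reactivex.Observable
import io.reactivex.subjects.PublishSubject

// User interactions are modeled as values ("intents") on a stream.
sealed class Intent {
    object Increment : Intent()
    object Decrement : Intent()
}

// Immutable state that the view renders from.
data class State(val count: Int = 0)

// Pure reducer: folds intents into state. No UI or IO in here.
// scan() emits the seed state first, then one new state per intent.
fun model(intents: Observable<Intent>): Observable<State> =
    intents.scan(State()) { state, intent ->
        when (intent) {
            Intent.Increment -> state.copy(count = state.count + 1)
            Intent.Decrement -> state.copy(count = state.count - 1)
        }
    }

fun main() {
    // In the app these intents come from button taps; here we push them by hand.
    val intents = PublishSubject.create<Intent>()

    // The view layer only subscribes to state and renders it.
    model(intents).subscribe { state -> println("render: count=${state.count}") }

    intents.onNext(Intent.Increment)
    intents.onNext(Intent.Increment)
    intents.onNext(Intent.Decrement)
}
```

Because the reducer is a pure function from (state, intent) to state, it can be unit tested without any activities or view controllers involved.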

Drawing data

  • Cyklic (Built): Cyklic brings ideas that are starting to become popular on the web to native Android. Cyklic is a discrete functional reactive UI component framework inspired by Cycle.js (similar to the Elm architecture). Using these architectures leads to (mostly) pure application logic. Pure code is simpler and easier to reason about. Adopting this framework helped us find edge cases and gave us the nice “if it compiles, it works” property for our UI components.
  • Dogtag (Built): Dogtag builds off the work done in Cyklic to bring unidirectional circular data flow to UI components in iOS. It demonstrates a method of organizing data that allows you to offload complicated logic from ViewControllers into a model and view model layer. The main responsibility of ViewControllers becomes binding view variables (the alpha value of a button, for example) to their corresponding element in the UI.
  • React Native (Chosen): React Native enables you to build native mobile apps using JavaScript. One of the biggest advantages of this is that you can send down JavaScript code for components from the server. This enabled us to ship code and fix bugs without going through the app release cycle. Unlike most React Native users, our app is not React Native with a little bit of native code; it’s a native app with a little bit of React Native code. We built a few screens of our app in React Native and also created a React-Bridge so that the JavaScript code could interact with native app components. We used React Native for instant over-the-air updates for minor UI tweaks and bug fixes so we didn’t have to wait for App Store approval. Overall we were happy with these benefits, but there were downsides. For example, React didn’t perform well for long lists of things (no view cell reuse yet) and debugging some issues was a bit tricky.

Fetching data

  • Server controlled onboarding pipeline (Built): For fast iteration on onboarding, we built an instruction pipeline that would be sent from the server and processed by the client. Instead of sending down the items and writing the business logic on the clients, we defined a set of instructions that each client would implement and kept the business logic on the server, which parsed everything and sent down instructions to be processed sequentially, much like CPU instructions.
    This worked out great for us and enabled us to test multiple different onboarding flows with ease.
  • Adaptive image download (Built): We recorded how long it took to download and display every image or video the user saw. This allowed the app to get a sense of the user’s overall experience: does it feel fast or slow? Are they seeing loading indicators or not? If the moving average of media display times rose above a certain level, the app would start downloading lower resolution media. Once the user was on a faster network again, the moving average would drop enough that the client would switch back to full resolution. This system made the app dramatically more usable in varying network environments (a small sketch of the idea follows this list).
  • Mobile config (Built): We built a server-controlled config, which the client would fetch on app load. It was a local key-value configuration that could be overridden from the server. It enabled us to change how the app behaved without pushing an app update. It also enabled us to run experiments and change the look and feel of the app on a per-user basis; for example, we built the app so that the client could get a font, size, and color for a particular text element. For those of you who want to use mobile config, Firebase just released something very similar.
  • Retrofit & Alamofire (Chosen): We used Retrofit for our REST client code on Android and Alamofire on iOS. We chose these mostly because of the strong community around these libraries and Retrofit specifically because our server APIs were RESTful. We had no problems using these libraries.
  • Image caching (Chosen): In a mobile app, if you don’t nail your cache code, your users will suffer. Excessive networking drains both battery and data plans. We can help ensure a clean, correct implementation by composing caches. Choosing the right image fetching and caching library is tricky. There are plenty of them out there, but somehow none of them are completely satisfactory (or were when we looked).
    On Android, Picasso has a nice API and is very simple to use. However, it uses the standard HTTP cache and HTTP cache headers, so the caching layer isn’t very flexible. We decided to use Volley by Google as it was easier to play around with the caching layer. If you don’t have to do that, we would recommend Picasso.
    On iOS, we used Carlos, a composable caching library. Consider what it means to be a cache. You need to be able to (1) associate a key with a value and (2) get some value given a key if such a value exists. That’s basically it. Caches tend to appear in layers. In a CPU, memory reads check L1, then L2, then L3, then RAM. When we want to load an image, we first check RAM, then disk, and finally network. Given two caches A and B, A `on-top-of` B means first check A, fall through to B, then write back to A. With this composition we can define a monoid for caches. Monoids imply easy composition, and easy composition means reasoning about our code becomes easy (a sketch of this layering follows after this list). Carlos is nice because you can reuse the same network cache logic across not just images, but other binary assets and JSON data. It is also very easy to decouple what would be complex performance logic from the actual caching logic.
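
The adaptive media-quality logic above is easy to sketch. This is not the production code from our clients; it is a small Kotlin illustration of the idea, and the window smoothing factor, thresholds, and names are assumptions for the example: keep a moving average of recent display times and use it, with a little hysteresis, to pick a resolution.

```kotlin
// Illustrative thresholds; real values would be tuned against production metrics.
private const val SLOW_THRESHOLD_MS = 800.0   // above this, fall back to low-res
private const val FAST_THRESHOLD_MS = 300.0   // below this, return to full-res
private const val SMOOTHING = 0.2             // weight given to the newest sample

enum class MediaQuality { FULL, LOW }

class AdaptiveQualityPicker {
    private var averageDisplayMs: Double? = null
    private var quality = MediaQuality.FULL

    // Called every time an image or video finishes loading on screen.
    fun recordDisplayTime(elapsedMs: Long) {
        val previous = averageDisplayMs
        averageDisplayMs = if (previous == null) {
            elapsedMs.toDouble()
        } else {
            // Exponentially weighted moving average of recent display times.
            SMOOTHING * elapsedMs + (1 - SMOOTHING) * previous
        }
        updateQuality()
    }

    private fun updateQuality() {
        val avg = averageDisplayMs ?: return
        // Two thresholds (hysteresis) so one slow or fast load doesn't flip us back and forth.
        quality = when {
            avg > SLOW_THRESHOLD_MS -> MediaQuality.LOW
            avg < FAST_THRESHOLD_MS -> MediaQuality.FULL
            else -> quality
        }
    }

    // The image loader asks this before choosing which URL/variant to request.
    fun currentQuality(): MediaQuality = quality
}
```

The loader would call recordDisplayTime(...) after each render and consult currentQuality() when building the next media request.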
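
Carlos is a Swift library, but the layering idea itself is language-agnostic. Below is a Kotlin sketch of the `on-top-of` composition described above; the interface and names are ours for illustration, not Carlos’s API: check the faster cache first, fall through to the slower one, and write the result back on the way up.

```kotlin
// Minimal cache abstraction: associate a key with a value, get a value if present.
interface Cache<K, V> {
    fun get(key: K): V?
    fun set(key: K, value: V)
}

class MemoryCache<K, V> : Cache<K, V> {
    private val map = HashMap<K, V>()
    override fun get(key: K): V? = map[key]
    override fun set(key: K, value: V) { map[key] = value }
}

// "this onTopOf slower": check this cache first, fall through to `slower`,
// then write the value back into this cache so the next read is a fast hit.
infix fun <K, V> Cache<K, V>.onTopOf(slower: Cache<K, V>): Cache<K, V> {
    val faster = this
    return object : Cache<K, V> {
        override fun get(key: K): V? =
            faster.get(key) ?: slower.get(key)?.also { faster.set(key, it) }

        override fun set(key: K, value: V) {
            // Writes go to both layers so they stay consistent.
            faster.set(key, value)
            slower.set(key, value)
        }
    }
}

// Usage: ram onTopOf disk onTopOf network, mirroring the RAM -> disk -> network order above.
// Composition is associative, and an always-empty cache acts as an identity, which is
// what makes the monoid framing work.
```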

Sending data

  • Upload pipeline (Built): A major part of our app was uploading the photos and videos that users shared. At times users would share tens of photos every session, and we wanted to make sure all of them got uploaded. We put all the media to be uploaded in a queue and split the upload process into atomic stages: ready for upload, formatted locally, generated cloud storage URL, uploaded to cloud storage, and finalized on our server. Our upload pipeline was a state machine backed by disk, so even if the app was killed in the background, we could finish the upload later. If a particular stage failed, we would put the item back in the upload queue (a rough sketch follows this list). Recently, Android launched a job scheduler which should make implementing this easier (on Android), but we didn’t get a chance to test it.
  • Offline sync (Built): We wanted our app to be usable on poor or non-existent networks. First of all, you need to make sure you prefetch often enough: the data needs to be in the app even if there is no network when you open it. Our prefetching system ended up becoming a variant of our Carlos caching system (see above). Secondly, if you swipe a photo, you should not see it again. Ever. We built a system that ensured certain requests would eventually complete successfully. So we didn’t just hit the network for these requests; we also stored the information persistently on the client so that, for example, we would know not to show you the same photo twice.
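
Here is a rough Kotlin sketch of the staged upload pipeline described above. The stage names follow the list in the text, but the store interface and helper names are illustrative, and the persistence and network calls are stubbed out (the real implementation was disk-backed and talked to our API): each item carries its current stage, every transition is persisted, and a failure simply requeues the item at its last completed stage.

```kotlin
// Stages mirror the list above; an item is persisted with its current stage so an
// upload can resume after the app is killed.
enum class UploadStage { READY, FORMATTED, URL_GENERATED, UPLOADED, FINALIZED }

data class UploadItem(val mediaId: String, val stage: UploadStage)

// Stand-ins for the disk-backed queue; illustrative, not our actual storage API.
interface UploadStore {
    fun save(item: UploadItem)      // persist progress after every stage
    fun requeue(item: UploadItem)   // put a failed item back on the queue
}

class UploadPipeline(private val store: UploadStore) {

    // Advance one item a single stage; returns the updated item or null when finished.
    fun step(item: UploadItem): UploadItem? {
        val next = when (item.stage) {
            UploadStage.READY -> item.copy(stage = UploadStage.FORMATTED)          // format locally
            UploadStage.FORMATTED -> item.copy(stage = UploadStage.URL_GENERATED)  // ask server for an upload URL
            UploadStage.URL_GENERATED -> item.copy(stage = UploadStage.UPLOADED)   // PUT bytes to cloud storage
            UploadStage.UPLOADED -> item.copy(stage = UploadStage.FINALIZED)       // tell our server it's done
            UploadStage.FINALIZED -> return null
        }
        store.save(next) // persist after each atomic stage so we can resume later
        return next
    }

    // Drive an item to completion, requeueing on failure instead of losing progress.
    fun process(start: UploadItem) {
        var item = start
        while (true) {
            val next = try {
                step(item)
            } catch (e: Exception) {
                store.requeue(item) // resume from the last completed stage later
                return
            }
            item = next ?: return // null means FINALIZED: nothing left to do
        }
    }
}
```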

Storing data

  • Fiberglass (Built): Lightweight shared preferences in Kotlin. Delegated properties in Kotlin not only let you execute arbitrary code when a field is accessed, but also package that code up for reuse; you can reuse the same delegate for different properties (a sketch of the mechanics follows this list). We used this extensively instead of Android’s SharedPreferences directly. Take a look at the readme for more details.
  • Lightbase (Built): We built a Swift wrapper around SQLite to be able to store and load structured data from disk easily. We had tried some of the other databases around, but they either had complicated usage patterns or caused crashes.
  • StoredObject (Built): Similar to Lightbase, we also wrote StoredObject for Android. It uses SQLite as the backend but allows you to have an arbitrary schema. This made storing and loading objects from disk very straightforward without having to deal with any SQL boilerplate. We also have an Rx version of it.
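
Fiberglass itself wraps Android’s SharedPreferences; to keep this sketch self-contained, the example below backs the delegate with a plain in-memory map instead, but the delegated-property mechanics are the same. The `stringPref` helper and `FakePrefs` object are ours for illustration, not the Fiberglass API.

```kotlin
import kotlin.properties.ReadWriteProperty
import kotlin.reflect.KProperty

// Stand-in for SharedPreferences so the example runs anywhere.
object FakePrefs {
    private val values = mutableMapOf<String, String>()
    fun getString(key: String, default: String): String = values[key] ?: default
    fun putString(key: String, value: String) { values[key] = value }
}

// A reusable delegate: every read/write of the property goes through the store.
class StringPref(private val key: String, private val default: String) :
    ReadWriteProperty<Any?, String> {
    override fun getValue(thisRef: Any?, property: KProperty<*>): String =
        FakePrefs.getString(key, default)

    override fun setValue(thisRef: Any?, property: KProperty<*>, value: String) {
        FakePrefs.putString(key, value)
    }
}

fun stringPref(key: String, default: String = "") = StringPref(key, default)

class Settings {
    // Reads and writes of these properties are persisted transparently,
    // and the same delegate is reused for as many properties as we like.
    var userName: String by stringPref("user_name")
    var theme: String by stringPref("theme", default = "light")
}

fun main() {
    val settings = Settings()
    settings.userName = "alice"
    println(settings.userName) // "alice", read back through the delegate
    println(settings.theme)    // "light" default until something writes it
}
```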

Server

Serving architecture

  • App Engine (Chosen): We used to use AWS EC2 VMs with our database on RDS and DynamoDB, and wrote some libraries for that setup (flywheel, pypicloud, dql, steward), but we realized we were spending too much time just maintaining our web servers. We wanted to move to a solution where we didn’t need to do any of this and were deciding between containers and App Engine. Two years ago, containers didn’t really have a good hosted platform (like Google Container Engine), and even though they were somewhat simpler, it still would have been a lot of work to manage. App Engine, with its recent additions (modules, versions, and queues) and its awesome hosted and fully managed Cloud Datastore, was the winner. We’ve been very happy with App Engine, as we’ve had to spend minimal time maintaining and scaling our server stack. You can also stream App Engine logs and Datastore entities to BigQuery, which makes doing analysis quick and easy.
  • Go-restful (Chosen): We decided to go with go-restful, which uses the standard Go net/http server but adds nice routing on top, specifically the ability to define REST services and generate Swagger documentation from them.
  • Swagger (now known as the OpenAPI Initiative) (Chosen): Swagger is a specification and toolset for describing REST APIs. Our server framework outputs information about our REST APIs in the Swagger specification format. We use Swagger-UI to visualize and test our APIs. We even use it as a dashboard where non-engineers can poke at our backend.
    TL;DR: It was a bit cumbersome to get Swagger set up the first time, but after that it worked great for us and was heavily used.
  • Paging (Chosen/Built): When sending a large (or infinite) list of data to a client (which can only view part of it at a time), the right thing to do is to break the list into chunks (or pages) and send the pages one at a time. This way the client doesn’t have to wait for a ton of data before being able to view it, and the datastore doesn’t have to return too many items at once. We implemented a paging solution on App Engine similar to the design of the Facebook Paging API. When you receive a page of data, you receive in JSON the list of data along with `next` and `prev` URLs to get the next and previous page respectively. The URLs themselves use the query parameters “limit”, “since”, and “until”. “Limit” is the maximum number of items to return in the page. “Since” and “until” refer to the beginning and end of the range of data that we would like to get (both are optional; the default is now). Usually `since` and `until` are timestamps; however, our system can also sort alphabetically by some field of your data and provide that information to since and until. A rough sketch of the scheme follows below.
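
Our implementation ran in Go on App Engine, but the paging contract itself is easy to show. Below is a Kotlin sketch of the response shape and of one way next/prev URLs can be derived from limit/since/until; the field names follow the description above, while the base path, helper names, and the exact boundary arithmetic are illustrative assumptions.

```kotlin
import java.net.URLEncoder

// Shape of a single page as described above: the data plus next/prev URLs.
data class Page<T>(
    val data: List<T>,
    val next: String?,  // URL for the page of older items
    val prev: String?   // URL for the page of newer items
)

data class Item(val id: String, val timestampMs: Long)

// Build one page of a timestamp-ordered list and the URLs to continue paging.
fun pageOf(items: List<Item>, limit: Int, until: Long, basePath: String): Page<Item> {
    // Newest-first slice of everything at or before `until`, capped at `limit` items.
    val slice = items
        .sortedByDescending { it.timestampMs }
        .filter { it.timestampMs <= until }
        .take(limit)

    fun url(params: Map<String, String>) =
        basePath + "?" + params.entries.joinToString("&") {
            "${it.key}=${URLEncoder.encode(it.value, "UTF-8")}"
        }

    val oldest = slice.lastOrNull()?.timestampMs
    val newest = slice.firstOrNull()?.timestampMs

    return Page(
        data = slice,
        // Next page: items strictly older than the oldest one we just returned.
        next = oldest?.let { url(mapOf("limit" to "$limit", "until" to "${it - 1}")) },
        // Previous page: items strictly newer than the newest one we just returned.
        prev = newest?.let { url(mapOf("limit" to "$limit", "since" to "${it + 1}")) }
    )
}
```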

Language & Repo

  • Go (Chosen): Once we had decided on App Engine, we had to choose one of the languages compatible with it. We decided on Go because it has goroutines, compiles and executes quickly, is easy to learn, has great tooling, and is great for IO-intensive work. Goroutines make it straightforward to write concurrent IO calls, which are usually the bottleneck for web servers. We wrote a small forkjoin-type library for Go called concrunner, which runs a fixed number of concurrent requests and combines the errors and results from them. However, the lack of generics (or something with a similar amount of type-safe abstraction) is an issue at times.
  • Repo structure (Chosen): For Go, repo structure implies package structure. We used a monolithic repo for both the client and the server code. Inside our server codebase, we had the following directories: api, controllers, io, and models. api contained all the routing and REST service definitions. controllers contained all the business logic as functions that can be tested; for example, the usercontroller package would have a method like CreateNewUser(ctx, name) -> (id, error). io contained code for performing IO on one particular model; for example, userio had methods for accessing the user object from the db/memcache. models contained the model definitions (structs in Go) for each type, along with any transformations that don’t require IO. After we had used Go for a while, we realized that it is extremely useful to also nest packages under controllers, io, and models so that, for example, `models.UserId` becomes `user.Id`.
    TL;DR: This worked great for us. It was very clear for developers when they were hitting the db and when it was just a local call. It also made it easy to organize and test code.

Auth

  • JWT (Chosen): We decided to go with JWT (JSON Web Token) for authentication. JWT is great because you can verify and decode the access token to confirm the identity of the user instead of looking it up in the database. For those who don’t know what this is: it lets you base64-encode and sign a JSON object (which contains the userid and any other metadata) using a secret key. You can then send this string down as an access token to the client on first login. Subsequent requests send that token, and you can get the userid from it without hitting the database (a minimal sketch of the scheme follows after this list).
  • Magic Link (Built): We wanted to reduce friction for people signing up from invites for our app Roll. We built Magic Link so that if a user clicked an invite link and installed the app, they would automatically be signed in. Since we were using phone numbers for login and for sending the invite, we signed in the invited user via a special link: the invite link sent to a phone number contained a code, which would be sent to the server after app install and act as a verification token. To pass the code between the link click and the app install, you can use an install-tracking analytics tool (Google Analytics also offers this).
    TL;DR: It was magical when it worked, but sometimes people wanted to change their phone numbers, so we added a confirmation step before automatically signing them in.
  • Phone (Built): Since we used phone numbers for authentication, we built a phone parsing library for Go. It figures out the country and whether the given number is valid for that country. Another neat thing about it is that all the country definitions live in country-phones.json, a file that can be used on the client for phone number formatting as well.
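
Our auth code was in Go on the server, but the token scheme described above is simple enough to sketch anywhere. The Kotlin example below shows an HMAC-signed, base64url-encoded token carrying a user id; it omits the standard JWT header, claims, and expiry handling, so treat it as a picture of the idea rather than a drop-in JWT implementation.

```kotlin
import java.util.Base64
import javax.crypto.Mac
import javax.crypto.spec.SecretKeySpec

private val b64 = Base64.getUrlEncoder().withoutPadding()

private fun hmacSha256(secret: ByteArray, message: ByteArray): ByteArray {
    val mac = Mac.getInstance("HmacSHA256")
    mac.init(SecretKeySpec(secret, "HmacSHA256"))
    return mac.doFinal(message)
}

// Issue a token at login: base64url(payload) + "." + base64url(signature).
// A real JWT also carries a header and standard claims (exp, iat, ...).
fun issueToken(userId: String, secret: ByteArray): String {
    val payload = b64.encodeToString("""{"userId":"$userId"}""".toByteArray())
    val signature = b64.encodeToString(hmacSha256(secret, payload.toByteArray()))
    return "$payload.$signature"
}

// Verify on every request: recompute the signature and compare, then decode
// the payload to get the user id without touching the database.
fun verifyToken(token: String, secret: ByteArray): String? {
    val parts = token.split(".")
    if (parts.size != 2) return null
    val (payload, signature) = parts
    val expected = b64.encodeToString(hmacSha256(secret, payload.toByteArray()))
    if (expected != signature) return null // tampered; use a constant-time compare in real code
    return String(Base64.getUrlDecoder().decode(payload)) // JSON containing the userId
}

fun main() {
    val secret = "server-side-secret".toByteArray()
    val token = issueToken("user-123", secret)
    println(verifyToken(token, secret))       // {"userId":"user-123"}
    println(verifyToken(token + "x", secret)) // null: signature check fails
}
```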

Experimentation

  • Yatz (Built): We built our own A/B testing framework for App Engine in Go on top of logs and BigQuery, inspired by BigBingo. Yatz allowed us to do A/B testing without making any extra network requests. It can be broken down into three parts: managing experiments, logging events, and analyzing the events. The experiments and variants were stored in an active_experiments file that got pushed to our servers. An event, along with all the user’s experiment variants, was logged with standard App Engine logs. These logs were streamed to BigQuery, an optional feature of App Engine. We then wrote scripts which made it easy to analyze this data using BigQuery. A nice feature of Yatz: choosing a variant involves merely hashing an id, and reporting events is just logging. Therefore, experimentation with Yatz adds essentially zero overhead (no I/O necessary). A sketch of the variant assignment follows after this list.
  • Feature gating (Built): Rather than littering our codebase with version checks (if version >= 1.0.3), we used feature checks (if ctx supports the tagging feature). That way, when you go back to look at the code again, you understand the intent of the branch. Under the hood, it is implemented using app versions, but abstracting the version checks out of the application logic made our code a lot cleaner.
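
The “choosing a variant is just hashing an id” point is worth making concrete. Yatz itself is Go on App Engine; the Kotlin sketch below only illustrates the assignment scheme, and the experiment names, weights, and choice of MD5 are assumptions: hash the user id together with the experiment name and map the hash onto weighted buckets, so assignment is deterministic and needs no storage or network round trip.

```kotlin
import java.math.BigInteger
import java.security.MessageDigest

// An experiment is a name plus weighted variants, as pushed in an active_experiments file.
data class Experiment(val name: String, val variants: List<Pair<String, Int>>)

// Deterministic assignment: the same user always lands in the same variant,
// and no I/O is needed to decide.
fun variantFor(userId: String, experiment: Experiment): String {
    val digest = MessageDigest.getInstance("MD5")
        .digest("${experiment.name}:$userId".toByteArray())
    val totalWeight = experiment.variants.sumOf { it.second }
    // Map the hash onto [0, totalWeight) and walk the weighted buckets.
    var bucket = BigInteger(1, digest).mod(BigInteger.valueOf(totalWeight.toLong())).toInt()
    for ((name, weight) in experiment.variants) {
        if (bucket < weight) return name
        bucket -= weight
    }
    return experiment.variants.last().first // unreachable if weights are positive
}

fun main() {
    val exp = Experiment("onboarding_copy", listOf("control" to 50, "friendly" to 50))
    println(variantFor("user-123", exp)) // stable across calls and across devices
    // Reporting is then just a log line containing the user's variants,
    // which App Engine streams to BigQuery for analysis.
}
```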

Notifications

  • Notification Engine (Built): Sending the right number of notifications at the right time was very important to us, as was the ability to tweak and test those things. We built a system which made batching and scheduling notifications super easy. Sending notifications like “Alice and 5 others liked your photos (5 minutes ago)” required minimal effort (a small sketch of the batching follows this list). Take a look at this blog post to learn more.
  • Google Cloud Messaging (Chosen): We used GCM to send notifications to both iOS and Android. You can send notifications using the same REST calls and don’t have to deal with APNS sockets. Since App Engine doesn’t allow persistent socket connections, sending notifications to iOS devices via GCM was much faster as well.
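
The Notification Engine itself is described in the linked post; to give a flavor of the batching it made easy, here is a small Kotlin sketch (illustrative, not the engine’s API or implementation) that collapses a window of pending like-events into a single message of the kind quoted above.

```kotlin
// A pending event waiting in the batching window for a given recipient.
data class LikeEvent(val likerName: String, val photoId: String)

// Collapse everything that accumulated during the window into one notification.
// The copy format mirrors the example in the text; the function itself is illustrative.
fun collapseLikes(pending: List<LikeEvent>): String? {
    if (pending.isEmpty()) return null
    val likers = pending.map { it.likerName }.distinct()
    val photos = pending.map { it.photoId }.distinct().size
    val photoWord = if (photos == 1) "photo" else "photos"
    return when (likers.size) {
        1 -> "${likers.first()} liked your $photoWord"
        else -> "${likers.first()} and ${likers.size - 1} others liked your $photoWord"
    }
}

fun main() {
    val window = listOf(
        LikeEvent("Alice", "p1"),
        LikeEvent("Bob", "p1"),
        LikeEvent("Carol", "p2"),
        LikeEvent("Dave", "p3"),
        LikeEvent("Eve", "p3"),
        LikeEvent("Frank", "p4")
    )
    println(collapseLikes(window)) // Alice and 5 others liked your photos
}
```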

Storage

  • Google Datastore (Chosen): We wanted a flexible, scalable, and hosted database. We used to host our own database, but managing and scaling it became a full-time job, and we wanted to move away from that. After looking into some of the hosted databases, we decided to go with Google Cloud Datastore. It’s a NoSQL database that supports transactions and indexes, which was exactly what we wanted, and it was one of the things that made us choose App Engine as our serving architecture. It is flexible: you can add fields to your structured objects even after they have lived in production. It scales automatically, it’s fast (~10ms query times), and it supports limited transactions when you want them.
  • Storing data in keys (Chosen): Cloud Datastore keys are just arbitrary strings (as in most NoSQL databases). Instead of just using UUIDs for keys, we stored data in the keys themselves. This let us enforce uniqueness for the stored objects and also meant fewer database calls. An example of this is our Edge structure, which is used to store any relationship in a social graph. An edge has a source, a target, and a type, and our edge key was just a concatenation of those three things: “src:type:tgt”. This way we didn’t have to check for an existing edge before storing a new one, and it made all edge queries fast, since you only needed to fetch the keys (a sketch follows after this list).
  • Using key-only queries in Datastore (Chosen): Key-only queries in Datastore are free, and so is memcache access. We used both of these to bring our Datastore costs down: (1) we stored data in keys (see above), (2) we stored objects in memcache, and (3) we would do key-only queries and then fetch any data we wanted from memcache.
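
Our Datastore code was in Go, but the key scheme is just string construction, so here is a Kotlin sketch of the edge key described above and the read path it enables. The interfaces are stand-ins for Datastore key-only lookups and memcache, and the names are ours for illustration: because the key encodes source, type, and target, an existence check never needs to read or deserialize the entity.

```kotlin
// An edge in the social graph: source, type, target, as described above.
data class Edge(val src: String, val type: String, val tgt: String)

// The key encodes the whole relationship, so writing the same edge twice just
// overwrites the same row, which gives us uniqueness for free.
fun edgeKey(e: Edge): String = "${e.src}:${e.type}:${e.tgt}"

// Stand-ins for Datastore key-only lookups and memcache; the real code used
// App Engine's datastore and memcache APIs from Go.
interface KeyIndex {
    fun exists(key: String): Boolean            // key-only query: cheap
}
interface EntityCache {
    fun get(key: String): Edge?
    fun put(key: String, e: Edge)
}
interface EntityStore {
    fun get(key: String): Edge?                  // full entity fetch: costs a read
}

class EdgeReader(
    private val index: KeyIndex,
    private val cache: EntityCache,
    private val store: EntityStore
) {
    // "Does Alice follow Bob?" only ever needs the key, never the entity.
    fun edgeExists(src: String, type: String, tgt: String): Boolean =
        index.exists(edgeKey(Edge(src, type, tgt)))

    // When we do need the entity, try the cache first and only then pay for a datastore read.
    fun getEdge(src: String, type: String, tgt: String): Edge? {
        val key = edgeKey(Edge(src, type, tgt))
        cache.get(key)?.let { return it }
        return store.get(key)?.also { cache.put(key, it) }
    }
}
```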

Processing data

  • Data migration (Built): Sometimes you have to change something in all stored objects of some type. When there aren’t many objects, it is fast enough to just iterate through all of them serially and apply transformations. However, this becomes unusable as the number of objects grows. We built a simple migration tool on top of App Engine which allows changing these objects in parallel in Go. It streams over all the keys, batches them, and schedules each batch on an App Engine queue. Each batch fetches the objects for those keys, deserializes them into their Go struct types, runs a transformation, and returns storable objects. This system is a bit tied into our codebase, so we haven’t open sourced it yet, but we hope to do so soon.
  • Distributed ForkJoin for App Engine (Built): We built a distributed forkjoin using memcache on App Engine for running batch jobs and combining the results of those jobs across machines. The way we structured it, the last job would know it is the last one and could do whatever it wanted with the combined result. We plan to open source this soon. A rough sketch of the pattern follows below.
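
The post above doesn’t spell out the mechanism beyond “using memcache”, so the Kotlin sketch below shows one common way a “last job knows it is last” scheme can work, using an atomic counter and an in-process queue as stand-ins for memcache’s atomic decrement and shared partial results. It illustrates the pattern, not our App Engine code.

```kotlin
import java.util.concurrent.ConcurrentLinkedQueue
import java.util.concurrent.atomic.AtomicInteger
import kotlin.concurrent.thread

// Fork: split the work into batches and record how many are outstanding.
// Join: each batch decrements the counter when done; whoever hits zero is the
// last batch and gets to combine the partial results.
fun <T, R> forkJoin(batches: List<T>, work: (T) -> R, combine: (List<R>) -> Unit) {
    val remaining = AtomicInteger(batches.size)   // stand-in for a memcache counter
    val partials = ConcurrentLinkedQueue<R>()     // stand-in for partial results in shared storage

    val workers = batches.map { batch ->
        thread {
            partials.add(work(batch))
            if (remaining.decrementAndGet() == 0) {
                // This worker is provably the last one to finish, so it owns the combine step.
                combine(partials.toList())
            }
        }
    }
    workers.forEach { it.join() } // only needed so main() waits; queue-based jobs wouldn't
}

fun main() {
    // Toy example: sum numbers in parallel batches, print the total from the last job.
    val batches = listOf(1..100, 101..200, 201..300)
    forkJoin(batches, { range -> range.sum() }) { partialSums ->
        println("combined result: ${partialSums.sum()}") // 45150
    }
}
```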
