While our editor’s-choice Android app and multi-featured iOS app are the faces of Soundwave, there is a lot going on backstage to keep our applications running smoothly. In fact, ‘backstage’ is our in-house name for the suite of custom-built software products that keep the Soundwave show on the road.
Soundwave backstage is a reasonably large system. Developing it with a small team made it imperative that we build software according to a small set of core values.
Keeping everything as simple as possible gives the development team focus to build what our users want. Simple systems are easier to understand. Where complexity is unavoidable in the whole, thinking of that whole as a set of simpler systems makes reasoning about the complexity much easier. This is why the Soundwave back-end has been broken down into a suite of loosely coupled products, each with a single well-defined objective.

Closely related to simplicity is the concept of ‘good enough’. Good enough engineering is pragmatic. The resulting implementation may not be the most efficient along some axis, or even the most cost-effective in pure monetary terms, but it fits the requirements well enough to deliver a quantity of value we are comfortable with. This doesn’t mean we are sloppy — ‘good enough’ is a standard we set for ourselves based on the needs of our users. Rather, it means we can put energy and focus into the things we care about the most, where we believe the energy delivers the most impact. A good example here is the contrast between outsourcing our user and song search facility to ElasticSearch versus rolling our own highly optimised implementation to power the Soundwave map-search feature.
The strategy of quick feedback helps us determine if things are sufficiently simple (that is, not too simple) and good enough for our users. Tactically, automation and continuous integration help us achieve this goal. This is complemented by a metrics-everywhere philosophy. The Soundwave back-end has thousands of unit tests and lots of automated integration tests that allow us to move quickly (but not too quickly) and with confidence (but not arrogance). We consider lots of metrics of differing types and use them as our window into not only the function but the value of each system. Deployment is fully automated, with minutes between a developer check-in and a deployment, allowing us to quickly act upon feedback with real-world change.
Before delving into the products we’ve built, it’s worth digging into some of the functional and non-functional aspects of the Soundwave ecosystem. Soundwave comprises a smart-phone application that allows users to track music they listen to on mobile. Around this feature, we have built a user profile, a Twitter-style social network and real-time messaging. Non-functional aspects can be broken down roughly into the categories of user experience and business intelligence. What we mean by user experience in this context is the availability and responsiveness of the Soundwave application. By business intelligence we mean our ability to internally track which features are popular, vanity metrics such as daily active users, and success metrics such as users retained some time after sign-up.
Serving App Requests
Backstage API is the service that supports the Soundwave applications. With the exception of real-time messaging, the Soundwave apps interact with the Soundwave back-end exclusively through this RESTful API. This application tier is stateless and therefore easy to scale horizontally. An Elastic Load Balancer ensures distribution of traffic across the fleet. There is some reasonably complex application logic built into the application tier. Examples include building an activity stream of plays, gathering stored information around a particular song, and the management of realtime messaging groups.
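To make the activity-stream example concrete, here is a minimal sketch of the kind of logic such an endpoint might run. The names (`Play`, `build`) and the merge-and-sort approach are illustrative assumptions, not Soundwave’s actual implementation:

```java
import java.time.Instant;
import java.util.Collection;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;

// Illustrative sketch: merging plays from followed users into a
// reverse-chronological activity stream, as a stateless API tier might do
// per request. All names here are hypothetical.
public class ActivityStream {
    record Play(String userId, String songId, Instant playedAt) {}

    // Merge each followee's plays and return the newest `limit` entries.
    static List<Play> build(Collection<List<Play>> followeePlays, int limit) {
        return followeePlays.stream()
                .flatMap(List::stream)
                .sorted(Comparator.comparing(Play::playedAt).reversed())
                .limit(limit)
                .collect(Collectors.toList());
    }
}
```

Because the method depends only on its inputs, any server in the fleet can answer any request — which is what makes the tier trivially horizontally scalable behind the load balancer.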
From the base Backstage API framework have also evolved two other APIs that are used to ingest data and run separately from the main Soundwave API. The first is the Plays API, which allows us to accept tracked plays from various sources; this facilitates the Chrome browser plugin we have built that brings web play tracking to Soundwave. The Shine API is our endpoint for third-party data ingestion, and is where users of the Soundwave Shine SDK for music data-gathering push information gathered on iOS and Android.
Roadie is the worker framework that facilitates any Soundwave-related work that happens out-of-band. In particular, the Roadie worker fleets are responsible for heavy lifting that happens in the background, such as processing Soundwave plays, retrieving song metadata, synchronising with YouTube and Facebook, and processing data points around users’ usage of our app. This kind of work is typically network-heavy and therefore I/O bound, so a multi-threaded architecture pays off handsomely. In simple producer/consumer style, the message dequeue is separated from the message processing. Furthermore, each thread processes a single unit of work, depending on no other threads for execution. This allows for much better resource utilisation on our worker fleet. For each different type of work, all that’s required is a simple message handler that implements a simple interface and contract; everything else is handled by the framework. Our queue provider is SQS, and there are a couple of caveats worth bearing in mind when working with SQS as a platform. Above all, SQS guarantees ‘at-least-once’ delivery, which means messages can be handled multiple times. It is therefore important that any message processing that results in database writes is idempotent.
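The shape of the framework can be sketched in a few lines. This is a simplified stand-in, not Roadie itself: an in-memory `BlockingQueue` plays the role of SQS, and an in-memory set of seen message IDs stands in for an idempotent database write (e.g. an upsert keyed on message ID):

```java
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the Roadie pattern: dequeuing is separated from processing,
// each worker thread handles one unit of work at a time, and processing is
// made idempotent because SQS may deliver a message more than once.
// Names (MessageHandler, processedIds) are illustrative.
public class Roadie {
    interface MessageHandler { void handle(String messageBody); }

    private final BlockingQueue<String> queue = new LinkedBlockingQueue<>();
    // Dedupe store standing in for an idempotent write to the database.
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final ExecutorService workers;

    Roadie(int threads, MessageHandler handler) {
        workers = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            workers.submit(() -> {
                while (!Thread.currentThread().isInterrupted()) {
                    try {
                        String msg = queue.take();       // dequeue
                        if (processedIds.add(msg)) {     // skip redeliveries
                            handler.handle(msg);         // process one unit of work
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
        }
    }

    void enqueue(String msg) { queue.add(msg); }
    void shutdown() { workers.shutdownNow(); }
}
```

A new type of work then only needs a `MessageHandler` implementation; the threading, dequeuing and dedupe concerns stay in the framework.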
Our primary datastore is a MongoDB cluster. Our current configuration is two shards, each a two-member replica set. Our choice of MongoDB as a data store stems from its very good geo support, its flexibility as a schemaless NoSQL DB, and good software support, with mature Java drivers available as well as Spring DAO integration. Furthermore, nice MongoDB features such as fast replica-set failover, automated backup and (eventually) the WiredTiger storage engine solidify MongoDB as our data store of choice for running the production app. Separate from the primary data store, which contains the application data, we also run a BI replica set. This captures metadata that gives us insight into how the Soundwave apps are used. Media, such as user background images and photographs, is stored directly in S3.
Realtime messaging is built on top of the PubNub platform. With an SLA of 0.25 seconds from any point in the world to any other point in the world, this choice allows us to facilitate realtime conversations between groups of our users. PubNub has quite good client support for mobile on Android and iOS, so we are able to rely on PubNub’s efficient message handling implementation to conserve data and battery on-device. One missing feature from the PubNub set is the ability to ‘fall back’ to cloud notification such as APNS or GCM. Without this feature, Soundwave users who had not opened the app were not aware that there was music, group invitations or chat waiting for them. We built our own product called Vibrato, a flavour of Roadie that listens in realtime to all user chat messages and determines if the user is ‘present’ or ‘not-present’. In the latter case, Vibrato intervenes with a mobile push notification, closing the communication loop between users who are available to chat in realtime and those who are not. Unlike Roadie, Vibrato is stateful and scales horizontally by means of partition. That is to say, instance 1 of Vibrato is responsible for some segment of user chats, instance 2 for some other segment, and so forth.
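Partitioned scaling of this kind can be sketched as follows. The hash-modulo scheme below is a common, illustrative way to assign each user’s chats to exactly one instance; it is an assumption about the approach, not Vibrato’s actual code:

```java
// Sketch of how a stateful service like Vibrato can scale horizontally by
// partition: each instance owns a stable slice of the user space, assigned
// here by hashing the user ID. All names are hypothetical.
public class ChatPartitioner {
    private final int instanceCount;

    ChatPartitioner(int instanceCount) {
        this.instanceCount = instanceCount;
    }

    // The instance responsible for this user's chat presence tracking.
    int instanceFor(String userId) {
        return Math.floorMod(userId.hashCode(), instanceCount);
    }

    // An instance only processes chat messages for users it owns.
    boolean ownedBy(String userId, int instanceId) {
        return instanceFor(userId) == instanceId;
    }
}
```

Because the assignment is deterministic, every instance can compute ownership locally with no coordination, at the cost of re-partitioning work if the instance count changes.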
At Soundwave, ‘business intelligence’ is a broad category of quantitative metrics derived from how our users use Soundwave. In the vanity class of metrics we extract daily and monthly active users. With regard to feature usage, we collect data around things like how many messages are sent between Soundwave users on a daily basis. Going somewhat deeper, we store metrics on counts of users retained on Soundwave within 14 days of signup, and also track cohorts of users through their various stages of life on the app. I’ve already mentioned our BI database, which works in tandem with a product we creatively call metrics engine to run daily digests. Metrics engine is a collective term for the up-front data digestion we do in our operations pipeline, the report-generation engine we have written in Ruby, and the various map/reduce jobs that we run against our production and BI clusters on a nightly basis.
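As a toy illustration of one of these metrics, here is the 14-day retention calculation in miniature. The data model (a signup time plus an optional first-return time) and names are hypothetical simplifications — in practice this kind of digest runs as a nightly map/reduce job against the BI cluster:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Optional;

// Illustrative sketch: the share of users retained within N days of signup.
// Field names are hypothetical.
public class RetentionMetric {
    record User(Instant signedUpAt, Optional<Instant> firstReturnAt) {}

    // Fraction of users who came back within `days` days of signing up.
    static double retainedWithin(List<User> users, int days) {
        if (users.isEmpty()) return 0.0;
        long retained = users.stream()
                .filter(u -> u.firstReturnAt()
                        .map(r -> Duration.between(u.signedUpAt(), r).toDays() <= days)
                        .orElse(false))
                .count();
        return (double) retained / users.size();
    }
}
```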
We run everything on AWS. Backstage API runs on three m3.medium instances, the Shine API on six, and the Chrome API on two. The different flavours of Roadie process share four m3.medium machines. Vibrato runs on a single m3.large machine. Our MongoDB cluster is our beefiest set-up, with an m3.2xlarge primary on each of two shards, each backed by a redundant m3.2xlarge, plus three m3.small config servers that also act as arbiters. Our BI replica set is two m3.2xlarge instances. This runs the core of the Soundwave back-end application suite. Beyond that, we lean on other AWS products for search and DNS, and PubNub is our realtime messaging platform of choice.
Each individual fleet and tier is horizontally and vertically scalable. We’ve had some success scaling vertically with high-memory instances on MongoDB, as well as horizontally by adding shards. Since our API and Roadie fleets are stateless, adding capacity there is also quite trivial. We can add extra, or beefier, machines to our fleets within minutes using a set of provisioning and deployment scripts that interface with the AWS CLI.
Our primary concerns with infrastructure are availability and responsiveness. From an availability perspective, we distribute all fleets evenly across three availability zones in each region. This way, a problematic zone results in impaired performance rather than a loss of application function. Where possible, we also ensure our fleets have some built-in redundancy. Our Backstage fleet is not heavily loaded, but the minimum number of servers required to place a box in each availability zone is three, so that is what we run.
The secondary MongoDB servers sit largely idle, waiting to take over should the primary go down. Each flavour of Roadie instance is configured to run on at least two of the Roadie servers. Overall, this costs significantly more in hardware and is definitely a more complex setup, but it is a large part of the reason Soundwave ‘down-time’ over three years is measured in minutes. With regard to responsiveness and the ability to handle load, configuring for redundancy and horizontal scalability also eases the operations burden, particularly in times of high load or high signup rates.