Written by Valentin Agachi, Justin Vanderhooft, Teddy Martin
Here at Plex we’re passionate about all things media. In recent years we’ve built a variety of cloud services for streaming content. Our most popular streaming feature is the Movies and TV Shows service we launched in late 2019, but we’ve also built cloud services that let you watch free live TV, stream your favorite tunes from Tidal, and provide rich metadata across media types for our Plex Media Server users. More recently we’ve been developing watch.plex.tv as a one-stop place to find and explore everything you want to know about your favorite media.
As you might imagine there’s a lot of data to store behind the scenes to power these experiences. Our database of choice has been MongoDB, because of its performance and flexibility, which has allowed us to iterate quickly.
Our cloud services currently handle around 500M requests on a normal weekday, with even higher volumes on weekends, when folks naturally spend more time on entertainment.
We’re currently using a MongoDB replica set cluster with data spread across 30 collections, totaling more than 1TB. Our largest collection stores over 300M documents. During peak hours, the cluster handles about 30K reads per minute and just shy of 3K writes per minute.
Pains with Mongoose
Mongoose is by far the most popular and common way to interact with MongoDB in Node.js. It is certainly the best starter library for new MongoDB users.
Naturally, we picked this library when we started developing our cloud services at Plex. It has served us well for a while. However, over time, we’ve run into issues with it.
Because of the scale at which we operate our services, and because we try to be mindful not to waste money on compute nodes, we started very early on to use only lean objects from Mongoose model operations. We never used full Mongoose model instances, due to memory and performance concerns.
Furthermore, the validators defined on a Mongoose schema are only applied to update operations (such as `updateMany`) if they are enabled via an option passed to those methods (`runValidators`), which conveniently defaults to `false`.
While this annoying artifact can be worked around, another case of data manipulation cannot: bulk writes. In our services we often need to perform several operations at once, and we’ve found that batching them through `bulkWrite` is much more efficient than issuing individual operations. However, Mongoose does not support running the schema validators in bulk write operations.
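For illustration, the driver’s `bulkWrite` takes an array of mixed write operations and sends them to the server in a single round trip; the documents and field names below are hypothetical:

```typescript
// Operations in the shape accepted by the MongoDB driver's collection.bulkWrite().
// The documents and field names here are hypothetical.
const operations = [
  { insertOne: { document: { firstName: 'Ada', lastName: 'Lovelace' } } },
  {
    updateOne: {
      filter: { firstName: 'Alan' },
      update: { $set: { lastName: 'Turing' } },
      upsert: true,
    },
  },
  { deleteOne: { filter: { firstName: 'Charles' } } },
];

// With a connected driver this would be one round trip instead of three:
// await collection.bulkWrite(operations);
```

Batching like this amortizes the network overhead across all the writes, which is exactly why losing schema validation on `bulkWrite` hurt.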
Due to all these issues, a high-traffic service in a fast-moving startup will inevitably end up with inconsistent data in its MongoDB collections. That is where we found ourselves within a short period of time.
TypeScript was promising because it’s a perfect tool for tracking down many of the bugs that were causing the inconsistent-data issues mentioned earlier. However, it can only do so if you have a correct TypeScript interface definition for each model.
We looked into adopting other npm packages that augment Mongoose and make it more TS-friendly, but we didn’t like the options and tradeoffs involved with those solutions.
We already had the schema definitions written for Mongoose, so we wrote a conversion script that generated a TypeScript interface from a Mongoose schema, and this worked fine for a while. However, when generating these TypeScript interfaces we had to account for fields that could be populated with data from other collections (`ref` in Mongoose). This introduced a lot of conditional code in our code base, due to types like:
```
item: ObjectId | ItemDocument
```
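Every read of such a field needs a runtime check before the document’s properties can be used safely. A minimal sketch, with a hypothetical `ItemDocument` shape and guard:

```typescript
// Minimal stand-ins for the two sides of the union; in a real code base
// ObjectId comes from the MongoDB driver and ItemDocument from the models.
type ObjectIdLike = { toHexString(): string };

interface ItemDocument {
  _id: ObjectIdLike;
  title: string;
}

// A type guard like this is required everywhere a populated field is consumed.
function isPopulated(value: ObjectIdLike | ItemDocument): value is ItemDocument {
  return (value as ItemDocument).title !== undefined;
}

function formatItem(item: ObjectIdLike | ItemDocument): string {
  // Without the guard, TypeScript (rightly) refuses to access item.title.
  return isPopulated(item) ? item.title : item.toHexString();
}
```

Multiply this pattern across every `ref` field and every call site, and the conditional noise adds up quickly.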
At some point last year, we took a step back and looked at how we were accessing our MongoDB data. After all the issues presented above, we realized that we weren’t really using any Mongoose features any more; we were using it merely as middleware between our app and the database.
This led us to build Papr, with only the features we were actually using in our services.
Papr is a lightweight library built around the MongoDB Node.js driver, written in TypeScript. It supercharges your application’s relationship to the MongoDB driver, providing strong validation of your data via JSON schema validation and type safety with built-in TypeScript types.
We wanted an easy migration path for our 300K LOC repository, so we kept the public API very close to Mongoose’s: a Papr model exposes most of the familiar public methods, such as `find`, `findOne`, `insertOne`, `insertMany`, `updateOne`, `updateMany`, `deleteOne`, `deleteMany` and `bulkWrite`.
All these methods are very thin wrappers around the native MongoDB driver methods. We only added a bit of code to support default values and timestamp attributes.
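Conceptually, the wrapper only fills in missing attributes before handing the document to the driver. A hypothetical sketch (not Papr’s actual code):

```typescript
// Hypothetical sketch of what a thin insert wrapper does before calling the
// driver: apply schema defaults and stamp createdAt/updatedAt.
function prepareInsert<T extends Record<string, unknown>>(
  doc: T,
  defaults: Record<string, unknown>
) {
  const now = new Date();
  // Explicit values in doc win over defaults; timestamps are always stamped.
  return { ...defaults, ...doc, createdAt: now, updatedAt: now };
}

const prepared = prepareInsert({ firstName: 'Ada' }, { active: true });
// prepared now carries the default `active` plus createdAt/updatedAt,
// and would be passed straight to collection.insertOne().
```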
The schema definition for a Papr model is different from Mongoose’s though. This is because we kill two birds with one stone: from a single schema we derive a TypeScript interface for static checking, and at runtime we also generate a JSON schema validator for the MongoDB server. The inspiration for this schema trick came from the ts-mongoose package.
Let’s take a look at how a Papr model is defined, and at the TypeScript types its schema definition produces:
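As a simplified sketch of that schema-to-types trick (not Papr’s actual implementation), each runtime type descriptor can carry a phantom TypeScript type, so a single schema object yields both the static interface and the input for the runtime validator:

```typescript
// Simplified sketch of the schema-to-types trick (not Papr's real code).
// Each descriptor exists at runtime (for building a JSON schema later) and
// carries a phantom __type property used only for static inference.
const types = {
  number: () => ({ bsonType: 'number' } as { bsonType: 'number'; __type?: number }),
  string: () => ({ bsonType: 'string' } as { bsonType: 'string'; __type?: string }),
};

type TypeOf<D> = D extends { __type?: infer T } ? NonNullable<T> : never;

const userSchema = {
  age: types.number(),
  firstName: types.string(),
  lastName: types.string(),
};

// Derived statically from the very same object that is used at runtime.
type UserDocument = { [K in keyof typeof userSchema]: TypeOf<(typeof userSchema)[K]> };

const user: UserDocument = { age: 42, firstName: 'Ada', lastName: 'Lovelace' };
```

Because the schema is a plain object, the same `userSchema` value can be walked at runtime to emit the MongoDB validator, while `UserDocument` gives the compiler full knowledge of every field.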
We also enhanced the TypeScript types around projections in queries, so you get static safety when you make queries with reduced attribute projections:
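The idea can be sketched with TypeScript’s built-in `Pick`: the result type narrows to exactly the projected keys, so touching an unprojected field becomes a compile-time error (a simplified illustration, not Papr’s exact signatures):

```typescript
interface UserDocument {
  age: number;
  firstName: string;
  lastName: string;
}

// Simplified illustration of projection-aware typing: the return type is
// narrowed to only the keys that were projected.
function project<T, K extends keyof T>(doc: T, keys: K[]): Pick<T, K> {
  const result = {} as Pick<T, K>;
  for (const key of keys) {
    result[key] = doc[key];
  }
  return result;
}

const user: UserDocument = { age: 42, firstName: 'Ada', lastName: 'Lovelace' };
const partial = project(user, ['firstName']);

// partial.firstName is fine;
// partial.age would be a compile-time error: 'age' was not projected.
```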
Validation with MongoDB JSON Schemas
Having your code statically checked for type errors is great. However, if your TS coverage is not 100%, you still want a safety net to make sure that the data stored in the database is accurate.
Papr uses MongoDB’s JSON schema validation feature to enable validation of document writes at runtime.
From the schema definition, at runtime, we generate a JSON schema specification that we can apply as a validator in the MongoDB collection. The validation of the data will be performed by the MongoDB server on inserts, updates and even bulk writes (remember how Mongoose doesn’t do this for you?).
An even better side-effect of this feature is that the validation is now performed inside the MongoDB server and not your application process, freeing it up to do other concurrent tasks (responding to requests, etc.). When we migrated all our writes to use this type of schema validation, we did not notice any degradation in the MongoDB operation times.
Given the schema we defined in the earlier example for `UserModel`, the following JSON schema validator will be generated and applied to the collection:
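The exact output depends on the schema, but for a user collection the validator would look roughly like this (a sketch following MongoDB’s documented `$jsonSchema` format; Papr’s verbatim output may differ in its details):

```typescript
// Rough shape of a $jsonSchema validator for a user collection, following
// MongoDB's documented validation format. Papr's verbatim output may differ.
const validator = {
  $jsonSchema: {
    bsonType: 'object',
    additionalProperties: false,
    required: ['_id', 'firstName', 'lastName'],
    properties: {
      _id: { bsonType: 'objectId' },
      age: { bsonType: 'number' },
      firstName: { bsonType: 'string' },
      lastName: { bsonType: 'string' },
    },
  },
};

// Applied to an existing collection via the collMod command:
// await db.command({ collMod: 'users', validator });
```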
For more information about the JSON schema validation feature in MongoDB, please refer to the official MongoDB schema validation documentation.
Because we do very little work on top of the MongoDB driver, Papr’s performance is very close to the driver’s, especially considering that the MongoDB server is now validating every write query.
View the benchmark results in our documentation.
We’re excited to introduce the world to Papr. We think it’s a great solution for confidently communicating with MongoDB, in TypeScript, at scale. Feel free to subscribe to our GitHub repository and contribute to the code.