Designing Pastebin

Let’s design a web service to store plain text and get a randomly generated URL to access it.

Crack FAANG · CodeX · Jun 26, 2021

Similar Services: pasted.co, hastebin.com, chopapp.com

What is Pastebin?

Services like Pastebin.com enable users to store plain text or images over the network (typically the Internet) and generate unique URLs to access the uploaded data. Such services are also used to share data over the web quickly, since users only need to pass the URL around to let others see it.

If you haven’t used pastebin.com before, please try creating a new ‘Paste’ there and spend some time going through the different options their service offers. This will help you understand this chapter better.

Requirements and Goals of the System

Our Pastebin service should meet the following requirements:

Functional Requirements

  1. Users should be able to upload or “paste” their data and get a unique URL to access it.
  2. Users will only be able to upload text.
  3. Data and links will automatically expire after a default timespan; users should also be able to specify a custom expiration time.
  4. Users should optionally be able to pick a custom alias for their paste.

Non-Functional Requirements

  1. The system should be highly reliable; any data uploaded should not be lost.
  2. The system should be highly available. This is required because if our service is down, users will not be able to access their Pastes.
  3. Users should be able to access their Pastes in real-time with minimum latency.
  4. Paste links should not be guessable (not predictable).

Extended Requirements

  1. Analytics, e.g., how many times a paste was accessed?
  2. Our service should also be accessible through REST APIs by other services (a rough sketch of such an API follows).
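
As a rough illustration of the second point, here is a minimal sketch of what such an API could look like. The function names, parameters, and the api_dev_key of a registered account are assumptions for this sketch, not a finalized contract.

    from typing import Optional

    def add_paste(api_dev_key: str, paste_data: str,
                  custom_alias: Optional[str] = None,
                  expire_seconds: Optional[int] = None) -> str:
        """Stores the paste and returns the key/URL used to access it."""
        ...

    def get_paste(api_dev_key: str, paste_key: str) -> str:
        """Returns the paste contents, or an error if the key is unknown or expired."""
        ...

    def delete_paste(api_dev_key: str, paste_key: str) -> bool:
        """Deletes the paste; returns True on success."""
        ...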

Some Design Considerations

Pastebin shares some requirements with the URL Shortening service, but there are some additional design considerations we should keep in mind.

What should be the limit on the amount of text a user can paste at a time?

We can limit Pastes to 10 MB each to prevent abuse of the service.

Should we impose size limits on custom URLs?

Since our service supports custom URLs, users can suggest an alias of their choice, although providing one is optional. It is reasonable (and often desirable) to impose a length limit on custom aliases so that we have a consistent URL database.
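
These limits could be enforced at the application layer before anything is written to storage. A minimal sketch, assuming the 10 MB cap above and an illustrative alias-length cap (the exact alias limit is an assumption):

    from typing import Optional

    MAX_PASTE_BYTES = 10 * 1024 * 1024   # 10 MB cap on paste contents (from above)
    MAX_ALIAS_LENGTH = 16                # assumed cap on custom aliases

    def validate_paste(content: str, custom_alias: Optional[str] = None) -> None:
        """Rejects pastes or aliases that exceed the limits discussed above."""
        if len(content.encode("utf-8")) > MAX_PASTE_BYTES:
            raise ValueError("paste exceeds the 10 MB limit")
        if custom_alias is not None and len(custom_alias) > MAX_ALIAS_LENGTH:
            raise ValueError("custom alias is too long")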

Capacity Estimation and Constraints

Our service will be read-heavy; there will be far more read requests than new Paste creations. Therefore, we can assume a 100:1 ratio between reads and writes.

Traffic estimates

Pastebin services are not expected to have traffic similar to Twitter or Facebook. Let’s assume that we get 1 million new pastes added to our system every day. Assuming a 100:1 read:write ratio, this leaves us with 100 million reads per day.

New Pastes per second: 1M / (24 hours * 3600 seconds) ~= 12 pastes/sec

Paste reads per second: 100M / (24 hours * 3600 seconds) ~= 1150 reads/sec

Storage estimates

Users can upload a maximum of 10 MB of data; commonly, Pastebin-like services are used to share source code, configs, or logs. Such texts are not huge, so let’s assume that each paste, on average, contains 100 KB.

At this rate, we will be storing 100 GB of data per day.

1M * 100 KB = 100 GB/day

If we want to store this data for 5 years, we would need a total storage capacity of 180 TB.

100 GB * 365 days * 5 years ~= 180 TB

With 1M pastes every day, we will have approximately 2 billion Pastes in 5 years.

1M * 365 * 5 ~= 2 Billion

We need to generate and store keys to identify these pastes uniquely. If we use a base64-like alphabet of 64 characters ([A-Z, a-z, 0–9, ., -]), we would need six-character strings, since 64⁵ ~= 1.07 billion would not cover 2 billion pastes, while 64⁶ comfortably does:

64⁶ ~= 68.7 billion unique strings

If it takes one byte to store one character, the total size required to store 2 billion keys of six characters each would be:

2 billion * 6 bytes = 12 GB

12 GB is negligible compared to 180 TB. To keep some margin, we will assume a 70% capacity model (meaning we don’t want to use more than 70% of our total storage capacity at any point), which raises our storage needs up to 260 TB.

Bandwidth estimates

For write requests, we expect around 12 new pastes per second, resulting in roughly 1.2 MB of ingress per second.

12 * 100 KB ~= 1.2 MB/s

As for read requests, we expect around 1,150 per second. Therefore, the total data egress (sent to users) will be roughly 115 MB/s.

1150 * 100 KB ~= 115 MB/s

Although total ingress and egress are not big, we should keep these numbers in mind while designing our service.

Memory estimates

We can cache some of the hot pastes that are frequently accessed. Following the 80-20 rule (20% of pastes generate 80% of the traffic), we would like to cache those 20% of pastes.

Since we have 100 M read requests per day, to cache 20% of these requests, we would need:

0.2 * 100 M * 100 KB ~= 2 TB
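
The arithmetic in this whole section can be reproduced with a few lines of Python, using only the assumptions stated above (1M pastes/day, 100:1 read:write ratio, 100 KB average paste size):

    # Back-of-envelope estimates from the assumptions above.
    new_pastes_per_day = 1_000_000
    read_write_ratio = 100
    avg_paste_kb = 100
    seconds_per_day = 24 * 3600

    writes_per_sec = new_pastes_per_day / seconds_per_day              # ~12 pastes/sec
    reads_per_sec = writes_per_sec * read_write_ratio                  # ~1,150 reads/sec
    storage_per_day_gb = new_pastes_per_day * avg_paste_kb / 1e6       # ~100 GB/day
    storage_5y_tb = storage_per_day_gb * 365 * 5 / 1000                # ~180 TB
    ingress_mb_s = writes_per_sec * avg_paste_kb / 1000                # ~1.2 MB/s
    egress_mb_s = reads_per_sec * avg_paste_kb / 1000                  # ~115 MB/s
    cache_tb = 0.2 * new_pastes_per_day * read_write_ratio * avg_paste_kb / 1e9   # ~2 TB

    print(round(writes_per_sec), round(reads_per_sec), round(storage_5y_tb), cache_tb)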

Feel free to plug your own numbers into the snippet above to get estimates specific to your system.

Database Design

A few observations about the nature of the data we are going to store:

  1. We need to store approximately 2 billion records (1 million per day * 365 days * 5 years ~= 2 billion).
  2. Each metadata object we store will be small (less than 100 bytes).
  3. Each paste object can be of medium size (100 KB on average, up to 10 MB).
  4. There are no relationships between records, except that we may want to store which user created which Paste.
  5. Our service is read-heavy (100:1 read:write ratio).

Database Schema

We would need two tables, one for storing information about the Pastes and the other for users’ data.
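
A minimal sketch of what those two records might hold, expressed as Python dataclasses. The field names are illustrative assumptions, not a fixed schema:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class PasteMetadata:
        url_key: str                     # the short key (or custom alias) in the URL
        content_key: str                 # reference to the contents in block/object storage
        creation_date: datetime
        expiration_date: datetime
        user_id: Optional[int] = None    # optional link to the creating user

    @dataclass
    class User:
        user_id: int
        name: str
        email: str
        creation_date: datetime
        last_login: datetime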

High-Level Design

At a high level, we need an application layer that will serve all the read and write requests.

The application layer will talk to a storage layer to store and retrieve data.

We can segregate our storage layer: one database stores metadata related to each paste, its user, etc., while the paste contents themselves are kept in a separate block/object storage system (or another database). This division of data allows us to scale each part individually.

Component Design

Application layer

Our application layer will process all incoming and outgoing requests. The application servers will be talking to the backend data store components to serve the requests.

How to handle a write request?

Upon receiving a write request, our application server will generate a six-letter random string, which would serve as the key of the paste (if the user has not provided a custom key). The application server will then store the contents of the paste and the generated key in the database. After the successful insertion, the server can return the key to the user.

One possible problem here could be that the insertion fails because of a duplicate key. Since we are generating a random key, there is a possibility that the newly generated key could match an existing one. In that case, we should regenerate the key and try again, repeating until the insertion succeeds. If the custom key the user has provided is already present in our database, we should return an error to the user.
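
A minimal sketch of this write path, assuming a plain dict stands in for the database and that a duplicate key shows up as a simple membership check:

    import secrets
    import string
    from typing import Optional

    # 64-character alphabet matching the estimate above: A-Z, a-z, 0-9, '.', '-'
    ALPHABET = string.ascii_letters + string.digits + ".-"
    KEY_LENGTH = 6

    db = {}   # stand-in for the paste store: key -> contents

    def generate_key() -> str:
        # secrets (rather than random) keeps keys hard to guess, per the requirements
        return "".join(secrets.choice(ALPHABET) for _ in range(KEY_LENGTH))

    def create_paste(contents: str, custom_key: Optional[str] = None) -> str:
        """Stores a paste and returns its key, retrying on random-key collisions."""
        if custom_key is not None:
            if custom_key in db:
                raise ValueError("custom key already taken")   # surfaced to the user
            db[custom_key] = contents
            return custom_key
        while True:
            key = generate_key()
            if key not in db:              # duplicate-key check; regenerate on collision
                db[key] = contents
                return key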

Another solution to the above problem could be to run a standalone Key Generation Service (KGS) that generates random six-letter strings beforehand and stores them in a database (let’s call it key-DB).

KGS will make sure all the keys inserted in key-DB are unique. Whenever we want to store a new paste, we will take one of the already generated keys and use it. This approach will make things quite simple and fast since we will not be worrying about duplications or collisions.

KGS can use two tables to store keys: one for keys that are not used yet and one for the used keys. As soon as KGS hands some keys to an application server, it can move them to the used-keys table. KGS can also keep some keys in memory so that it can quickly provide them whenever a server needs them; for simplicity, as soon as KGS loads keys into memory, it can treat them as used and move them to the used-keys table. This ensures each server gets unique keys. If KGS dies before handing out all the keys loaded in memory, those keys are wasted, which we can accept given the vast number of keys we have.
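
A sketch of the KGS idea, with two in-memory sets standing in for the unused/used key tables (a real key-DB would persist both):

    import secrets
    import string

    ALPHABET = string.ascii_letters + string.digits + ".-"

    class KeyGenerationService:
        """Pre-generates unique keys and marks them used as soon as they are handed out."""

        def __init__(self):
            self.unused_keys = set()   # "not used yet" table
            self.used_keys = set()     # "used" table

        def _refill(self, count: int) -> None:
            # Generate keys ahead of time until we have at least `count` unused ones.
            while len(self.unused_keys) < count:
                key = "".join(secrets.choice(ALPHABET) for _ in range(6))
                if key not in self.used_keys:
                    self.unused_keys.add(key)

        def get_keys(self, n: int) -> list:
            """Hands n keys to an application server and immediately marks them used."""
            self._refill(n)
            keys = [self.unused_keys.pop() for _ in range(n)]
            self.used_keys.update(keys)
            return keys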

Isn’t KGS a single point of failure?

Yes, it is. To solve this, we can have a standby replica of KGS, and whenever the primary server dies, it can take over to generate and provide keys.

Can each app server cache some keys from key-DB?

Yes, this can surely speed things up. However, in this case, if an application server dies before consuming all of its keys, those keys are lost. This is acceptable since we have ~68 billion unique six-letter keys, far more than the ~2 billion we require.

How to handle a paste read request?

Upon receiving a read request, the application service layer contacts the datastore. The datastore searches for the key, and if it is found, returns the paste’s contents. Otherwise, an error code is returned.
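
The corresponding read path is a plain lookup; a minimal sketch, again using a dict as a stand-in for the datastore:

    db = {}   # same kind of stand-in paste store as in the write-path sketch

    def read_paste(key: str) -> str:
        """Returns the paste contents for a key; a miss is mapped to an error (e.g. HTTP 404)."""
        contents = db.get(key)
        if contents is None:
            raise KeyError(f"no paste stored under key {key!r}")
        return contents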

Datastore layer

We can divide our datastore layer into two:

Metadata database

We can use a relational database like MySQL or a distributed key-value store such as DynamoDB or Cassandra. A relational store makes the user-to-paste relationship easy to query, while a distributed key-value store scales out more easily; purely in-memory stores like Redis or Memcached are a better fit for the caching layer described below than for durable metadata.

Block storage

We can store the paste contents in a distributed object/block store such as Amazon S3 or HDFS. Whenever we approach full capacity on content storage, we can easily scale out by adding more storage nodes.

Cache

We can cache pastes that are frequently accessed. We can use an off-the-shelf solution like Memcached that stores full-text contents keyed by their respective hashes. Before hitting backend storage, the application servers can quickly check whether the cache has the desired paste.

How much cache should we have?

We can start with 20% of the daily traffic and adjust the number of cache servers based on clients’ usage patterns. As estimated above, we need about 2 TB of memory to cache 20% of the daily read traffic. Since a modern-day server can have 256 GB of memory, we would need around eight such machines, or a few larger servers, to hold all these hot pastes.

Which cache eviction policy would best fit our needs?

When the cache is full and we want to make room for a newer/hotter paste, how do we choose which entry to evict? Least Recently Used (LRU) can be a reasonable policy for our system: we discard the paste that has not been accessed for the longest time. We can use a Linked Hash Map or a similar data structure to store the keys and contents while keeping track of which pastes were accessed recently.
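
Python’s OrderedDict gives the linked-hash-map behavior described above almost for free; a minimal LRU sketch (the capacity is whatever we provision per cache server):

    from collections import OrderedDict

    class LRUCache:
        """Keeps the most recently used pastes; evicts the least recently used on overflow."""

        def __init__(self, capacity: int):
            self.capacity = capacity
            self._store = OrderedDict()      # key -> paste contents, ordered by recency

        def get(self, key):
            if key not in self._store:
                return None                  # cache miss
            self._store.move_to_end(key)     # mark as most recently used
            return self._store[key]

        def put(self, key, contents):
            if key in self._store:
                self._store.move_to_end(key)
            self._store[key] = contents
            if len(self._store) > self.capacity:
                self._store.popitem(last=False)   # evict the least recently used entry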

To further increase the efficiency, we can replicate our caching servers to distribute the load between them.

How can each cache replica be updated?

Whenever there is a cache miss, our servers hit the backend database. Each time this happens, we can update the cache and pass the new entry to all the cache replicas, in a write-through style. Each replica updates its cache by adding the new entry; if a replica already has that entry, it can simply ignore it.
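
A sketch of that miss-handling flow, assuming cache objects with get/put methods (such as the LRUCache above) and a dict standing in for the backend database:

    def read_with_replicas(key, local_cache, all_replicas, datastore):
        """Serves a read; on a cache miss, pushes the fetched entry to every cache replica."""
        contents = local_cache.get(key)
        if contents is not None:
            return contents                  # cache hit
        contents = datastore[key]            # miss: fall back to the backend database
        for replica in all_replicas:         # propagate the new entry to each replica
            replica.put(key, contents)       # replicas that already have it just overwrite
        return contents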

Load Balancer (LB)

We can add a Load balancing layer at three places in our system:

1. Between Clients and Application servers

2. Between Application Servers and Database servers

3. Between Application Servers and Cache servers

Initially, a simple Round Robin approach can be adopted, distributing incoming requests equally among backend servers. This is simple to implement and introduces little overhead. Another benefit of this approach is that if a server is dead, the LB will take it out of rotation and stop sending it traffic. A problem with Round Robin, however, is that it does not consider server load: if a server is overloaded or slow, the LB will keep sending it new requests. To handle this, a more intelligent LB solution can be put in place that periodically queries each backend server about its load and adjusts traffic accordingly.
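
A toy version of such a load balancer: Round Robin over the servers currently marked healthy (the health checks themselves are assumed to happen elsewhere and simply call mark_down/mark_up):

    import itertools

    class RoundRobinBalancer:
        """Cycles through backend servers, skipping any that are marked unhealthy."""

        def __init__(self, servers):
            self.servers = list(servers)
            self.healthy = set(self.servers)
            self._cycle = itertools.cycle(self.servers)

        def mark_down(self, server):
            self.healthy.discard(server)     # taken out of rotation

        def mark_up(self, server):
            self.healthy.add(server)         # put back into rotation

        def next_server(self):
            if not self.healthy:
                raise RuntimeError("no healthy backend servers")
            while True:
                server = next(self._cycle)
                if server in self.healthy:
                    return server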

Purging or DB cleanup

Should entries stick around forever, or should they be purged? If a user-specified expiration time is reached, what should happen to the paste?

If we chose to continuously search for expired pastes and remove them, it would put a lot of pressure on our database.

Instead, we can do a lazy cleanup and remove expired pastes gradually. Our service will ensure that only expired pastes are deleted; some expired pastes may live longer in storage, but they will never be returned to users. A sketch of such a cleanup pass follows the list below.

  • Whenever a user tries to access an expired paste, we can delete the link and return an error to the user.
  • A separate Cleanup service can run periodically to remove expired pastes from our storage and cache. This service should be very lightweight and scheduled to run only when user traffic is expected to be low.
  • We can have a default expiration time for each paste (e.g., two years).
  • After removing an expired paste, we can put the key back in the key-DB to be reused.
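
A minimal sketch of one cleanup pass, assuming dict/set stand-ins for the metadata database, the content store, and the key-DB, with an expires_at timestamp on each metadata entry:

    import time

    def purge_expired(metadata_db: dict, content_store: dict, key_db: set, now=None) -> int:
        """Removes expired pastes, returns their keys to the pool, and reports how many were purged."""
        now = time.time() if now is None else now
        expired = [key for key, meta in metadata_db.items() if meta["expires_at"] <= now]
        for key in expired:
            del metadata_db[key]            # drop the metadata record
            content_store.pop(key, None)    # drop the stored contents
            key_db.add(key)                 # return the key to key-DB for reuse
        return len(expired)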

Security and Permissions

Can users create private pastes or allow a particular set of users to access a paste?

We can store permission levels (public/private) with each paste in the database.

We can also create a separate table to store UserIDs that have permission to see a specific paste. If we are storing our data in a NoSQL key-value or wide-column database, the key for the table storing permissions would be the ‘Hash’ (or the KGS generated ‘key’), and the columns will store the UserIDs of those users that have permissions to see the paste.

If a user does not have permission and tries to access a paste, we can send an error back (HTTP 401 if they are not authenticated, or 403 if they are authenticated but not allowed to view the paste).
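
A sketch of the permission check on the read path. The field names (‘visibility’, ‘owner_id’, ‘allowed_users’) are assumptions for this sketch:

    def can_view(paste_meta: dict, user_id) -> bool:
        """True if the requesting user may see this paste."""
        if paste_meta.get("visibility", "public") == "public":
            return True
        # Private paste: only the owner or explicitly allowed users may view it.
        return user_id == paste_meta.get("owner_id") or user_id in paste_meta.get("allowed_users", set())

    def handle_read(paste_meta: dict, user_id, contents: str):
        if not can_view(paste_meta, user_id):
            return 403, "permission denied"   # 401 instead if the user is not authenticated at all
        return 200, contents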

Author: https://www.linkedin.com/in/shivam-ross/ | https://twitter.com/BeastofBayArea | https://www.instagram.com/sup.its.shiv/
