How to sort posts like a pro: developing a post-comment platform
One of the most common application patterns you’ll see out in the world of consumer software is the post-comment platform. From social networks and forums to news sites and blogs, almost everyone uses it. As a developer, it is crucial to understand how it works and how to implement it yourself.
What is a post-comment platform?
The core idea is simple: users make posts that pile up in some sort of public registry that other users can then browse. Originally inspired by good ol’ paper-based bulletin boards, this design paradigm usually takes things a step further by also allowing users to comment publicly on posts. Thus, the name “post-comment” platform.

Creating such a platform that is successful and serves tonnes of users has many challenges, as I discovered when I recently developed my own public diary-keeping app in the same style. One of the most challenging battles I had was with a feature that seems trivial at first: the ordering of posts. In this article I’ll discuss the problems I faced and how I overcame them. I hope my solutions will help materialize the dreams of startup devs who plan on making the next big thing in consumer communication.
Pagination
For my platform, I had a very simple client-server setup, where my Android app would make a request to the server for a list of posts, and the server would respond appropriately. The simplest way to serve posts is just to send all the stored posts as one big response whenever a request hits the server. The obvious problem is that after a while the database is going to contain thousands (or, if all goes well, millions) of posts, and sending them all will take an immensely long time. It will use a lot of CPU and network time, and most clients have a connection read timeout of only a handful of seconds, i.e. a maximum amount of time a request may take before the client (and the user) gives up. So, this isn’t going to fly, for multiple reasons.
To solve this problem, the concept of pagination was invented. When the client requests posts, the server sends down a capped amount, say 25. Then, if the client wants more, it requests page 2, and so on. This is a great solution to the problem of displaying large amounts of remote data, and a lot of frameworks have very nice ways of implementing it. For instance, MySQL supports it natively in queries through the LIMIT clause: LIMIT 50, 25 tells MySQL to retrieve 25 posts, starting at post #51 (i.e. page 3). On the client side, Android has a class called RecyclerView that lets the user scroll through a pseudo-infinite list: when the user gets near the bottom, the next page of items is pulled from the server and appended to the bottom of the list.
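To make the offset approach concrete, here is a minimal sketch of a paged query. It uses Python with the standard-library sqlite3 module purely for illustration, and the table and column names (posts, id, title, created_at) are assumptions; with MySQL, the LIMIT offset, count form shown above works the same way.

```python
import sqlite3

PAGE_SIZE = 25  # posts per page, matching the example above

def fetch_page(conn: sqlite3.Connection, page: int) -> list:
    """Return one page of posts, newest first (page numbers start at 1)."""
    offset = (page - 1) * PAGE_SIZE
    cur = conn.execute(
        # In MySQL this could equivalently be written as: LIMIT ?, ?
        "SELECT id, title, created_at FROM posts "
        "ORDER BY created_at DESC "
        "LIMIT ? OFFSET ?",
        (PAGE_SIZE, offset),
    )
    return cur.fetchall()
```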
Living Databases
Alright, but keep in mind that when the server sends down the posts, it’s reading from a live database. This means that by the time the second page is requested, a new post might have been created, which would cause duplicates to appear in the list. Imagine the client requests page 1 of the posts. The posts are all ordered chronologically (newest first) and the first 5, posts #1 to #5, are sent.
While the user is reading those first 5, a new post is created and put into the database. The new post takes position #1, so chronologically every existing post shifts down one position.
Now, when the client requests page 2, the posts are ordered again and the second 5 (posts #6 to #10) are sent through. However, post #6 is post #5 from before (because the new post shifted everything down by one), so that specific post is sent through twice, resulting in a duplicate.
A workaround for this would be to have the client look for duplicates and filter them out before displaying to the user. This is fine if you have one or two duplicates per response, but if the database gets busy, the “second page” could be identical to the first page, causing a wasted request. Things could get out of hand very quickly.
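If you do go the workaround route, the client-side filtering itself is trivial. A rough sketch (shown in Python for brevity, though the real client here is the Android app) might look like this:

```python
def merge_page(existing: list[dict], new_page: list[dict]) -> list[dict]:
    """Append a freshly fetched page, dropping any posts the user has already seen."""
    seen_ids = {post["id"] for post in existing}
    return existing + [post for post in new_page if post["id"] not in seen_ids]
```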
The Easy Solution
The simplest way out of this mess is to give every post an ID. In the case of my database, all records have IDs represented as consecutive integers. That means for any post, ID+1 is the next post and ID-1 is the previous one. This is super convenient. So now, instead of sending a page number, the client simply sends the last post ID it has; the server finds that post and sends the 5 posts that come right after it.
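Here is a rough sketch of that ID-based pagination, again in Python with sqlite3 for illustration. It assumes auto-increment IDs and a newest-first feed, so the posts that “come right after” the client’s last post are the ones with smaller IDs.

```python
import sqlite3

PAGE_SIZE = 5  # the example page size used above

def fetch_after(conn: sqlite3.Connection, last_id: int | None) -> list:
    """Return the next page of posts after the last ID the client already has."""
    if last_id is None:
        # First request: no anchor yet, just send the newest posts.
        cur = conn.execute(
            "SELECT id, title, created_at FROM posts "
            "ORDER BY id DESC LIMIT ?",
            (PAGE_SIZE,),
        )
    else:
        # Subsequent requests: continue from where the client left off.
        cur = conn.execute(
            "SELECT id, title, created_at FROM posts "
            "WHERE id < ? ORDER BY id DESC LIMIT ?",
            (last_id, PAGE_SIZE),
        )
    return cur.fetchall()
```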
Custom Sorting
This is in fact exactly how I first implemented pagination while I ran my app’s beta. It was all fine and dandy, but then I got a little ambitious. I didn’t like that posts were ordered chronologically: it meant that high-quality posts could go down the list quickly, pushed down by potentially uninteresting, low-effort posts, simply because those were more recent. For a good user experience, you want hot posts (Facebook calls them “most relevant” comments) to stick around the top and then, as time goes on, gradually move down the list. This is the same contrast as sorting by “new” vs sorting by “hot” on Reddit. Even Facebook and Instagram place more popular posts higher, so users see them sooner. Hence, this calls for a custom sorting solution.
Implementing custom sorting has a few challenges. First off, how do you determine the quality of a post? This might as well be a philosophical question, but the metric I ended up using was the number of unique views. I realized that this would benefit click-baity posts, but for a startup platform this was a risk I was willing to take, at least in the beginning stages. A few endpoints later, I was collecting data on user clicks.
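The click-tracking side can be as simple as one insert per view. A sketch, assuming a views table with a UNIQUE(post_id, user_id) constraint so that repeat clicks by the same user only count once:

```python
import sqlite3
import time

# Assumed schema: views(post_id, user_id, viewed_at) with UNIQUE(post_id, user_id).

def record_view(conn: sqlite3.Connection, post_id: int, user_id: int) -> None:
    """Called by the click-tracking endpoint: store one unique view per user per post."""
    conn.execute(
        "INSERT OR IGNORE INTO views (post_id, user_id, viewed_at) VALUES (?, ?, ?)",
        (post_id, user_id, time.time()),
    )
    conn.commit()
```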
Now, for the actual sorting, I devised a little algorithm that calculates a popularity rating for every post and then has my SQL query sort by that value. For any given post, the popularity rating is calculated as follows:
popularity = views - ((now - creation) / scale) * live
views is the number of times users have clicked on the post. (now - creation) is the difference in seconds between now and the post’s creation, i.e. its age. live is what I like to call the liveliness constant: a number that represents how busy the platform is expected to be (see below). Finally, scale is a constant representing the number of seconds in the timescale we’re working with. In my case it’s hourly, so scale is 60*60.
What I really want to capture with live is how many clicks within the timescale (an hour) are necessary for a post to count as hot (one you want near the top when posts are shown to the user). This should ideally be directly proportional to the size of the user base, for two reasons: the frequency of new posts and the frequency of clicks on posts. The former determines how long you want to keep a high-quality post at the top, and the latter determines how hot the post is. When live is set high, many clicks are necessary for a post to remain at the top of the rankings, which implies you have many users to keep those clicks coming and many new posts competing with it. When live is set low, very few clicks are necessary to keep a post up top, which implies you have few users making clicks and few new posts to compete with it.
For my app, I chose an arbitrary value of 50 (i.e. 50 views per hour makes a hot post). Let’s dig into the expression. On the right-hand side, the number of seconds since the post was created (now - creation) is divided by the scale to give the post’s age in hours. This is then multiplied by my liveliness constant, 50, to give the number of views this post needs in order to count as hot: if the post is 2 hours old, that’s 100 views; if half an hour old, 25; and so on. Finally, this is subtracted from the number of views the post has actually received. What’s left is a number that rather accurately captures how hot the post is: if it’s positive, the post has surpassed the requirement for being hot; if it’s negative, it hasn’t. The popularity rating is effectively the number of views a post is under or over its “hot” requirement. My SQL database can now easily sort by this computed value.
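As a sanity check, here is the same expression written out in Python, using the constants from above. The equivalent arithmetic can be pushed straight into the SQL query’s ORDER BY so the database does the sorting.

```python
import time

LIVE = 50        # liveliness constant: ~50 views per hour makes a post "hot"
SCALE = 60 * 60  # seconds per timescale unit (hourly)

def popularity(views: int, created_at: float, now: float | None = None) -> float:
    """views minus the number of views the post needs at its age to count as hot."""
    now = time.time() if now is None else now
    age_in_hours = (now - created_at) / SCALE
    return views - age_in_hours * LIVE

# A 2-hour-old post with 120 views: 120 - 2 * 50 = 20, i.e. 20 views past "hot".
print(popularity(views=120, created_at=time.time() - 2 * SCALE))  # ~20.0
```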
I just want to mention that live is a constant and must be determined before go-live. This is because live is used very frequently for sorting posts, and if it were dynamic, the sort order would be different each time posts are requested. If you want to swap out live, do it at 4 am.
Pagination Issues, Again
My custom sorting method, which I thought was so creative, actually introduces a new kind of pagination issue. Because I’m sorting by a live metric (clicks per post), the order of the posts is potentially different each time they are retrieved. For example, if a particular post gets a lot of views between retrievals, its position will change, and the second request will contain duplicates.
There is no consensus among computer scientists about how to solve this problem. There isn’t even consensus among users about what they want the behaviour to be in these scenarios: should newly popular posts jump to the front, ahead of where the user is scrolling, or should they mix in with older posts so that the user still sees them? Every social network has its own way of handling this issue. Facebook, for instance, strikes a balance between popular and unpopular posts as you scroll through, to the detriment of chronological ordering.
My way of tackling this problem was to enforce that pagination is consistent and posts don’t jump around as the user scrolls through them. One way of going about this is to make a dump of, say, the top 1000 posts (in the proper order) as soon as the first request comes in. Then, when the second request happens, the server reads from the dump instead of from the live database. The advantages are that the order is preserved across requests and that subsequent calls are much quicker, since no calculations need to be done. The disadvantages are that the first call takes longer, since all the posts have to be duplicated somewhere else, and that it obviously takes a lot of storage space. You can imagine: if 100 clients request posts in the space of a minute, the server’s disk is going to fill up very quickly with duplicate posts. Also (a more minor issue), if the client ever requests posts beyond the 1000th, the server won’t have anything to send.
So, making a physical snapshot of the database at the point of request is not an option. What is an option, however, is regenerating the snapshot whenever a request comes in. That is, when the client requests page 10, query the database so that it selects and orders content as it was when page 1 was requested. All you have to do is make sure the relevant data is timestamped (i.e. posts and their views) and, when calculating popularity, limit your data to what existed before the first request’s timestamp. Here is the full process of my final solution:
- The client makes a request to the server to retrieve posts.
- The server generates a requestID and stores it together with the current timestamp. This is the first request’s timestamp.
- The server gets all posts, orders them, and sends the first page, together with the generated requestID, to the client.
- The client shows the posts to the user.
- The client requests the second page from the server, and sends the requestID with this request.
- The server gets all posts created before the first request’s timestamp (looked up using the requestID sent along). It then orders these posts using the popularity rating from above, but only counts clicks from before the first request’s timestamp. This effectively reproduces the exact same list as when the first request was made.
- From this list the server gets the second page and sends it to the client.
- The client shows the second page of posts to the user.
- Ad infinitum…
Isn’t that cool? Now the client can request pages forever and there will never be a duplicate, even if the second page is requested a year after the first one.
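For completeness, here is a sketch of how the server side of this could look, again in Python with sqlite3 purely for illustration. The schema is an assumption: posts(id, created_at), views(post_id, viewed_at) from the click tracking earlier, and a requests(request_id, requested_at) table for step 2. Post age is also measured against the first request’s timestamp, so the popularity values themselves are frozen for the whole scroll session.

```python
import sqlite3
import time
import uuid

PAGE_SIZE = 5
LIVE = 50
SCALE = 60 * 60

def start_request(conn: sqlite3.Connection) -> str:
    """Step 2: generate a requestID and remember the first request's timestamp."""
    request_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO requests (request_id, requested_at) VALUES (?, ?)",
        (request_id, time.time()),
    )
    conn.commit()
    return request_id

def fetch_page(conn: sqlite3.Connection, request_id: str, page: int) -> list:
    """Steps 3 and 6: order posts as the world looked at the first request's timestamp."""
    (ts,) = conn.execute(
        "SELECT requested_at FROM requests WHERE request_id = ?", (request_id,)
    ).fetchone()
    cur = conn.execute(
        """
        SELECT p.id,
               COUNT(v.post_id) - ((? - p.created_at) / ?) * ? AS popularity
        FROM posts p
        LEFT JOIN views v
               ON v.post_id = p.id AND v.viewed_at < ?  -- only clicks before the first request
        WHERE p.created_at < ?                          -- only posts before the first request
        GROUP BY p.id
        ORDER BY popularity DESC
        LIMIT ? OFFSET ?
        """,
        (ts, SCALE, LIVE, ts, ts, PAGE_SIZE, (page - 1) * PAGE_SIZE),
    )
    return cur.fetchall()
```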
Some considerations
This is all great, but at some point you want the client to receive new posts, too. Some services, like Reddit, use funky algorithms to add new posts later on in the list, for instance on page 10. This allows the user to just keep scrolling forever, and new content will pop up as she goes. Really cool, but a bit outside the scope of my app. For me it is fine to simply dispatch a new requestID whenever the first page is requested. This essentially forces a refresh every time the user logs in and requests page 1.
Another thing you might want to think about is how to handle the liveliness constant. I’d really like it to scale together with my user base but, as I previously discussed, it can’t be dynamic. You could make it semi-dynamic by having a script recalculate it weekly and update it when the database is not busy. Alternatively, since each client gets a different list of posts anyway, you could calculate the constant when the requestID is dispatched and store it together with the timestamp. This allows the sorting query to work with the liveliness at the point of the first request instead of the current liveliness. Taking this one step further, you could even calculate live on the fly as you’re ordering the posts. The issue there is that it takes a lot of unnecessary CPU to execute that math for each and every post. Also, if a user registers in the middle of the sorting process (which could happen if you’re dealing with millions of posts), the order will be incorrect, because the liveliness would have changed midway through sorting.
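As a sketch of that second option: when dispatching the requestID (step 2 above), the current liveliness value could simply be stored next to the timestamp and read back for every subsequent page, so the sort always uses the value that was in effect at the first request. This extends the start_request sketch from earlier, and the extra column name is, again, an assumption.

```python
import sqlite3
import time
import uuid

def start_request(conn: sqlite3.Connection, current_live: float) -> str:
    """Dispatch a requestID, freezing both the timestamp and the liveliness constant."""
    request_id = str(uuid.uuid4())
    conn.execute(
        "INSERT INTO requests (request_id, requested_at, live) VALUES (?, ?, ?)",
        (request_id, time.time(), current_live),
    )
    conn.commit()
    return request_id
```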
But these are all problems for another day. The custom sorting, pagination, and liveliness database issues I have described here are, in my view, handled well enough for any post-comment platform startup.
Authored by J. George Rautenbach. Computer Science & Philosophy student at UCT.
George Linkedin profile: https://www.linkedin.com/in/jans-rautenbach/
George Github profile: https://github.com/koellewe
Contact George at : george@rauten.co.za
Edited and reviewed by Willie Macharia, UCT Developers Society Chairperson, ngangawillie84@gmail.com
Email us at uctdevsoc@myuct.ac.za