How we moved a half million user site from the Stone Age to the present with a rebuild
Over the past half year, I’ve been working with Kevin Kimball on rebuilding Bubblews.com. In this post, I wanted to share more about what we did and why it was necessary.
Bubblews is a social blogging platform that shares advertisement revenue with users. The site has right around half a million user accounts and 10 million posts of 400 characters or greater. It’s also very social — there are around 250 million comments on the site and 100 million connections between users (a user following another user). Advertising revenue is shared based on actions that occur on the site: liking, commenting, and viewing a post.
I joined the company in April of 2014 as the second engineer. The site was getting a facelift and other changes ahead of a big relaunch that July. While my initial focus was supposed to be front-end, I went fullstack (which was my background) soon after to be able to help us hit the deadline. After a lot of sleepless nights and cursing at the nasty PHP code the previous developer had written, we finally launched the new version of the site. I also became CTO.
After the relaunch, the site was getting a sizable amount of traffic and attention:
- Top 1,500 site worldwide according to Alexa rankings. Top 100 site in countries like India and the Philippines.
- News coverage in Inc., Fast Company, New York Times, Fox News, Associated Press, USA Today, and numerous other news organizations
- Listed as one of the Top 10 Social Networks on the Rise by Mashable, alongside Snapchat, Tinder, Medium, and Vine.
As the site kept growing, we realized a few months after launch that it could not keep scaling on the existing codebase, so we started on a rebuild at the end of last year.
Now let’s deep dive into the different parts of the site.
To put it simply, the code on the old site (the one that launched last summer) that we inherited was disgusting. The old site was written in PHP and used an outdated version of the framework Kohana. Any sense of using MVC had been thrown out the window, and finding where something happened often meant recursively grepping through the codebase.
There were bugs throughout. Best practices were ignored. Even worse, there were gaping security holes like XSS vulnerabilities and SQL injection opportunities. At one point, we noticed that there was no permission check on deleting comments, so someone could have deleted every comment on the site if they wanted to. Flagging of comments and posts was done through GET requests, so search engines crawling the site would consistently flag posts unintentionally and even index the flag pages in search results.
I could go on and on with the reasons it was awful, but I’ll resist. The biggest issue that we ended up having with it was that it was incredibly slow and painful to build on. Adding a new feature or making a change was dreadful.
This became a huge issue with the large growth that occurred last summer because we couldn’t move fast enough to do the changes we needed to do. As a site that shares ad revenue, we inevitably attracted users that came with bad intentions and aimed to make money through any means necessary, which meant bot scripts, rampant plagiarism, “like” circles, DDOS-type spikes where people would buy traffic to boost their scores, and anything else people could try to cheat the system. We were always trying to play catchup instead of being able to preemptively build in protections.
In the fall of last year, we realized that there was no way that the site could grow in the future without investing time and resources into rebuilding the site with a decent code-base.
After briefly considering sticking with PHP and using Laravel, we made the decision to use Ruby and Ruby on Rails for a lot of the same reasons everyone else does: it’s fun to use, both of us were already familiar with it, there are a ton of existing libraries that would be useful, and we could build out a new site quickly.
So off we went with building a new site with the first commit in November of last year.
One of the goals with the rebuild was streamlining the site so that developers could work with any part of the site. When I came on board, there were two other technical members on the team — one was more of a devops type guy that was contracted to handle the servers and the other was originally hired to handle the back-end, but only handled building and running the web services for notifications and search.
Kevin Kimball joined the team a few months after the launch last year as a fullstack developer and has done a tremendous job (I knew Kevin from North Carolina). Kevin and I were the only developers who worked on creating the new version of the site.
Because the team and site were nowhere near large enough for it to make sense to have team members who focused solely on one area, a goal with the code and infrastructure on the rebuild was to set it up so that general fullstack developers could work on any part of it.
We were also previously locked into sub-optimal systems that we wanted to get rid of but couldn’t. The external web services for notifications and search were a mess: they consistently had errors, were expensive to run, and didn’t work well, but they were so intertwined with the site that we couldn’t move off them. For the servers, a contract was signed before Kevin or I joined that locked us into a setup that was both expensive and forced us to always go through the contractor to make any changes to server config, which was frustrating, to say the least.
Starting from the beginning, we knew we wanted to go PaaS and use Heroku to host the website. I’m fine with provisioning boxes and other server setup, but I don’t particularly enjoy it and I’d much rather spend my time writing code. Heroku has been painless and a pleasure to use. We haven’t had any downtime yet due to Heroku and the only issue we’ve had so far was git deploys being turned off temporarily.
Previously the site was hosted on six beefy dedicated servers located in Michigan. We were locked into an expensive contract, so we couldn’t scale up or down as traffic fluctuated. We kept one of the servers out of rotation as a pseudo-staging server. As for costs, we’re currently spending less than half on Heroku of what we were spending on the dedicated servers.
The contractor we were forced to work with was also not someone we wanted to keep dealing with. It would take forever to get responses and he would not give us full access to our servers, which made it incredibly difficult to do the things we needed to do. We were delayed by over a month at one point waiting for his response to something, which is an eternity in the startup world.
The previous site also did not have any system for background jobs set up. We now have multiple queues and use delayed_job to push image processing, feed generation, sending of notifications, and other slower tasks into them.
Images are stored on S3. We have about 2TB worth of images stored on it currently, yet it costs less than $100 a month. Go AWS.
With the switch to Heroku, we also made the choice to move from MySQL to PostgreSQL to take full advantage of everything Heroku has built in. There are also the other nice features of Postgres that were appealing: concurrent indexing, ability to add/remove columns without table locking, and additional data types. For converting from MySQL to Postgres we found that py-mysql2pgsql worked the best.
After the conversion, we had to run a ton of migrations to get the database ready for the new codebase and database type. There were actually 90 separate migration files that needed to be run before switching from the old site to the new site.
My favorite migration was a remove_column that cleared out 40GB, a quarter of the size of the database, due to us changing how the HTML in the post view was rendered.
With the switch to Heroku, we also started running a follower database for realtime replication so we’d have a hot database on standby if it is ever needed. We also use it for some read queries, for example sitemap generation and analytical queries. Previously we’d have to run some longer queries on the prod database, which slowed the site down for users. I remember at one point having Terminal open on one screen and New Relic open on the other, watching how the queries were doing on the database as we were pulling stats for a potential investor right before a meeting.
Oh yeah, in addition to the follower database, we also now have regular backups. Previously replication was never set up on the database by the contractor and backups required taking down the site for around 3 hours, so they were infrequent. It was super risky and we are so glad that we no longer have to worry about it. We sleep a lot better now.
The old site had essentially no caching. There were only two spots that I can remember where Redis was used for caching: follower counts on user profiles and content on post pages. With so little caching, the site was hitting the database non-stop with a large number of queries when it wasn’t needed.
With the rebuild, essentially everything that can be cached now is. There is fragment caching throughout and queries that were still slower than we would have liked after optimization were kept in Redis.
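As a rough illustration of the "compute once, then serve from the cache" pattern described above, here is a minimal sketch. It is not the site's code: the in-memory `TTLCache` class stands in for Redis, and the key name, the 300 second expiry, and `slow_follower_count` are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, Tuple


class TTLCache:
    """In-memory stand-in for a Redis-style cache with per-key expiry."""

    def __init__(self) -> None:
        # key -> (expiry timestamp, cached value)
        self._store: Dict[str, Tuple[float, Any]] = {}

    def fetch(self, key: str, ttl_seconds: float, compute: Callable[[], Any]) -> Any:
        """Return the cached value, or compute and store it on a miss."""
        now = time.monotonic()
        entry = self._store.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cache hit: skip the expensive query
        value = compute()  # cache miss: run the slow query once
        self._store[key] = (now + ttl_seconds, value)
        return value


calls = 0


def slow_follower_count() -> int:
    """Hypothetical expensive query; counts how often it actually runs."""
    global calls
    calls += 1
    return 1234


cache = TTLCache()
first = cache.fetch("user:42:followers", 300, slow_follower_count)
second = cache.fetch("user:42:followers", 300, slow_follower_count)
```

The second `fetch` is served from the cache, so the slow query only runs once until the entry expires.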
Search on the old site used Elasticsearch hosted on AWS EC2. To put it bluntly, the search sucked. It had issues with uptime, it wasn’t particularly quick, the results were often poor, and it was fairly expensive to run. With the rebuild, we knew we wanted to either ditch the old system and put together something decent or look to using a 3rd party service.
We’d heard good things about Algolia and had good luck with it during development and with our other company Sweeble, so we ultimately ended up switching over search to it. Algolia’s Ruby gem makes it so that it can be added to a Rails project in a few minutes. Indexing 5 million records took around half a day with multiple worker dynos running.
There are a lot of things we like about Algolia: high uptime, super quick search results, easy customization, a powerful analytics dashboard, and great support. We have around 5 million records indexed in it right now and the average response time was under 60ms for 90% of the records during the last 24 hours and index build time under 60s. Our users are also super pleased that it works much better than the previous search.
The old site used an external web service, built by another developer on Elasticsearch and Express/NodeJS, to build notification feeds for users. Initially this was super buggy and provided very limited info, but it eventually got to the point where it was stable enough to use without taking down the site.
In our quest to simplify things, find a better setup, and eliminate the costs associated with maintaining the other system, we looked to using a 3rd party service that would allow us to devote development time elsewhere.
We send a few million notifications through it each month and it’s worked great so far.
Every user has their own personalized feed and it’s common for users to follow thousands of other users. To make things as speedy as possible, we switched over to a fan out on write feed strategy, store the feed in Redis, and query that when the user visits their feed.
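To make the fan-out-on-write idea concrete, here is a minimal sketch under stated assumptions. Plain Python dictionaries and deques stand in for Redis, and the function names and the `FEED_LIMIT` cap are hypothetical rather than taken from the site. The point is that the expensive work happens when a post is published, so reading a feed is a single cheap lookup.

```python
from collections import defaultdict, deque
from typing import DefaultDict, Deque, List, Set

FEED_LIMIT = 3  # hypothetical cap on stored feed entries per user

# author -> users who follow that author
followers: DefaultDict[str, Set[str]] = defaultdict(set)
# user -> newest-first post ids (stand-in for a per-user list in Redis)
feeds: DefaultDict[str, Deque[int]] = defaultdict(lambda: deque(maxlen=FEED_LIMIT))


def follow(follower: str, author: str) -> None:
    followers[author].add(follower)


def publish(author: str, post_id: int) -> None:
    """Fan out on write: push the new post into every follower's feed."""
    for follower in followers[author]:
        # appendleft keeps the newest post first; maxlen drops the oldest
        feeds[follower].appendleft(post_id)


def read_feed(user: str) -> List[int]:
    """Reading is a lookup of the precomputed feed, with no joins."""
    return list(feeds[user])


follow("bob", "alice")
follow("carol", "alice")
for post_id in (1, 2, 3, 4):
    publish("alice", post_id)
```

The trade-off is more writes per post (one per follower) in exchange for very fast reads, which suits a site where feeds are viewed far more often than posts are written.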
For our users, view count tracking is important for them to know how many people are reading their posts. For us, it’s important to know so we can accurately share ad revenue.
Previously, view tracking was very easily manipulated which meant the company was paying out money to people who were falsely boosting their view counts with traffic exchange sites. Views were also tracked by inserting a record with the post id and IP address, which meant that before every view was tracked, there was a search of a table with hundreds of millions of records to check for uniqueness.
This time around, we switched view tracking to Redis for speed and accuracy. When a view occurs, we use Redis’s SADD to add an MD5 hash of the browser fingerprint to a set for that post to track uniqueness. Every 10 minutes, a Rake task loops through the posts in Redis and updates the column on the post record. Every 24 hours, we loop through the posts and clear the set for each post to keep the amount of memory used down.
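Here is a minimal sketch of that flow, with Python sets standing in for Redis sets. Redis's `SADD` reports whether the member was newly added, and `record_view` mirrors that by returning `True` only for a first-time view. The `pending` counter and the way `flush_counts` adds to the stored total are assumptions made for illustration, since the details of how the column is updated are not described above.

```python
import hashlib
from collections import defaultdict
from typing import DefaultDict, Set

# post_id -> hashed fingerprints seen in the current window (Redis set stand-in)
seen: DefaultDict[int, Set[str]] = defaultdict(set)
# post_id -> unique views not yet written to the post record (assumption)
pending: DefaultDict[int, int] = defaultdict(int)
# post_id -> stored view count (stand-in for the column on the post record)
view_counts: DefaultDict[int, int] = defaultdict(int)


def record_view(post_id: int, fingerprint: str) -> bool:
    """Count a view only if this fingerprint has not been seen for the post."""
    digest = hashlib.md5(fingerprint.encode("utf-8")).hexdigest()
    if digest in seen[post_id]:
        return False  # duplicate view, like SADD returning 0
    seen[post_id].add(digest)
    pending[post_id] += 1
    return True


def flush_counts() -> None:
    """Periodic job (every 10 minutes in the post): persist pending views."""
    for post_id, count in pending.items():
        view_counts[post_id] += count
    pending.clear()


def reset_daily() -> None:
    """Daily job: drop the uniqueness sets to keep memory use down."""
    seen.clear()


results = [record_view(7, "fp-a"), record_view(7, "fp-a"), record_view(7, "fp-b")]
flush_counts()
```

Checking set membership takes constant time, which avoids the old approach of searching a table with hundreds of millions of rows before every insert.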
As I mentioned above, a major issue for us with the old site was there was very little in place to prevent people abusing the site trying to make money. This time around, we built in additional protections to both catch and prevent spam and abuse.
Plagiarism had been rampant on the site. While some of it was people copying from Wikipedia or other external sources and claiming it as their own, a lot of it was people copying posts that already existed on the site.
Previously we stored an MD5 hash of a post’s content to check it for uniqueness against the other posts on the site. This worked for finding exact matches, but a user could change one word and we wouldn’t be able to catch it. With the rebuild, we wanted to detect posts that had very similar content too, so we switched to using Simhash, an algorithm by Moses Charikar that Google has used to detect near-duplicate web pages. Stripping stop words and using Simhash has been very effective at finding duplicate content on the site.
The next issue we needed to tackle was spam. Whenever a post was published, we wanted to know right away the likelihood of it being spam, hide it from being displayed, and flag it for review by our moderators. On the old site, there was no way to do this and we were dependent on our users flagging posts to find them.
We ended up deciding to use a classifier based on Bayes’ theorem and train it on examples of what was spam and what wasn’t. After about a month of training, it has worked very well: only a few percent of what it flags are false positives, which moderators can then clear after review.
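For readers unfamiliar with Simhash, here is a minimal sketch of the idea. It is an illustration rather than the site's implementation: the stop word list is a tiny hypothetical sample, tokens are unweighted single words, and MD5 is used only as a convenient per-token hash. Each token votes on every bit of a 64-bit fingerprint, so similar texts tend to produce fingerprints that differ in only a few bits, which is measured with the Hamming distance.

```python
import hashlib
import re
from typing import List

BITS = 64
# Tiny illustrative stop word list; a real one would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "on", "of", "to", "in", "is", "it"}


def tokens(text: str) -> List[str]:
    """Lowercase, split into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def simhash(text: str) -> int:
    """Build a 64-bit fingerprint where each token votes on every bit."""
    weights = [0] * BITS
    for token in tokens(text):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            weights[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Number of bit positions where the two fingerprints differ."""
    return bin(a ^ b).count("1")


original = simhash("The cat sat on the mat")
padded = simhash("cat sat mat")  # same content once stop words are stripped
```

A new post would then be treated as a likely duplicate when its fingerprint is within some small Hamming distance of an existing one; the right threshold depends on the content and has to be tuned.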
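A common way to apply Bayes' theorem to this problem is a naive Bayes text classifier. The sketch below is one minimal version with add-one smoothing, offered as an assumption about the general technique and not as the classifier the site runs. The class name, the training sentences, and the two labels are all hypothetical.

```python
import math
import re
from collections import Counter
from typing import Dict, List


class NaiveBayesSpamFilter:
    """Minimal two-class naive Bayes classifier with add-one smoothing."""

    LABELS = ("spam", "ham")

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = {label: Counter() for label in self.LABELS}
        self.doc_counts: Dict[str, int] = {label: 0 for label in self.LABELS}

    @staticmethod
    def _words(text: str) -> List[str]:
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text: str, label: str) -> None:
        """Record one labeled example, e.g. a moderator's decision."""
        self.doc_counts[label] += 1
        self.word_counts[label].update(self._words(text))

    def spam_probability(self, text: str) -> float:
        """Return P(spam | text) under the naive independence assumption."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        log_scores: Dict[str, float] = {}
        for label in self.LABELS:
            # Smoothed prior, then one smoothed likelihood term per word.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in self._words(text):
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            log_scores[label] = score
        # Normalize in log space to avoid underflow on long posts.
        top = max(log_scores.values())
        spam = math.exp(log_scores["spam"] - top)
        ham = math.exp(log_scores["ham"] - top)
        return spam / (spam + ham)


spam_filter = NaiveBayesSpamFilter()
spam_filter.train("buy cheap followers now", "spam")
spam_filter.train("cheap traffic buy now", "spam")
spam_filter.train("my trip to the mountains", "ham")
spam_filter.train("recipe for lemon cake", "ham")

likely_spam = spam_filter.spam_probability("buy cheap traffic")
likely_ham = spam_filter.spam_probability("lemon cake recipe")
```

In a setup like the one described, a post scoring above some chosen threshold would be hidden and queued for moderators, and each moderator decision could be fed back in through `train`.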
One of the most requested features on the site has been private messaging, so we finally decided to build it. The chat is built using ReactJS with websockets through Pusher.io. It was incredibly fun to build and we had a working prototype up for testing on the first day.
There are a few other parts of the site that we’ve moved to ReactJS, which we started to fall in love with after using it for some of the web views on Sweeble. We plan on moving a lot more of the front-end to it in the future.
In addition to the ones already mentioned, here are some other 3rd party services we use with the site.
The old site didn’t use any error tracking, which made it an absolute pain to see what was going on in the site and would require digging through logs. It wasn’t uncommon for us to hear from our users about errors before we realized it was going on. Far from an ideal situation. We’re now using Raygun.io so we can instantly see when an error occurs and fix any issues. We get emails and notifications in Slack about any errors.
Slack webhooks are used extensively for keeping track of what’s going on with the site. We have rake tasks scheduled with cron jobs that give us info on the number of background jobs in queues, popular posts on the site, payment info, signup stats, moderator information, and a ton more.
Like virtually everyone, we use New Relic to monitor our setup. It works well and integrates nicely with Heroku.
Logs are dumped into Papertrail to make it easier to search later.
To keep costs as low as possible and take full advantage of Heroku’s ability to easily spin up (and down) servers, we use Adeptscale to adjust the number of dynos when needed.
Pingdom checks that the site is up. If the site goes down, it blows us up with text messages. It’s simple and has worked great.
Emails go out through Mandrill. Previously we’d used Amazon’s SES, but switched to Mandrill so we’d have better access to email analytics and have Mandrill automatically handle unsubscribes for us.
Our DNS is through Cloudflare. We use a lot of the other features of Cloudflare: CDN, caching, IP blocking, DDOS protection, and GeoIP so we can see where users are located.
In June we announced to our users that a new version of the site was coming and scheduled some downtime to make the switch. We took the old site down, converted the database, loaded it into Heroku, and cranked through the migration scripts as quickly as we could. We’d done a few dry runs beforehand, which helped make sure things went smoothly.
Our users were extremely excited and pleased about the new site. We’d mentioned in the months before that a big change would be coming and even those that were very skeptical that the site would be an improvement were pleased.
At the same time as we were working on the rebuild, we also worked on and launched a new company, Sweeble.
For more info, here are two of my posts introducing the new site:
And here are just a few of the posts from our users:
In addition to simply having a better site, we also ended up saving a substantial amount of money. By changing servers, utilizing 3rd party services, and spending time optimizing, the cost of running the site is now around a third of what it was previously. Through the streamlining, we also removed the costs (salary, benefits, etc.) associated with having the third engineer.