Lessons Learned: Serving Code to Half a Million Users
Once upon a time, my friend tweeted an idea about a Buzzfeed-like quiz that could help our fellow Indonesians choose their presidential candidate. You choose an answer to a problem and we show you which candidate agrees with you. Simple.
Long story short, she couldn't build the website by herself because she had to help her family with their political campaign, so I got to join her quiz team. The team consists of the following great people: Asanilta Fahda, Audhina Nur Afifah, Catfish Catfight (Eduard Lazarus), and Fikri Khalqih.
We discussed this for two weeks, but only worked hard on it for about four days before the election. Thankfully, we managed to release it to the public on 15 April 2019, two days before the election.
Suddenly, we went viral
On the first day, about 2,450 unique users accessed our website in the first 2 hours, going up to 13,000 unique users 5 hours after the release. And, boy, the next day I woke up at 9 AM and saw the traffic on my App Engine dashboard jump to 100 requests per second. I hurriedly added Google Analytics to our website and watched the real-time active users counter go up like crazy.
I mean, look at that.
I was excited and worried.
I never thought it would reach that number at all. At this point, my ugly code and bad decisions started to build up and multiply into a monster.
Brief Architecture of the Website
The back-end was written in Python. The main job of this script is just to validate the input, communicate with the database, and create the API output.
The whole system is running on Google Cloud Platform.
Why? Have you ever tried AWS? Searching for something is painful. Excuse me, what the heck are EC2, S3, Redshift, …? The point is, my preference, for now, is based solely on how steep the UI learning curve is and on the product naming. GCP has boring names (Compute Engine, Cloud Storage), but they are easy to understand. That said, I know several AWS products are better, for example AWS's FaaS, Lambda.
The front-end runs on App Engine using F1 instances. The back-end runs on Cloud Functions. Lastly, the database runs on Google Datastore. The main reason I chose this stack is that all of them are billed based on usage. No one using it? $0 billed. Seems like a reasonable combination to me.
Simple, straightforward, and cheap.
The main advantage of this cloud setup is that I don't have to think about the infrastructure at all. All I have to do is code. Thus, faster development and deployment.
In this write-up, I will share what I did wrong when creating the website. Though none of it broke the website as a whole (e.g. made it inaccessible), each mistake did cost me money.
Lesson 1: Compression
The first thing that I noticed to be broken.
It was only 6 hours after the release, so all other billing metrics were still well within the free quota as expected, except one: Outgoing Bandwidth. The number looked surprising, but after I ran the math, it was reasonable.
I realized that the presidential candidate photos were still quite big (around 100KB each at the time). These images are shown to the user every time they reach the quiz's result page.
So, the next thing I did was compress and resize the images. The compression resulted in images of around 30KB, roughly a third of the original size.
What else could I optimize? I remembered that HTTP responses can be gzip-compressed. I found a GitHub wiki page in the NextJS repository on how to enable compression, and I implemented that.
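To get a feel for what gzip does to a text payload, here is a minimal sketch using Python's standard gzip module. The HTML string is made up for illustration and is not the actual PilahPilihPilpres markup; real pages are less repetitive and will compress less than this sample.

```python
import gzip


def gzip_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    compressed = gzip.compress(payload)
    return len(compressed) / len(payload)


# Hypothetical page body: repetitive markup, which gzip handles very well.
html = ("<div class='question'><p>Sample question text</p></div>\n" * 200).encode("utf-8")

ratio = gzip_ratio(html)
print(f"original: {len(html)} bytes, compressed/original: {ratio:.2%}")
```

Note that gzip mostly helps text (HTML, CSS, JavaScript, JSON). Formats such as JPEG are already compressed, which is why resizing the photos had to be a separate step.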
I can't say that it helped a lot, but I saw a dip in the App Engine outgoing bytes around the time I implemented these changes.
Lesson learned: Though your website might seem small, a lot of incoming requests, especially from new users (so, no cache available), could cost you a lot of bandwidth. Even a 10 KB website could cost you 5GB when half a million users access it.
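The arithmetic behind that number is easy to check. The sketch below uses decimal units (1 GB = 1,000,000 KB) and assumes every user downloads the payload exactly once with no caching; the function name is made up for this example, and the per-photo lines assume a single photo per result view, which is a simplification.

```python
def total_egress_gb(payload_kb: float, users: int) -> float:
    """Outgoing traffic in GB when each user downloads payload_kb once."""
    return payload_kb * users / 1_000_000


print(total_egress_gb(10, 500_000))   # 5.0  (the 10 KB website from the lesson)
print(total_egress_gb(100, 500_000))  # 50.0 (one uncompressed 100 KB photo per user)
print(total_egress_gb(30, 500_000))   # 15.0 (the same photo after compression)
```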
Lesson 2: Database
Learning databases in university usually starts with relational databases. I learned how to normalize all the tables to 4NF or something. Those things were hardwired into my brain and became my way of thinking when designing a database. This does not translate well to a key-value database such as Google Datastore. And I realized it too late.
So, I had a quiz database to build. What does the (kind of) relational diagram for the database look like?
In this design, there is an entity called Quiz that has many Question entities. Each Question has an Answer from each User. So, accessing the data for a single user could cost me (Q + A) reads and writes: Q Question entities plus A Answer entities. In this case, that translates into 16 questions plus 16 answers, so 32 reads/writes per user at a minimum instead of just two!
I realized this too late; the damage was already done to the billing account. The better (though I don't know whether it's the best) design that I've come up with stores each quiz as a single entity and each user's answers to that quiz as a single entity.
This way, the database reads/writes go down 16 times because accessing the data is really just accessing a key. To get the questions, the back-end only has to retrieve 1 Quiz entity; then, to get the answers from a user, it only accesses 1 Answer entity (the key consists of the Quiz key and the User key, both already known, thus O(1) retrieval).
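The difference in read counts can be sketched with a tiny in-memory stand-in for a key-value store. This is not Datastore client code: CountingStore, the key tuples, and the identifiers are all hypothetical and only illustrate how the two layouts compare for a 16-question quiz.

```python
class CountingStore:
    """In-memory stand-in for a key-value database that counts reads."""

    def __init__(self):
        self.data = {}
        self.reads = 0

    def put(self, key, value):
        self.data[key] = value

    def get(self, key):
        self.reads += 1
        return self.data[key]


NUM_QUESTIONS = 16
QUIZ_ID, USER_ID = "quiz-1", "user-abc"  # hypothetical identifiers

# Normalized layout: one entity per question and one entity per answer.
normalized = CountingStore()
for i in range(NUM_QUESTIONS):
    normalized.put(("Question", QUIZ_ID, i), {"text": f"Question {i}"})
    normalized.put(("Answer", QUIZ_ID, USER_ID, i), {"choice": "A"})
for i in range(NUM_QUESTIONS):
    normalized.get(("Question", QUIZ_ID, i))
    normalized.get(("Answer", QUIZ_ID, USER_ID, i))

# Denormalized layout: one Quiz entity holding all questions, and one Answer
# entity per (quiz, user) pair whose key is built from values already known.
denormalized = CountingStore()
denormalized.put(("Quiz", QUIZ_ID),
                 {"questions": [f"Question {i}" for i in range(NUM_QUESTIONS)]})
denormalized.put(("Answer", QUIZ_ID, USER_ID), {"choices": ["A"] * NUM_QUESTIONS})
quiz = denormalized.get(("Quiz", QUIZ_ID))
answers = denormalized.get(("Answer", QUIZ_ID, USER_ID))

print(normalized.reads, denormalized.reads)  # 32 2
```

The trade-off is that each entity gets bigger and is rewritten as a whole, which seems acceptable here because a quiz and a user's answer sheet are always read together.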
You might also notice another stupid mistake in the structure: I forgot to include a timestamp. A timestamp is really important when you want to use and analyze the data you've got.
Lesson learned: Key-value and relational databases are different and should be used differently. When designing for a key-value database, always strive for an O(1) read by key. With that in mind, the database design will be more efficient, though the design will vary based on the application.
Lesson 3: Browser Compatibility
So, what makes the PilahPilihPilpres website prone to compatibility issues? The user fingerprinting, I suspect (I still can't prove it; see below for why).
I decided to include a way to keep track of users without requiring them to log in or give any personally identifiable data (such as a name). This way, I could keep track of how many retries a user made. I used fingerprintjs2 to generate the fingerprint.
Why do I suspect fingerprinting is causing the compatibility issues? There are two types of user error that kept being reported to us.
- When you access the PilahPilihPilpres homepage, you are forced to generate a fingerprint before you can access the quiz. For some users, this left them completely stuck on the homepage. (Thanks, @dani_yp, for reporting)
- The question page was not fully protected from users without a fingerprint, so a user without one could access it if they got the URL right. However, even after that was fixed, users could sometimes still finish the quiz without any fingerprint detected in their cookies.
As you can see, both problems are related to the fingerprint, hence my suspicion.
Why is it hard to prove it?
Almost 90% of our users access our website from mobile phones, based on statistics from Google Analytics.
When users access our website from a desktop, debugging a problem based on their OS and browser is easy using a VM. Browsers and OSes on desktop are also less diverse than on mobile phones.
I tried installing the same browser that my friend said he got stuck on. It's a browser from Xiaomi. My phone is also a Xiaomi, but it runs Android One, so that browser wasn't installed by default. After I installed the browser, I accessed the website and everything worked properly. I'm stuck.
If any of you know how to better debug this issue, please let me know.
Lesson learned: To be honest, there's a lot to be fixed here. Here are some.
- Mistake #1: Do not create a problem that doesn't even exist. It starts with the need for fingerprinting, which in this case could be fully replaced by a cookie. At the time I decided to use fingerprinting, I was just afraid that people would delete the cookie and thus generate a new, different identifier. Turns out I was just imagining that the majority of people even know how to do this. By ignoring this problem, I could have just generated an identifier using UUID, put it in a cookie or local storage, and called it a day.
- Mistake #2: Always have a fallback. Leaving users stuck on the home page is the worst thing that could happen on a website. There should be a timeout on the home page (if I still decided to use fingerprinting); after the timeout, a fallback identifier generation should kick in (e.g. UUID or whatever the browser can do on the client side). The back-end should also have a fallback whenever a request doesn't contain any user identifier.
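The back-end half of that fallback could look roughly like the sketch below. It is not the code that ran in production: resolve_user_id and the "uid" cookie name are hypothetical, and the cookies are passed in as a plain dict so the example stays independent of any web framework.

```python
import uuid
from typing import Mapping, Tuple


def resolve_user_id(cookies: Mapping[str, str], cookie_name: str = "uid") -> Tuple[str, bool]:
    """Return (user_id, is_new).

    Uses the identifier sent by the client when there is one; otherwise
    falls back to a freshly generated random UUID instead of failing.
    """
    existing = cookies.get(cookie_name)
    if existing:
        return existing, False
    return str(uuid.uuid4()), True


# A request without any identifier still gets a usable one.
user_id, is_new = resolve_user_id({})
print(user_id, is_new)
```

When is_new is true, the caller would send the identifier back to the client (for example in a cookie) so that later requests reuse it.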
Lesson 4: Keeping Secrets
I wrote the website code while I was on vacation in Bandung, and a little bit during a business trip to Singapore. Although this can't justify my bad code and bad design, it clearly affected them. It also put further pressure on me to finish the code as fast as possible. Hence my lack of a proper development setup, as you can see in the git history.
The main concern with this development setup is also the stupid one: no proper secret management.
I learned it the hard way: ALWAYS HAVE PROPER SECRET MANAGEMENT, even if you never plan to publish your code or put it into production.
One of the media outlets that interviewed Asanilta Fahda wanted me to publish the code so that the fairness of our system could be checked and to ensure there was no foul play behind the scenes when calculating the quiz result. I panicked quite a bit: the reCaptcha secret, the user token salt, and the publicly accessible cloud function endpoint URL were all hardcoded.
Just deleting those secrets and putting them in a proper environment file will not remove them from the git history. At this point, I found out that someone had already created a tool for fixing this stupid mistake. The GitHub help page gives two alternatives: using git-filter-branch (which looked too complicated for me given the time) or using a tool called BFG.
The tool saved me a lot of time. Though in the git history you can still see my sins (the tool only replaces the sensitive data with the word ***REMOVED***). After applying this, I thought: "OK, this is safe enough, I'll push it and make the repo public".
Then along came my best security engineer friend: @visats. He found my hash salt. I forgot to add the compiled Python files (*.pyc) and the __pycache__ folder to the .gitignore file, so they were included in the git repository. He decompiled them into Python scripts and the salt showed up in plain text. Thus, I had to delete the compiled files and the cache folder from the git history as well.
Lesson learned: Never hardcode secrets, whatever the reason, no exceptions. If you did, delete them completely; just deleting them in the latest commit won't be enough. Double-check for compiled files included in git that may contain the secrets. Much better and safer is to reset or change all the secrets after making this kind of mistake.
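One simple way to keep secrets out of the source is to read them from environment variables at runtime. The sketch below is an assumption about how that could look, not the project's actual code: load_secret, hash_user_token, and the USER_TOKEN_SALT variable name are made up for this example, and the salted SHA-256 is only there to show the salt being used without ever appearing in the file.

```python
import hashlib
import os


def load_secret(name: str) -> str:
    """Read a secret from the environment and fail loudly when it is missing."""
    value = os.environ.get(name)
    if not value:
        raise RuntimeError(f"Missing required secret: {name}")
    return value


def hash_user_token(user_id: str, salt: str) -> str:
    """Salted SHA-256 of a user identifier (illustrative scheme only)."""
    return hashlib.sha256((salt + user_id).encode("utf-8")).hexdigest()


# Usage, with the value set outside the repository (shell, CI, or cloud config):
#   salt = load_secret("USER_TOKEN_SALT")
#   token = hash_user_token("some-user-id", salt)
```

Failing at startup when a secret is missing is a deliberate choice here: it is easier to notice than a silent fallback to an empty or default salt. Adding *.pyc and __pycache__/ to .gitignore from the first commit covers the compiled-file leak described above.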
Lesson 5: Website Serving
Though serving on App Engine is far better than using traditional hosting or virtual machines in terms of cost and scalability, it could be improved.
One thing that clearly could be improved in terms of serving is making the website front-end fully static so I could just put the code on some kind of storage / CDN. Thus, no hosting needed. I heard about this idea using AWS S3 from a colleague back then, but I never really tried it out and was afraid that the cost would be higher (seems unlikely, though).
Thanks to Muhammad Mustadi, who told me via Twitter about serving static websites on Netlify and Zeit, which could make the process easier. He also taught me how to make the code static and what I should change in it. Unfortunately, I didn't have time, and doing so also wouldn't decrease the cost significantly now that the website is not as busy as the day before the election. So, for now, I haven't implemented it.
Lesson learned: Making the website static could reduce your cost, with no hosting needed. Though not implemented, this knowledge could be used for future projects. To make a static website (on NextJS), you have to make sure that getInitialProps does not access the context and that all cookies are handled client-side or replaced with the browser's local storage (basically, there shouldn't be any code that has to run on the server side). For further information about this, NextJS already has a good tutorial.
Several more lessons…
Several things could be cached for better response time, such as the question list. Caching the question list in some kind of in-memory database also reduces the need to read the Datastore every time.
Related to my habit of creating problems that don't exist: sometimes, you don't even need a full-fledged back-end. Several things could easily be handled in the front-end, such as the result calculation. That way, the back-end only needs to log the answers chosen, not the result itself.
Cloud Function with Datastore
Both products are great with their pay-as-you-use scheme. But when combined, they aren't really that great. The Function, even though it is in the same network as the Datastore, cannot communicate with it directly. Instead, it has to go out to the internet and call the Datastore from there, which causes high latency and high response time (though it usually only happens on cold start). One of the reported issues can be found here: https://github.com/googleapis/nodejs-datastore/issues/9
I didn't add any analytics tool when I first put the website online, which was a bad decision. An analytics tool (e.g. Google Analytics) could help us know our users better. Though scaling was not our problem (App Engine does it automatically for us), knowing that we reached the intended users is also important.
I am flattered by this one. Some people said the interface is good (though there are some concerns about the text style and color scheme). Turns out, my decisions on this were not bad. People love simplicity. A clean interface and a clear flow might be the main reasons why people love it.
For the UI, I only used Bootstrap v4 with small modifications here and there. The mobile-first design paid off well, as almost 90% of our users access our website from a mobile phone.
You know when you've created a good product…
Whenever I imagine a user, I imagine people who easily get angry with everything about a product and simply go away, never to come back.
But after I built this website, which has good content thanks to the awesome teammates who did the research, my perspective kind of changed.
When you create a good product, whatever error your users face, they will simply come back and try as hard as they can to use the product.
And then, …
the story came to an end as election day passed. The active user count went down as expected, with 1000 active users on election day, 500 the next day, 300, then 20. We felt proud of what we had achieved.
I really would like to thank Indonesians for your support and enthusiasm for our work. To the software engineers who starred the repository, thank you! I know full well that it is bad code, not something to be used as an example, and your appreciation despite how bad it is means a lot to me.
And most importantly, thank you Asanilta Fahda for giving me the opportunity to create something (again, I guess*) for Indonesia.
(*) Can anyone help me develop Hoax Analyzer? ;)
You might also find other design mistakes, bugs, or stupidity in my code on GitHub. I would really love to hear from you so I can learn more and get better. Please message me on any social media or leave a comment. Thanks! :)
Check out my new project at https://kutu.dev