Lessons Learned: Serving Code to Half a Million Users

Once upon a time, my friend tweeted an idea for a Buzzfeed-like quiz that could help our fellow Indonesians choose their presidential candidate. You choose an answer to a problem and we show you which candidate agrees with you. Simple.

Long story short, she couldn’t build the website by herself because she had to help her family with their political campaign, so I got to join her quiz team. The team consists of the following great people: Asanilta Fahda, Audhina Nur Afifah, Catfish Catfight (Eduard Lazarus), and Fikri Khalqih.

We had discussed this for two weeks, but only worked hard on it for around four days before the election. Thankfully, we managed to release it to the public on 15 April 2019, two days before the election.

The website is called PilahPilihPilpres. You can access the code on GitHub.

Suddenly, we went viral

I mean, look at that.

I was excited and worried.

I never thought it would reach that number at all. At this point, my ugly code and bad decisions started to build up and multiply into a monster.

Lesson learned.

Brief Architecture of the Website

The back-end was created using Python. The main function of this script is just to validate the input, communicate with the database, and create the API output.

The whole system is running on Google Cloud Platform.

Why? Have you ever tried AWS? Searching for something is painful. Excuse me, what the heck are EC2, S3, Redshift, …? The point is, my preference, for now, is based solely on how steep the UI learning curve is and on the product naming. GCP has boring names (Compute Engine, Cloud Storage), but they’re easy to understand. That said, I know several products on AWS are better, for example AWS’ FaaS: Lambda.

I digress.

The front-end runs on AppEngine using F1 instances. The back-end runs on Cloud Functions. Lastly, the database runs on Google Datastore. The main reason I chose this stack is that all of them are billed based on usage. No one using it? $0 billed. It seemed like a reasonable combination to me.

Simple, straightforward, and cheap.

The main advantage of this cloud setup is that I don’t have to think about the infrastructure at all. All I have to do is write code, which means faster development and deployment.

Some of you might wonder: why do I even need a database? It’s just a simple one-time quiz with nothing dynamic about the questions or results. Yes, you’re right if you only see the website like that. But I saw it as an opportunity to mine data. As of the time of writing, the data generated hasn’t even been touched yet, but I believe there are insights that could be extracted from it. This is the reason we put the Privacy Policy below the ‘Start’ button on our website’s homepage.

In this write-up, I will try to share what I did wrong when creating the website. Though none of the mistakes broke the website as a whole (e.g. made it completely inaccessible), each of them cost me money.

Lesson 1: Compression

This is an exaggerated illustration, as I can’t retrieve the billing status from that time.

It was still only 6 hours after the release, so all other billing metrics were still well within the free quota as expected, except one: Outgoing Bandwidth. The number looked surprising, but after I ran the math, it was reasonable.

So, I realized that the presidential candidate photos were still quite big (around 100KB each at the time). One of these images is shown to the user every time they reach the quiz’s result page.

No, this is not an endorsement. This is just a sample image of a presidential candidate.

So, the next thing I did was compress and resize the images. The compression resulted in images of around 30KB each, roughly a third of the original size.

What else could I optimize? I remembered that HTTP responses can be gzip-compressed. So, I found a GitHub wiki page in the NextJS repository on how to enable compression and implemented it.
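To get a feel for why gzip is worth enabling for text responses, here is a minimal Python sketch. It is not the NextJS configuration used on the site; the HTML snippet and the `gzip_ratio` helper are made up for illustration. Text such as HTML, CSS, JavaScript and JSON tends to compress well, whereas JPEG photos are already compressed and gain very little from gzip, which is why the images had to be resized separately.

```python
import gzip


def gzip_ratio(payload: bytes) -> float:
    """Return compressed size divided by original size (smaller is better)."""
    return len(gzip.compress(payload)) / len(payload)


# Hypothetical, repetitive markup standing in for a rendered quiz page.
sample_html = (
    "<div class='question'><p>Sample question text</p>"
    "<button class='btn btn-primary'>Answer</button></div>\n"
) * 200
raw = sample_html.encode("utf-8")

print(f"original: {len(raw)} bytes")
print(f"gzipped:  {len(gzip.compress(raw))} bytes")
print(f"ratio:    {gzip_ratio(raw):.3f}")
```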

I can’t say that it helped a lot, but I saw a dip in the AppEngine outgoing bytes around the time I implemented these.

Lesson learned: Though your website might seem small, a lot of incoming requests, especially from new users (so, no cache available), could cost you a lot of bandwidth. Even a 10 KB website could cost you 5 GB when half a million users access it.
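The arithmetic behind that number can be sketched in a few lines of Python. The assumptions are mine: decimal units (1 GB = 1,000,000 KB), and every one of the half a million visitors downloading the payload exactly once with a cold cache. The `egress_gb` helper is just for illustration.

```python
KB_PER_GB = 1_000_000  # decimal units: 1 GB = 1,000,000 KB


def egress_gb(payload_kb: float, uncached_visits: int) -> float:
    """Outgoing bandwidth in GB if every visit downloads payload_kb once."""
    return payload_kb * uncached_visits / KB_PER_GB


visits = 500_000  # "half a million users", assumed one uncached load each

print(egress_gb(10, visits))   # a 10 KB page          -> 5.0 GB
print(egress_gb(100, visits))  # a ~100 KB photo       -> 50.0 GB
print(egress_gb(30, visits))   # the ~30 KB compressed -> 15.0 GB
```

Under these assumptions, shrinking the result photo from about 100 KB to about 30 KB saves roughly 35 GB of egress over half a million result pages.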

Lesson 2: Database

So, I had a quiz database to build. What did the (kind of) relational diagram for the database look like?

In this design, there is an entity called Quiz that has many Question entities. Each Question has an Answer from each User. So, accessing the data for a single user could cost me (Q + A) reads or writes: Q Question entities plus A Answer entities. In this case, that translates into 16 questions plus 16 answers, so at least 32 reads/writes per user instead of just two!

I realized this too late; the damage to the billing account was already done. The better (though I don’t know whether it’s the best) design that I’ve come up with for the database structure is as follows:

This way, the number of database reads/writes goes down by a factor of 16, because accessing the data is really just accessing a key. To get the questions, the back-end only has to retrieve 1 Quiz entity; then, to get a user’s answers, it only accesses 1 Answer entity (the key consists of the Quiz key and the User key, both already known, thus O(1) retrieval).
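The difference between the two designs can be sketched with a toy in-memory store that counts reads. This is not the Datastore client and not the code from the repository; `CountingStore`, the tuple-shaped keys, and the entity fields are all hypothetical, and the only thing modelled is "one get by key = one billed read".

```python
from typing import Any, Dict, Hashable

NUM_QUESTIONS = 16  # the quiz in this write-up has 16 questions


class CountingStore:
    """Toy key-value store that counts entity reads (not the real Datastore)."""

    def __init__(self) -> None:
        self._data: Dict[Hashable, Any] = {}
        self.reads = 0

    def put(self, key: Hashable, entity: Any) -> None:
        self._data[key] = entity

    def get(self, key: Hashable) -> Any:
        self.reads += 1
        return self._data[key]


def seed(store: CountingStore, quiz_id: str, user_id: str) -> None:
    """Store the same quiz and answers in both layouts."""
    texts = [f"Question {i}" for i in range(NUM_QUESTIONS)]
    choices = ["A"] * NUM_QUESTIONS
    # First design: one entity per Question and one per (Question, User) Answer.
    for i in range(NUM_QUESTIONS):
        store.put(("Question", quiz_id, i), {"text": texts[i]})
        store.put(("Answer", quiz_id, i, user_id), {"choice": choices[i]})
    # Second design: one Quiz entity, one Answer entity keyed by (Quiz, User).
    store.put(("Quiz", quiz_id), {"questions": texts})
    store.put(("Answer", quiz_id, user_id), {"choices": choices})


def reads_first_design(quiz_id: str, user_id: str) -> int:
    store = CountingStore()
    seed(store, quiz_id, user_id)
    for i in range(NUM_QUESTIONS):
        store.get(("Question", quiz_id, i))
        store.get(("Answer", quiz_id, i, user_id))
    return store.reads


def reads_second_design(quiz_id: str, user_id: str) -> int:
    store = CountingStore()
    seed(store, quiz_id, user_id)
    store.get(("Quiz", quiz_id))             # all questions in one entity
    store.get(("Answer", quiz_id, user_id))  # all of this user's answers
    return store.reads


print(reads_first_design("pilpres", "user-1"))   # 32
print(reads_second_design("pilpres", "user-1"))  # 2
```

Because both parts of the composite Answer key are known before the request reaches the database, no query is needed; the lookup is a direct get by key.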

You might also notice another stupid mistake in the structure: I forgot to include a timestamp. A timestamp is really important when you want to use and analyze the data you’ve collected.

Lesson learned: Key-value and relational databases are different kinds of database and should be used differently. Whenever you design for a key-value database, always strive for an O(1) read by key. With that in mind, the database design will be more efficient. The design will still vary based on the application, though.

Lesson 3: Browser Compatibility

I decided to include a way to keep track of users without requiring them to log in or give any personally identifiable data (such as a name). This way, I could keep track of how many retries a user made. I used fingerprintjs2 to generate the fingerprint.

Why do I suspect fingerprinting is causing the compatibility issue? There are two types of user error that were continuously reported to us.

  1. When you access the PilahPilihPilpres homepage, you are forced to generate the fingerprint before you can access the quiz. For some users, this leaves them completely stuck on the homepage. (Thanks, @dani_yp for reporting)
  2. The question page was not fully protected against users without a fingerprint, so a user without one could access it if they got the URL right. However, even after that was fixed, users could sometimes still finish the quiz without any fingerprint detected in their cookies.

As you can see, both problems are related to the fingerprint, hence my suspicion.

Why is it hard to prove it?

Almost 90% of our users accessed our website from mobile phones, based on statistics from Google Analytics.

Statistics from Google Analytics about the operating systems used by our users.

When users access our website from a desktop, debugging a problem based on their OS and browser is easy using a VM. Browsers and operating systems on the desktop are also less diverse than on mobile phones.

I tried installing the same browser that my friend said he got stuck on. It’s a browser from Xiaomi. My phone is also a Xiaomi, but it runs Android One, so I didn’t get that browser installed by default. After I installed the browser, I tried to access the website and everything worked properly. I’m stuck.

If any of you know how to better debug this issue, please let me know.

Lesson learned: To be honest, there’s a lot to be fixed here. Here are some.

  • Mistake #1: Do not create a problem that doesn’t even exist. It starts with the need for fingerprinting, which in this case could have been fully replaced by a cookie. At the time I decided to use fingerprinting, I was just afraid that people would delete the cookie and thus generate a new, different identifier. Turns out I was just imagining that the majority of people even know how to do this. By ignoring this problem, I could have just generated an identifier using UUID, put it in a cookie or local storage, and called it a day.
  • Mistake #2: Always have a fallback. Leaving users stuck on the home page is the worst thing that could happen on a website. There should have been a timeout on the home page (if I still decided to use fingerprinting); after the timeout, a fallback identifier generation should kick in (e.g. UUID or whatever the browser can do on the client side). The back-end should also have a fallback for whenever a request doesn’t contain any user identifier.
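The back-end half of that fallback could look roughly like the Python sketch below. It is not the code from the repository: the cookie name `visitor_id`, the `resolve_visitor_id` helper, and the order of preference (existing cookie, then fingerprint, then a fresh UUID) are assumptions made for illustration.

```python
import uuid
from typing import Mapping, Optional, Tuple

COOKIE_NAME = "visitor_id"  # hypothetical cookie name


def resolve_visitor_id(
    cookies: Mapping[str, str],
    fingerprint: Optional[str] = None,
) -> Tuple[str, bool]:
    """Return (visitor_id, is_new) and never fail for lack of an identifier.

    Order of preference: an identifier already stored in the cookie, then a
    fingerprint sent by the client, then a freshly generated random UUID.
    """
    existing = cookies.get(COOKIE_NAME)
    if existing:
        return existing, False
    if fingerprint:
        return fingerprint, True
    # Fallback: the request carried no identifier at all.
    return uuid.uuid4().hex, True


# A returning visitor keeps their identifier.
print(resolve_visitor_id({"visitor_id": "abc123"}))
# A visitor whose fingerprinting never finished still gets an identifier.
visitor_id, is_new = resolve_visitor_id({})
print(len(visitor_id), is_new)
```

When `is_new` is true, the response would set the cookie so the same identifier comes back on the next request; that part depends on the web framework and is left out here.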

Lesson 4: Keeping Secret

The main concern with this development setup is also a stupid one: no proper secret management.

I learned it the hard way: ALWAYS HAVE PROPER SECRET MANAGEMENT, even if you think you will never publish your code or put it in production.

One of the media outlets that interviewed Asanilta Fahda wanted me to publish the code so the fairness of our system could be checked and to ensure there was no foul play behind the scenes when calculating the quiz result. I panicked a bit: the reCaptcha secret, the user token salt, and the publicly accessible cloud function endpoint URL were all hardcoded.

Just deleting those secrets and putting them in a proper environment file does not remove them from the git history. At this point, I found out that someone had already created a tool for fixing this stupid mistake. On the GitHub help page, there were two alternatives I could use: git-filter-branch (which looked too complicated for me given the time I had) or a tool called BFG.

The tool saved me a lot of time, though you can still see my sins in the git history (the tool only replaces the sensitive data with the word ***REMOVED***). After applying it, I thought: “OK, this is safe enough, I’ll push it and make the repo public”.

Then, here comes my best security engineer friend: @visats. He found my hash salt. I had forgotten to add the compiled Python files (*.pyc) and the pycache folder to the gitignore file, so they were included in the repository. He decompiled one of them back into a Python script and the salt showed up in plain text. Thus, I had to delete the compiled files and the cache folder from the git history as well.

Lesson learned: Never hardcode a secret, whatever the reason, no exceptions. If you did, delete it completely; deleting it only in the latest commit won’t be enough. Double-check for compiled files in the repository that may contain the secret. Much better and safer is to reset or change all the secrets after making this kind of mistake.
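One common way to keep secrets out of the source is to read them from environment variables at startup and fail loudly when one is missing. The Python sketch below illustrates that idea under my own assumptions; the `require_secret` helper and the variable names `RECAPTCHA_SECRET` and `USER_TOKEN_SALT` are hypothetical, not the names used in the repository.

```python
import os
from typing import Mapping


class MissingSecretError(RuntimeError):
    """Raised when a required secret is not configured."""


def require_secret(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Read a secret from the environment instead of hardcoding it."""
    value = environ.get(name)
    if not value:
        # Fail early and name the variable, but never print its value.
        raise MissingSecretError(f"environment variable {name} is not set")
    return value


# Demonstration with a fake environment; real values live outside the repo.
fake_env = {"RECAPTCHA_SECRET": "dummy-recaptcha", "USER_TOKEN_SALT": "dummy-salt"}
recaptcha_secret = require_secret("RECAPTCHA_SECRET", fake_env)
print(len(recaptcha_secret) > 0)

try:
    require_secret("USER_TOKEN_SALT", {})
except MissingSecretError as err:
    print(err)
```

The file that holds the real values, along with compiled artifacts such as *.pyc files, then belongs in the gitignore file from the very first commit.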

Lesson 5: Website Serving

One thing that clearly could be improved in terms of serving is making the website front-end fully static, so I could just put the code on some kind of storage / CDN. Thus, no hosting needed. I had heard about doing this with AWS S3 from a colleague, but I had never really tried it and was afraid that the cost would be higher (that seems unlikely, though).

Thanks to Muhammad Mustadi, who told me via Twitter about serving static websites on Netlify and Zeit, which could make the process easier. He also taught me how to make the code static and what I should change in it. Unfortunately, I didn’t have time, and doing so wouldn’t decrease the cost significantly now that the website is not as busy as it was the day before the election. So, for now, I haven’t implemented it.

Lesson learned: Making the website static could reduce your cost, with no hosting needed. Though not implemented here, this knowledge could be used in future projects. To make a static website (on NextJS), you have to make sure that getInitialProps does not access the context, and all cookies should be handled client-side or replaced with the browser’s local storage (basically, there shouldn’t be any code that has to run on the server side). For further information about this, NextJS already has a good tutorial.

Several more lessons…

Caching

Fewer Back-end

Cloud Function with Datastore

Analytics Tool

User Interface

The first iteration of the UI already looks like the final one. Simply because we have no time to develop it further :D

For the UI, I only used Bootstrap v4 with small modifications here and there. The mobile-first design paid off well, as almost 90% of our users used a mobile phone to access our website.

You know when you’ve created a good product…

One of the users who really wanted to use the website

But after I built this website, which has good content thanks to the awesome teammates who did the research, my perspective kind of changed.

When you create a good product, whatever error your users face, they will simply come back and try as hard as they can to use the product.

And then, …

I would really like to thank Indonesians for your support and enthusiasm for our work. To the software engineers who starred the repository, thank you! I know full well that it is bad code, not something to be used as an example, and your appreciation despite how bad it is means a lot to me.

And most importantly, thank you Asanilta Fahda for giving me the opportunity to create something (again, I guess*) for Indonesia.

(*) Anyone can help me develop Hoax Analyzer? ;)

You might also find other design mistakes, bugs, or stupidity in my code on GitHub. I would really love to hear from you so I can learn more and get better. Please message me on any social media or leave a comment. Thanks! :)

Check out my new project at https://kutu.dev