This is the continuation of the story of my pet project, which went viral before it was complete and ready. It involves a lot of sleepless nights, the stress that came with them, thoroughly nerdy descriptions and a few twists, but it leads to a great feeling of achievement and the will to improve both the bot and myself.
If you’ve missed the first article covering the history of my “fastest telegram bot” project, you can find it here. LittleGuardian has reached its 5th anniversary, and plenty of things have changed since the last article was published. I’m going to cover the changes but, most importantly, even more of the lessons I’ve learned. The saga of reaching the status of best telegram anti-spam and group management bot continues.
Database, paying the debt.
I have to admit that the database I initially created was the absolute opposite of optimal. Because I wanted everything to be difficult to guess but at the same time easy to identify, I decided to use UUIDs everywhere: the bot configurations table, the message logs used by the AI to learn and identify spammers' behaviour, group settings, even user profiles, all of it in one big UUID mess. I blame myself: I just kept adding new tables for new functions without thinking twice, and sometimes additional columns.
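To put a rough number on why UUID keys get expensive, here is a back-of-the-envelope sketch in Python. It only counts the raw bytes of the key values and relies on the fact that InnoDB secondary indexes carry a copy of the primary key. The storage choices, the row count and the number of secondary indexes are assumptions made for illustration, not figures from my schema, and page overhead is ignored.

```python
import uuid

# Bytes per key value for three possible storage choices (illustrative only).
KEY_BYTES = {
    "uuid_as_text": len(str(uuid.uuid4())),      # 36 characters, e.g. CHAR(36)
    "uuid_as_binary": len(uuid.uuid4().bytes),   # 16 bytes, e.g. BINARY(16)
    "bigint": 8,                                 # 64-bit integer key
}


def raw_key_bytes(rows: int, key_bytes: int, secondary_indexes: int) -> int:
    """Bytes spent on key values alone: once in the primary index and once
    more in every secondary index, which stores a copy of the primary key."""
    return rows * key_bytes * (1 + secondary_indexes)


def compare(rows: int, secondary_indexes: int) -> dict:
    """Return the estimate in GiB for every key type in KEY_BYTES."""
    return {
        name: raw_key_bytes(rows, size, secondary_indexes) / 2**30
        for name, size in KEY_BYTES.items()
    }


if __name__ == "__main__":
    # Hypothetical table: 100 million log rows with 3 secondary indexes.
    for name, gib in compare(100_000_000, 3).items():
        print(f"{name:>15}: {gib:.2f} GiB of key data")
```

The point is not the exact numbers but the ratio: a textual UUID key costs 4.5 times as much as a 64-bit integer, and that cost is repeated in every index.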
Everything was going quite well until the logs table exploded to 50GB, with half of that being indexes. Queries became quite slow, and I needed to upgrade the SQL server twice to handle the increasing traffic. The migration to MySQL 8 caused a total of 8 hours of downtime overnight. That was far from acceptable and extremely stressful, as the main goal of this project was to provide users with the fastest telegram bot, one that lets them manage their groups with ease. It took me two days with my Remarkable: planning the new layout, mapping relations and then coding all the migrations for the new layout.
If you’ve noticed different table names, you are right. I performed the migration on the live service, with the comfort of microservices each using only bits and bobs of the data. The old and new tables lived in the same database, and for three days of testing the services used both versions to verify that nothing was missing and that all the queries worked and received the information they were supposed to. I used this time to write the queries for migrating the most crucial data “when the time comes”. Once I had fixed all the bugs, flipping all the microservices over to the new database was as easy as removing the dependency on one file, and it took exactly 15 minutes.
The natural next step was observing the microservices' behaviour, monitoring all the queries (traditionally: the slow query log and queries not using indexes) and reacting accordingly. All of this work allowed me to scale the SQL instances back down again and remain below 40% utilisation at 600/100 r/w queries per second.
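The idea of running both versions side by side during the testing period can be sketched roughly like this. It is a simplified Python illustration, not the code from my services: `shadow_read`, `read_old`, `read_new` and the `mismatches` list are hypothetical names. The old layout stays the source of truth, and any difference is only recorded for later inspection.

```python
from typing import Callable, List, Tuple, TypeVar

T = TypeVar("T")

# Collected differences between the old and the new layout: (query name, old, new).
mismatches: List[Tuple[str, object, object]] = []


def shadow_read(name: str, read_old: Callable[[], T], read_new: Callable[[], T]) -> T:
    """Serve the result from the old layout, but also run the new query
    and record any difference (or failure) without affecting the caller."""
    old = read_old()
    try:
        new = read_new()
    except Exception as exc:  # a broken new query must never break live traffic
        mismatches.append((name, old, exc))
        return old
    if new != old:
        mismatches.append((name, old, new))
    return old


if __name__ == "__main__":
    # Hypothetical readers standing in for queries against the two layouts.
    settings = shadow_read(
        "group_settings",
        read_old=lambda: {"captcha": True, "language": "en"},
        read_new=lambda: {"captcha": True},
    )
    print(settings, mismatches)
```

Once the list of mismatches stays empty under real traffic, switching the services over to the new layout carries far less risk.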
Lesson learned: Plan the planning, plan for the speed, plan for expansion.
Queues and the ultimate solution
During the move to version 6 of the bot I’ve decided to use RabbitMQ as I’ve had some experience with it, it’s fast and quite easy to manage. As I’ve been running RMQ in the docker container within the cluster, I was forced to re-think my approach because of the following:
- Badly designed application logic: some microservices were declaring the queue or exchange, while the consumers required them to exist. If the declaring microservice was failing, the consumer that needed the queue was failing as well. Yes, I could have used durable queues and exchanges, but because of the rest of the issues listed here I needed another solution.
- I was transporting messages as JSON, which meant converting the whole struct into JSON on the sender side and unmarshalling it back into the struct on the recipient side. It is a little thing, but it consumes both processing power and memory, especially when repeated hundreds of times a second by a dozen services.
- For some awkward reason, messages to the same kind of microservice were sometimes duplicated or processed multiple times. I wasted quite a while debugging this, with no joy.
- The RabbitMQ container started consuming way more CPU and memory than expected, especially after an attempt to increase the size of the cluster to run on multiple nodes.
- Finally, the nail in the coffin: constant connectivity issues. For some unexplained reason, connections within the k8s cluster were closing randomly. Messages were patiently waiting for delivery, but latency went through the roof, with RMQ randomly refusing connections as well.
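For readers who have not seen it, the JSON round trip from the list above looks roughly like this. It is a Python sketch with a made-up `LogEvent` message; the field names are assumptions chosen for illustration and are not the real message format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class LogEvent:
    """Hypothetical message passed between two microservices."""
    chat_id: int
    user_id: int
    text: str


def encode(event: LogEvent) -> bytes:
    """Sender side: turn the whole structure into JSON bytes."""
    return json.dumps(asdict(event)).encode("utf-8")


def decode(payload: bytes) -> LogEvent:
    """Recipient side: parse the JSON and rebuild the structure."""
    return LogEvent(**json.loads(payload.decode("utf-8")))


if __name__ == "__main__":
    original = LogEvent(chat_id=-1001, user_id=42, text="hello")
    wire = encode(original)
    print(len(wire), "bytes on the wire")
    print(decode(wire) == original)
```

Each call is cheap on its own. The cost only becomes visible when every message passes through both steps in every service it touches.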
One day of research pointed me towards NATS.io, which I decided to try as a proof of concept, with the main “bot” microservice (the reader from the telegram APIs) communicating with the logging microservice. Easy to check, simple to verify. The whole library I had written beforehand to handle all the nuances of connectivity and of queue and exchange declaration was replaced by one simple function that connects and sets up the health check. The consumer code became even simpler: messages are auto-mapped into the appropriate structure, which makes them easier to control and debug.
All the issues mentioned disappeared, and migration of all the remaining microservices from RabbitMQ to NATS was done with the ancient copy/paste method over a few hours in an absolutely painless manner.
The only thing I could potentially miss is the dashboard RMQ had. I’ve tested a few of those available for NATS, but none of them was of good quality or fully functional. Maybe if I have some free time I’ll create something and open-source it, but that’s the plan for 2030 as of now.
Lesson learned: Most popular doesn’t always mean the best.
Users are your source of truth.
Any service you create is, and should be, designed with users in mind. Very often it’s not what you find logical and reasonable that counts; it is your users who should have the final say on design and functions. I have noticed that most of the users had ideas or something to say, but they didn’t really know how to explain it, or they were just saying “it doesn’t work”.
Step one: I’ve created the bot knowledge base, which I’m trying to keep up to date. It describes how certain functions work and the concepts behind the settings and design, in language that everyone can understand.
Step two: I’ve added hotjar to the page code to see what users are doing, how they’re doing things and where they ( or rather me ) fail. Plenty of changes in the past few weeks were shipped because of both user direct feedback and the hotjar research. Excellent Medium article on the app itself can be found here.
Step three: Keep your users in the loop. Most businesses want to present themselves as amazing and never making mistakes, believing it builds trust with the user. It’s like when you were a child, going with your parents to see their friends and bringing only the best parts of your life with you. I am happy to admit to mistakes openly, so my users know that I am at least aware of them, not to mention the constant work on improvements. A separate page with the bot news, together with a dedicated announcements channel on telegram, does the job really well.
Step four: Often forgotten: Google Analytics, to get an insight into what users do in the telegram bot user panel. I’ve added custom events for everything that could go wrong, to have an overview of the most common issues and of the mistakes I’ve made. Yes, that I have made. It’s rarely the users' fault. It’s not the quote famously attributed to Steve Jobs, “You’re holding it wrong”. More likely, you gave users the impression that they can do it that way, and that it’s the best way to do it.
Step five: Build an automated “maintenance mode”, starting with automated health checks and endpoints on the status page (thanks to FreshPing) and ending with the website checking itself: it monitors endpoints and traffic and displays a banner, visible to everyone, with information about the works as soon as it detects any possible disruption within the system. There is no harm in doing it, and it definitely decreases users' frustration, making them stay with you.
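The decision behind such a banner can be very small. The sketch below is a simplified Python illustration under my own assumptions: the `CheckResult` structure, the latency threshold and the banner wording are all hypothetical, and the actual probing of the endpoints is left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class CheckResult:
    """Outcome of one health check (hypothetical structure)."""
    name: str
    healthy: bool
    latency_ms: float


def maintenance_banner(results: List[CheckResult],
                       max_latency_ms: float = 1000.0) -> Optional[str]:
    """Return the banner text to display, or None when everything looks fine."""
    failing = sorted(r.name for r in results if not r.healthy)
    slow = sorted(r.name for r in results
                  if r.healthy and r.latency_ms > max_latency_ms)
    if failing:
        return "We are working on an issue affecting: " + ", ".join(failing)
    if slow:
        return "Some features may be slower than usual: " + ", ".join(slow)
    return None


if __name__ == "__main__":
    checks = [
        CheckResult("login", healthy=False, latency_ms=0.0),
        CheckResult("panel", healthy=True, latency_ms=80.0),
    ]
    print(maintenance_banner(checks))
```

A failing check always wins over a slow one, so the banner reports the most serious problem first.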
Lesson learned: Open yourself to the users. Ask for their opinion and do the research. The first impression really matters.
Going worldwide? There’ll be a lot of work.
One month into the new version, thanks to the Google Analytics platform, I had a great overview of the user base, which was… Have a look yourself.
Naturally, I couldn’t expect everyone to speak English, so after consulting my close friend hipertracker and taking his suggestions on board, the work on localisation started. Replacing all the text with i18n definitions was definitely time-consuming, and BabelEdit helped me tremendously at that stage. After setting everything up in English, Polish and Farsi, I had another look at Google Analytics to see which languages my users use, and got professionals to take care of the translations into Arabic, Spanish, Hindi, Indonesian and Russian. My real-life friends and even a neighbour (❤) took care of Turkish ( Emir Taha Bilgen ) and French. The project was all set for the brave new world.
There’s a problem I discovered with paid, external translations. My content changes quite often, and sometimes I want to re-word the documentation to make it easier to understand. I constantly work on new functions, and handling multiple languages at the same time is close to impossible, both time-wise and budget-wise. I’ve tested a few crowd translation applications online and finally settled on Localazy as the best solution.
Localazy seems to be a perfect match for my project's needs. It’s easy to use for the admin, the developer (amazing CLI: pull your translations with a single command, commit and send for the build) and the people helping with translation. Their support is one of the best I’ve ever seen: not only responsive but also proactive and extremely helpful. I’m planning to stay with them for my upcoming projects as well, because, as one of the helpers described it, “Hell yeah! It’s even fun man”. It also gets rid of all the juggling with JSON translation files and the parsing issues that appear because someone forgot a comma or their editor inserted odd curly quotes somewhere along the way.
I’ve opened translation to everyone — so feel free to visit if you’d like to contribute.
Lesson learned: Make a move towards your users. Not everyone speaks English, and even fewer speak good English. Find something that works.
Was the job done? Not really
The real difference between my project and the competitors that have started to appear (which I actually love, because it means healthy competition, new ideas and learning from others, after all) is the reaction time to events. Two weeks ago, my systems spotted issues with Telegram before they were even officially announced.
This week, Login with Telegram went down (and it is still down at the moment of writing, 48 hours later). I spotted it just before midnight on Wednesday night, and after 45 minutes of debugging I reached out to the Telegram support (as the problem is global and every website using login with telegram is affected) to notify them about the issue, then went to bed hoping for a resolution before morning. Unfortunately it didn’t happen, which made me re-think my approach to authorisation. If my users can’t log in, they won’t use my product. I strongly depend on a third party here, so there’s nothing I can do about it. Or is there?
Thanks to the microservices approach and how easy it is to add new functions, I made a few modifications to the telegram bot API library I’ve been using, created the /login command and deployed it to my live bots, updated the website with information (yes I know, it looks a bit wonky now), and a stream of users started flowing. As far as I know, my competitors still have not realised there’s an issue.
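I won’t go into how my /login command works internally, but one common way to build this kind of fallback is a short-lived, signed token that the bot sends to the user as a link and the website verifies. The Python sketch below shows that general technique only. The function names, the token format and the five-minute lifetime are assumptions made for illustration, not a description of my implementation.

```python
import hashlib
import hmac
from typing import Optional


def make_login_token(user_id: int, secret: bytes, now: float, ttl: int = 300) -> str:
    """Build '<user_id>.<expires>.<signature>' for the bot to send as a link."""
    expires = int(now) + ttl
    payload = f"{user_id}.{expires}"
    signature = hmac.new(secret, payload.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"{payload}.{signature}"


def verify_login_token(token: str, secret: bytes, now: float) -> Optional[int]:
    """Return the user id if the token is authentic and not expired, else None."""
    parts = token.split(".")
    if len(parts) != 3:
        return None
    user_id, expires, signature = parts
    expected = hmac.new(secret, f"{user_id}.{expires}".encode("utf-8"),
                        hashlib.sha256).hexdigest()
    # Constant-time comparison; only trust the fields once the signature matches.
    if not hmac.compare_digest(signature.encode("utf-8"), expected.encode("utf-8")):
        return None
    if now > int(expires):
        return None
    return int(user_id)


if __name__ == "__main__":
    import time

    secret = b"replace-with-a-real-secret"  # hypothetical shared secret
    token = make_login_token(42, secret, now=time.time())
    print(verify_login_token(token, secret, now=time.time()))
```

The bot already knows who sent the command, so the website only has to trust the signature and check the expiry.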
Lesson learned: Be proactive, be reactive.
Monitor, alert, analyse and improve.
I’m doing everything to keep the microservices below the 100ms line. The only exception is the ‘joiner’ microservice, which does a ton of calculations for every person joining a group, analysing their past behaviour and running the data against models to feed the AI (obviously, there’s a plan to improve it!). That’s just one of the few charts I keep an eye on; most of the things that should be alerting trigger both text and slack messages.
The effect of every user-facing change deployed to the panel can be observed within Google Analytics. Does it increase user engagement, actions per visit, events? How about the time on the page? How many users have logged in and browsed the “restricted” area? Website statistics are not only there to tell you how many visitors you have. That number alone is kind of pointless, especially when visitors spend a few seconds on your page and leave.
You can see the peak on the 4th of October. I released a new function but forgot to verify all the possible configurations on the user side (and there are quite a lot of them). As a result, people were visiting the panel while some of the functions were offline, which turned the flow the opposite way from the one desired. I’ve learned from that and engaged with the most active group admins from all around the world, who became my beta-testers. Another drop two days ago was caused by the telegram login issues I’ve described earlier.
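Watching a latency budget like this does not need much code. Here is a minimal Python sketch of the idea, assuming a hypothetical `latency_budget` decorator and an in-memory `breaches` list; in a real setup the breach would be pushed to whatever feeds your charts and alerts.

```python
import functools
import time
from typing import Callable, List, Tuple

BUDGET_MS = 100.0  # the 100ms line mentioned above

# Recorded breaches: (handler name, elapsed milliseconds).
breaches: List[Tuple[str, float]] = []


def latency_budget(budget_ms: float = BUDGET_MS) -> Callable:
    """Decorator that times a handler and records calls exceeding the budget."""
    def decorator(handler: Callable) -> Callable:
        @functools.wraps(handler)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return handler(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000.0
                if elapsed_ms > budget_ms:
                    breaches.append((handler.__name__, elapsed_ms))
        return wrapper
    return decorator


@latency_budget()
def handle_message(text: str) -> str:
    """Hypothetical handler standing in for a microservice call."""
    return text.upper()


if __name__ == "__main__":
    handle_message("hello")
    print(breaches)
```

A service with a legitimately heavier workload can simply be given its own, larger budget instead of being left unmeasured.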
Rest of the world is not London.
When you design a user-facing site, you’re quite often tempted to add a few fireworks here and there to make it nicer and more modern for all the visitors. You’ve checked it out: it loads quickly on your Macbook or iPhone over your fibre connection.
Things I’ve learned during this stage of optimisation were as follows:
- 85% of my website visitors are mobile users.
- Traffic comes from all around the world, including countries without the privilege of 4G.
- Every byte and request matters.
With all of this in mind, I started the optimisation by removing unnecessary queries and combining the important ones into one. As for images: okay, the website would look even worse without them. The webp format is supported almost everywhere, yet not by all browsers.
Was that a problem? Not this time because...
Now the user's browser could decide on its own which version of the image it requires.
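One common way to let the browser make that choice is the HTML `<picture>` element, with a webp source and a fallback image in an older format. The small Python helper below generates such markup. The helper name `picture_tag` and the file naming convention (same path, different extension) are assumptions made for illustration.

```python
from html import escape


def picture_tag(base_path: str, alt: str, fallback_ext: str = "png") -> str:
    """Build a <picture> element offering webp first, with a fallback image.

    Browsers that understand webp pick the <source>; the others ignore it
    and load the plain <img>.
    """
    src_webp = escape(f"{base_path}.webp", quote=True)
    src_fallback = escape(f"{base_path}.{fallback_ext}", quote=True)
    return (
        "<picture>"
        f'<source srcset="{src_webp}" type="image/webp">'
        f'<img src="{src_fallback}" alt="{escape(alt, quote=True)}">'
        "</picture>"
    )


if __name__ == "__main__":
    print(picture_tag("/img/logo", "LittleGuardian logo"))
```

Both files still have to exist on the server; the markup only moves the decision to the browser.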
The hunt for the smallest possible downloads moved on to JS, CSS and literally anything else I could make smaller. I almost gave up, but then my friend mentioned Cloudflare, which I gave a go. I definitely made a few mistakes here and there, trying to optimise things a bit too much, but it works, and works completely fine, for a small monthly fee which is absolutely worth it. I can also strongly recommend having a look at their add-ons, which I find absolutely fantastic.
Users were definitely happy, visits increased, the bounce rate went down and, on a side note, Cloudflare also allows traffic from countries usually blocked by default on Google Cloud, because as we know every packet smuggles some contraband.
Lesson learned: You can optimise absolutely everything.
TL;DR — Recommendations
- Google Analytics ( both web + app ) - traffic analysis
- Cloudflare - speeding up the website loading times
- Localazy - anything related to localisation of your app ❤️
- NATS - fast and reliable cloud-native queues
- HotJar - user research and behaviour analysis
- FreshPing - status dashboards
Disclaimer: I am not affiliated with any of the businesses I’ve recommended. I find their products amazing and worth consideration.