A little bit of context
The basic infrastructure at heycar was built in 6 weeks. Of course, at that time it was a very raw platform; despite that, it was a huge achievement.
Our core differentiation is the quality of what we offer, so we all have to keep an eye on the user experience. With that in mind, one of the first things I noticed when I joined the company was the huge payload of the search results page, caused mostly by our images.
The market we aimed at is quite secluded, so we had to adapt to it. The images come from integrations with car dealerships, merged into a unified API, and we had to deal with whatever arrived. Most of the time they have a reasonable input size, but they are neither normalised nor optimised. Sometimes you get a raw 2 MB image straight from the camera.
We’ve invested a bit of time in gathering data about size, format, and aspect ratio.
Later we measured the impact of the images on the web page, and it was surprising:
The average size of a single image was 380 KB.
Given that we display 18 vehicles per page, that’s almost 7 MB of data.
It was obviously a big deal, given the percentage of mobile devices we serve content to.
One of my first assignments was to figure out a way to optimise that.
The first option was to take advantage of our fan-out architecture: resize images to a reasonable resolution and quality at ingestion time, and serve that version.
The good part is that it doesn’t impact load time, and it makes caching easy. However, it is very limited with regard to whatever might come next, e.g. more mobile support, or different pages needing different image sizes…
All of that would require us to implement something flexible and fast, so as not to impact the time-to-market of our listings. We want them to be published as fast as possible from the moment they are sent to us.
The second option was to use something that resizes the images at request time. This creates an inversion of control: the client decides which images it wants and how.
That inversion of control empowers the client, creating flexibility for an unlimited number of variants, e.g.:
“https://image-service.hey.car/a23b07d21.jpg?w=100” — would give us a resized version of the image with a width of 100px.
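On the client side, such URLs are just string building. Here is a minimal sketch of the kind of helper a consumer might use — the host mirrors the example above, and the `w`/`h` query parameters and function name are purely illustrative, not our actual client code:

```python
from urllib.parse import urlencode

# Hypothetical image-service host, taken from the example URL above
BASE_URL = "https://image-service.hey.car"


def resized_image_url(image_id: str, width: int = None, height: int = None) -> str:
    """Build a client-controlled resize URL in the query-string style shown above."""
    params = {}
    if width:
        params["w"] = width
    if height:
        params["h"] = height
    query = urlencode(params)
    return f"{BASE_URL}/{image_id}" + (f"?{query}" if query else "")


# The client decides the dimensions it needs per page/layout:
thumbnail = resized_image_url("a23b07d21.jpg", width=100)
```

Each page (search results, detail view, mobile layout) can ask for exactly the size it needs, which is the flexibility the first option lacked.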
However, resizing at request time affects the final page load time if the server takes too long to resize each image.
Time to compare
Both options seemed reasonable. The first would require implementation and would impact publishing time. For the second, there are plenty of open-source tools, eliminating the need for implementation. No code is the best code, right?
Yes, it does impact page load time, but only for uncached requests. The worst part is having another user-facing service to maintain and monitor.
After the comparison, our decision was to go with the second option, using Thumbor, which gave us a lot of features out of the box and pretty much no need for coding on the server side.
It won’t impact the time-to-market of our listings, and it is a battle-tested solution, which eases the maintenance worry.
Thumbor is a smart imaging service. It enables on-demand crop, resizing and flipping of images, … save time and money in your company with Thumbor
There are plenty of options, of course, but we couldn’t find anything more mature or battle-tested than Thumbor.
Besides its core functionality, Thumbor has a huge number of features, e.g. face recognition, smart cropping, browser-specific optimisation, filters, watermarking, …
That made it a jack-of-all-trades for us: it helped us shape images for integration feeds and for internal image-classification tools too.
We needed a Docker image, and for some reason Thumbor doesn’t have an official one. Therefore, we had to build our own, which is not tricky, since Thumbor is a Python package.
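A minimal Dockerfile can look roughly like this — the base image, config path, and port are illustrative, not our exact setup:

```dockerfile
# Illustrative sketch; our real image lives in the open-source repo linked below.
FROM python:slim

# Thumbor is distributed as a regular Python package
RUN pip install thumbor

# Ship our own configuration (path is an example)
COPY thumbor.conf /etc/thumbor.conf

EXPOSE 8000
CMD ["thumbor", "--port=8000", "--conf=/etc/thumbor.conf"]
```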
Our Thumbor docker image is open-source and available here.
Once Thumbor was dockerised, we just had to push it to our Kubernetes cluster. To serve it, we hooked it up to our ingress controller (nginx) and set up a DNS record from Route 53 to our load balancer.
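In Kubernetes terms, that boils down to a Deployment plus a Service exposed through the ingress. This is a hedged sketch — names, labels, image reference, and ports are examples rather than our real manifests:

```yaml
# Illustrative manifests only -- not our production configuration.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: thumbor
spec:
  replicas: 3            # the instance count we started with
  selector:
    matchLabels:
      app: thumbor
  template:
    metadata:
      labels:
        app: thumbor
    spec:
      containers:
        - name: thumbor
          image: our-registry/thumbor:latest   # hypothetical image reference
          ports:
            - containerPort: 8000
          resources:
            requests:
              cpu: "1"        # 1 CPU / 2 GB per instance, as described below
              memory: 2Gi
---
apiVersion: v1
kind: Service
metadata:
  name: thumbor
spec:
  selector:
    app: thumbor
  ports:
    - port: 80
      targetPort: 8000
```

The nginx ingress then routes the image-service hostname to this Service, and the Route 53 record points at the load balancer in front of the ingress.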
We had foreseen the need for a caching layer, and CloudFront fits like a glove in this scenario.
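CloudFront decides how long to keep an object based on the Cache-Control headers the origin sends, and Thumbor lets you set those from its configuration file (which is plain Python). The values below are illustrative, not our production settings:

```python
# thumbor.conf -- cache-related settings (values are examples)
MAX_AGE = 24 * 60 * 60     # Cache-Control max-age for successful responses (1 day)
MAX_AGE_TEMP_IMAGE = 0     # don't let error/temporary images stick in the CDN
```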
It looks like this:
Switching was easy: we had the whole infrastructure in place before migrating the clients, which were the ones requesting the images.
We load tested first, then rolled out the clients one by one, and the infrastructure handled it just fine.
Obviously, serving resized images live has an impact on page load time. But it was much smaller than we first thought.
We started with 3 instances, each with 1 CPU and 2 GB of memory, which handled our traffic at the time just fine.
We managed to reduce the payload by a factor of 10; images now average less than 40 KB, as you can see in this PageSpeed test:
Latency is also very low: less than 200 ms even for uncached responses, and less than 60 ms for cached ones.
The total size of vehicle images on this page is around 600 KB! Roughly 10 times less than the almost 7 MB we started with. Our mobile users cheer!
Monitoring & Maintaining
Monitoring is quite easy with Thumbor: it uses statsd by default, so you can easily hook it into your infrastructure. We have been using Datadog, which works with statsd out of the box, but you can use open-source tools like Prometheus as well.
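Wiring that up is a matter of a few lines in Thumbor’s configuration. The host and prefix below are hypothetical — point them at whatever statsd-compatible agent (e.g. the Datadog agent) runs in your cluster:

```python
# thumbor.conf -- metrics section (host and prefix are examples)
METRICS = 'thumbor.metrics.statsd_metrics'   # emit metrics via statsd
STATSD_HOST = 'datadog-agent.monitoring'     # hypothetical agent address
STATSD_PORT = 8125
STATSD_PREFIX = 'thumbor'
```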
During the last 6 months using Thumbor, we went from 100 req/s on our website up to a World Cup TV campaign peaking at 2.5k req/s. Thumbor was never an issue!
We do have autoscaling for Thumbor based on CPU usage: if usage gets too intense for too long, it spawns new pods to handle the traffic spike. That has been running unchanged for pretty much 6 months.
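In Kubernetes, that kind of CPU-based scaling is a HorizontalPodAutoscaler. A sketch along these lines captures the idea — the threshold and replica bounds here are illustrative, not our production values:

```yaml
# Illustrative HPA -- thresholds and bounds are examples.
apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: thumbor
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: thumbor
  minReplicas: 3                         # the baseline we started with
  maxReplicas: 15                        # the most we ever ran during a spike
  targetCPUUtilizationPercentage: 70     # scale out when average CPU stays above this
```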
The only manual intervention we have done was during the spikes of the TV campaigns, where there is no time to react: traffic jumps 100x in 15 seconds.
At the highest traffic spike, we were using 15 instances of Thumbor behind our load balancer, which was overkill; we could have survived with 10.
You can check, on the chart below, our stats for the last ~6 months with Thumbor:
Please note that this chart represents only the uncached requests; the CloudFront ones are covered in the caching section below.
In our context, the same images are displayed for 60–90 days at most, which is very interesting because it leaves room for a high cache-hit rate on the CDN side.
During normal operation, we reach ~90% of images served from cache.
A lot of new listings are ingested during the mornings, so the rate drops then. Also, during the TV spots, a lot of inventory gets “discovered” by users, which pushes us to much lower rates — but they are still quite high by caching standards.
As for volume, we jump from ~90 GB per day up to almost 400 GB per day during TV spots, yet that almost only impacts AWS servers.
The main thing to take from this is the power of open-source platforms: they can take you very far.
Sometimes we jump to paid solutions or, even worse, to implementing something of our own that is not our company’s core business, when we could easily use something open-source.
If you have a good environment, it should be easy enough to use them without having to code your own or pay for a third-party one.
Open source isn’t only about saving money, it’s about doing more stuff, and getting incremental innovation with the finite budget you have — Jim Whitehurst, Red Hat’s CEO
There are no silver bullets, but you can invest some time in at least trying the OSS solution. In the worst case, you’ll learn more about the requirements of the tool you’ll need to buy or implement.
Thanks for reading! If you enjoyed this, please remember to give it a thumbs up and follow us for more insights on how we create heycar.
And, if you want to work with infrastructure, thumbor, kubernetes or any related topic, take a look at our careers page: