Automation & Data Engineering: A Journey Towards Data Accuracy

Kyaw Zaw Win
Technology @ Sleek
4 min read · Jul 12, 2024

Disclaimer: I am a Senior Full-Stack Software Engineer at Sleek.

In this article, I describe how we overcame the challenges of running thousands of continuous record extractions to keep our platform data accurate in near real-time. I will also touch on how we made the bot scalable and resilient enough to avoid human intervention as much as possible: it automatically re-triggers itself whenever unexpected problems occur, which keeps it cost- and time-effective while maintaining data accuracy against the source.

Background

As with any modern tech-enabled company, our data grows day by day, and we need to ensure the data on our platform is accurate. This means our platform data must match the source of truth as soon as possible. Initially, we used the API service provided by the data source provider. However, our data source providers often expose only limited information over their APIs, which was not sufficient for our needs. Hence, we had no choice but to turn to robotic process automation (RPA) to overcome these challenges, which not only helped us achieve our goal of data accuracy but also saved us time and cost.

Beginning of the Exciting Journey

One challenge of using RPA is dealing with rate limits and bot detection. More often than not, we need data from such protected sites to conduct business effectively. Initially, we tried the undetected-chromedriver library, but we found it did not work most of the time, for both local residential IP addresses and cloud IP addresses. Therefore, we had to develop our own solution through trial and error, and we finally succeeded in extracting tens of thousands of records repeatedly and continuously from bot-detection-protected web portals using our bot.
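For context, a first attempt with undetected-chromedriver usually looks something like the sketch below. This is a minimal example assuming Python and the undetected-chromedriver package; the portal URL is a placeholder, and as noted above this approach alone was not enough in our case.

```python
# Minimal sketch of an undetected-chromedriver attempt (assumes Python and
# the undetected-chromedriver package; the record URL is a placeholder).
import undetected_chromedriver as uc

options = uc.ChromeOptions()
# Non-headless mode: headless browsers are flagged far more aggressively.
options.add_argument("--start-maximized")

driver = uc.Chrome(options=options)
try:
    driver.get("https://portal.example.com/records/12345")  # placeholder URL
    html = driver.page_source
    # ... parse the record out of `html` here ...
finally:
    driver.quit()
```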

Here are some key experiences from our journey:

  • Data Cleanup: Ensure data is clean (filtered) before extraction. Clean data is crucial for successful automation and effective AI model training.
  • Avoid Retrying the Same Record: Rate limits typically key off IP addresses, the browser, and user-agent/session IDs. Querying the same record repeatedly in quick succession will lead to a temporary ban of 4 hours or more.
  • Limit Continuous Extraction: Never extract more than 100 records consecutively within a session, and never retry the same record after a bot detection error. In our experience there is a chance of hitting a random bot detection error roughly once every 50 continuous records in a session, and retrying only raises suspicion. So why extract 100 records rather than fewer than 50? The right limit depends on the bot's architecture on your robot platform provider and on parameters such as the timeout, estimated processing time per record, schedule frequency, the number of available workers and their average performance, and so on (a minimal batching sketch follows this list).
  • Pause Between Batches: Pause for a couple of seconds between batches.
  • Mimic Human Actions: Move the cursor or select elements randomly to mimic human behaviour between extractions.
  • Use Non-Headless Mode: Always use non-headless mode.
  • Use Residential IPs: Use residential IPs instead of cloud IPs whenever possible. We deploy our bot on cloud infrastructure, but the outgoing traffic goes through our own Internet gateway. The proxy approach (both private and well-known proxies) did not work in our case. Cloud IPs do work, but they get banned more easily and for longer periods than residential IP addresses.
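To make these rules concrete, here is a minimal batching sketch in Python. It is not our production code: the batch size, pause range, and the `extract_record` / `mimic_human_action` callables are hypothetical placeholders you would replace with your own logic and thresholds.

```python
import random
import time

MAX_RECORDS_PER_SESSION = 100   # never exceed this within one browser session
BATCH_SIZE = 10                 # hypothetical batch size
PAUSE_RANGE_SECONDS = (2, 5)    # short random pause between batches


class BotDetectedError(Exception):
    """Raised by the extraction step when the portal flags us as a bot (hypothetical)."""


def run_session(driver, records, extract_record, mimic_human_action):
    """Extract up to MAX_RECORDS_PER_SESSION records, never retrying a record on bot detection.

    `extract_record(driver, record)` and `mimic_human_action(driver)` are
    hypothetical callables supplied by the caller.
    """
    processed = 0
    for start in range(0, len(records), BATCH_SIZE):
        for record in records[start:start + BATCH_SIZE]:
            if processed >= MAX_RECORDS_PER_SESSION:
                return  # stop this session; leftover records go to the next run
            try:
                extract_record(driver, record)
            except BotDetectedError:
                continue  # skip it: do NOT retry the same record in this session
            mimic_human_action(driver)  # random cursor moves / element selections
            processed += 1
        # Pause for a couple of seconds between batches.
        time.sleep(random.uniform(*PAUSE_RANGE_SECONDS))
```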

Here is how we deploy our bot.

Bot setup

In addition, choosing the right automation platform was crucial for us. We chose Robocorp (https://robocorp.com/), a great automation platform that gives us flexibility and allows us to design our bot in a scalable way. We can complete our process in a shorter time by adding more worker agents, since we can process chunks of records in parallel.

Moreover, deploying automation bots to the Robocorp platform is fast, and setting up worker agents is easy and quick. It does not require large instances; we run our worker agents on medium instances with 2 vCPUs instead of the recommended 4-core CPUs. The API support for starting and stopping the automation process is excellent, allowing us to fully automate and seamlessly integrate with the microservices that power our platform. This gives us a good way to scale the bot horizontally as our data grows faster.
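As an illustration of that integration, here is a rough sketch of how a microservice could start a process run over HTTP. The Control Room URL, the workspace/process path, and the RC-WSKEY authorization scheme are assumptions based on Robocorp's public API documentation, so verify the exact endpoint against the current docs; the environment variable names are placeholders.

```python
import os
import requests

# Assumed Control Room endpoint shape and RC-WSKEY auth scheme; verify against
# the current Robocorp API documentation before relying on this.
WORKSPACE_ID = os.environ["RC_WORKSPACE_ID"]   # placeholder configuration
PROCESS_ID = os.environ["RC_PROCESS_ID"]
API_KEY = os.environ["RC_API_KEY"]


def trigger_process_run() -> str:
    """Start a process run in Control Room and return its run id (sketch)."""
    url = (
        "https://cloud.robocorp.com/api/v1/"
        f"workspaces/{WORKSPACE_ID}/processes/{PROCESS_ID}/runs"
    )
    response = requests.post(url, headers={"Authorization": f"RC-WSKEY {API_KEY}"})
    response.raise_for_status()
    return response.json().get("id", "")


if __name__ == "__main__":
    print("Started run:", trigger_process_run())
```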

Finally, it is important to design the bot architecture in a way that contrasts with traditional application development. In traditional client-server development, we focus on scaling the server side to handle concurrent and simultaneous traffic while keeping the performance of each request or unit of work consistent. Designing a scalable bot should be approached differently: scaling it up means considering how much traffic is sent to the target website, and balancing that against the cost of running the bot.
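To illustrate that balance, here is a small sketch of how a backlog of records can be split into chunks for parallel workers, with a cap on how many chunks are dispatched per scheduled run so the traffic sent to the website (and the cost of running workers) stays bounded. The chunk size, cap, and `send_chunk` callable are hypothetical.

```python
from typing import Callable, Iterable, List

CHUNK_SIZE = 100            # matches the per-session extraction limit above
MAX_CHUNKS_PER_RUN = 4      # hypothetical cap on traffic (and cost) per scheduled run


def make_chunks(records: List[dict], size: int = CHUNK_SIZE) -> Iterable[List[dict]]:
    """Split the record backlog into per-worker chunks."""
    for start in range(0, len(records), size):
        yield records[start:start + size]


def dispatch(records: List[dict], send_chunk: Callable[[List[dict]], None]) -> None:
    """Hand at most MAX_CHUNKS_PER_RUN chunks to workers in one scheduled run.

    `send_chunk` is a hypothetical callable, e.g. creating a work item or
    triggering a worker process for one chunk.
    """
    for i, chunk in enumerate(make_chunks(records)):
        if i >= MAX_CHUNKS_PER_RUN:
            break  # leave the rest of the backlog for the next scheduled run
        send_chunk(chunk)
```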

I hope these technical tidbits reduce your barrier to bot adoption!
