At the beginning of 2018, when I was packing my bags to move from Moscow to Toronto, my dear colleagues, whom I didn’t know yet, were receiving an increasing number of requests for insights into how Ada’s virtual assistant was doing. How many conversations did the virtual assistant start that received a response from the user? How many conversations happened on a mobile device? What are the most popular responses the virtual assistant gives to French-speaking customers?
Both the Development and Customer Success teams rushed to answer those questions. Ada was barely a dozen people at that time, and all of them did their best to give clients their insights. But we had a huge feature roadmap for 2018, so it was not until late fall that the analytics feature got the attention it deserved.
What did we have in the beginning?
When I joined, there was no dedicated storage for analytics data. We had production MongoDB clusters hosted in the cloud in a few different regions. The largest cluster held about 124 GB of data at that time. Didn’t look like Big Data to me, eh?
Since we provide a virtual assistant platform, the production database stores all the data related to the conversations our assistants had (messages, conversation details, etc.).
Unfortunately, aggregating that data, or even counting daily conversations, took forever. We used the MongoDB time-bucket pattern for aggregating some of the metrics (for example, conversations per day).
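For context, the time-bucket pattern keeps pre-aggregated counters in one document per time period instead of counting raw documents on every read. A minimal sketch of the per-conversation upsert it implies (collection and field names here are hypothetical, not our actual schema):

```python
from datetime import datetime, timezone

def daily_bucket_update(conversation_started_at):
    """Build a (filter, update) pair that increments a per-day counter
    in a bucket document; meant for collection.update_one(f, u, upsert=True)."""
    day = conversation_started_at.strftime("%Y-%m-%d")
    filter_doc = {"_id": "conversations:" + day}
    update_doc = {
        "$inc": {"count": 1},            # bump the daily counter
        "$setOnInsert": {"date": day},   # set only when the bucket is created
    }
    return filter_doc, update_doc

f, u = daily_bucket_update(datetime(2018, 11, 2, 15, 30, tzinfo=timezone.utc))
# With pymongo this would be: db.daily_stats.update_one(f, u, upsert=True)
```

Reading a daily metric then becomes a single-document lookup instead of a scan over millions of conversation documents.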
Why build a new analytics page and data warehouse?
While aggregated data is great, both internal and external clients wanted more insight into Ada’s performance for various reasons. Product wanted to know how new features were doing, the Automated Customer Experience team wanted to help clients tune their bots, clients wanted to assess the solution’s efficiency on their side, and so on.
The Development team started building custom Python scripts that dug into the production database, scraped the desired data, and generated CSV reports. This started to happen more often for a few different clients. Then came requests for recurring data exports, and non-developers wanted to generate some reports without our help.
So we used ndscheduler to schedule the execution of custom Python scripts. The number of scripts kept growing, and a few developers were already working on them nearly 100% of their time. That’s when we understood we needed a change.
Project requirements and constraints.
Everyone in the company was hyped about rebuilding analytics. The document with all the “wishes” looked exciting and intimidating at the same time.
We managed to narrow down the requirements to:
- All users should be able to filter analytics data by different parameters like language, device, etc. in Ada’s dashboard.
- Analytics data should be accessible to internal users through SQL queries.
- Analytics doesn’t have to be real-time but should update at least once per day.
- Analytics should be easy to maintain, inexpensive, and scalable.
Still, this list turned out to be too large for a single project, so we split it into the Ada’s Analytics Dashboard and Data Warehouse projects.
Analytics Dashboard covered the front-end redesign and the backend SQL layer implementation, while Data Warehouse was dedicated to setting up an analytics database and ELT. This “divide and conquer” approach helped us implement the two projects simultaneously.
What technology stack to choose?
When choosing a technology stack, I assessed performance, cost, ease of use for non-technical folks, maintainability, and scalability.
ETL + ELK
ELK stands for Elasticsearch, Logstash, Kibana and is a powerful stack for real-time log analysis. Performance depends on the amount of RAM available to the cluster: if the full index fits in RAM, search has close to zero latency. This solution also supports storing similar information in one cluster to enhance speed. ELK can be hard to maintain if the index grows big, but scaling is achieved by adding new nodes. Cost is also reasonable. Unfortunately, ELK has very limited SQL capabilities and is built for search rather than data digging, and the Elasticsearch query language takes some time to master.
MongoDB connector + BI tool
Our production data lives in MongoDB, so the easy win might have been a connector that would allow people to query Mongo with SQL. This solution is cheap (free for us as Atlas users) and easy to use, but it was not clear how it would scale, or how to do transformations of the data beyond simple mapping. It seemed like a good solution for quick MongoDB exploration with SQL, but it was not clear how it would help us achieve the goal of revamping Ada’s dashboard.
Hadoop + Spark + SQL database + BI Tool
I am no Big Data expert, but I figured out that a stack like this is the most popular setup for Big Data analysis. It is powerful, scalable, and supports SQL digging, but it is close to impossible to implement and maintain for a bunch of people with no experience in it.
AWS Athena + AWS S3
We were already using AWS infrastructure, so ease of use and maintenance should have been the main advantage of this stack. The disadvantage, however, is that for Ada’s dashboard we still needed a backend SQL database and a way to transfer data into it. AWS Athena pricing is based on the amount of scanned data, which implies using it as a tool for data exploration rather than as a SQL API for a backend data source.
MongoDB + ETL + AWS Redshift + BI Tool
This stack looked like the most suitable one for our needs. Redshift is fairly cheap, scales well, and supports SQL. Most BI tools support Redshift as a source. The only disadvantage is that Redshift lives on AWS, but it can be replaced with a vendor-free PostgreSQL solution if required.
What ETL tool?
There are plenty of them, both SaaS and standalone libraries. I felt we’d better not use any SaaS, in order to keep full control of our ETL process. We ended up building our pipelines on Apache Airflow.
Current analytics architecture.
The ELT pipeline was implemented with a functional approach in mind. Every few minutes, the Airflow DAG responsible for syncing data between MongoDB and Redshift kicks off. It fetches all documents appended since the previous sync and converts them into Redshift rows.
Converted rows are saved to a staging table in Redshift; then rows with the same keys are removed from the production tables, and the staging tables are appended to production. This approach allows us to run backfills efficiently.
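The staging-table swap boils down to a delete-then-insert inside one transaction. A minimal sketch of the statements involved (table and key names are illustrative, not our actual schema):

```python
def upsert_from_staging(table, staging_table, key="_id"):
    """Generate the SQL that merges a staging table into its production
    table inside a single transaction: delete collisions, append, clean up."""
    return (
        "BEGIN;\n"
        f"DELETE FROM {table} USING {staging_table} "
        f"WHERE {table}.{key} = {staging_table}.{key};\n"
        f"INSERT INTO {table} SELECT * FROM {staging_table};\n"
        f"TRUNCATE {staging_table};\n"
        "COMMIT;"
    )

sql = upsert_from_staging("conversations", "conversations_staging")
```

Because the delete targets only keys present in the staging table, re-running the sync over an old time range simply overwrites the same rows, which is what makes backfills cheap.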
After the data is synced, we can start the Transform part of ELT. SQL queries transform the synced data into denormalized analytics tables. Some columns in the synced data are in JSON format, because it is easier to extract complex objects inside a document from MongoDB as JSON. The JSON_EXTRACT_PATH_TEXT function (available in both PostgreSQL and Redshift) handles the transformation of such columns.
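For example, if a synced column holds a JSON blob, a transform query might flatten fields out of it roughly like this (table, schema, and path names are made up for illustration):

```sql
-- Flatten a JSON column into plain columns during the Transform step.
INSERT INTO analytics.conversations_denormalized
SELECT
    c._id,
    JSON_EXTRACT_PATH_TEXT(c.metadata_json, 'device', 'type') AS device_type,
    JSON_EXTRACT_PATH_TEXT(c.metadata_json, 'language')       AS language
FROM staging.conversations c;
```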
We process about 1 million documents per day. The ELT process is incremental and parallelized, which allows us to perform each incremental update in less than 10 minutes.
All our infrastructure is Dockerized and managed by Kubernetes, so we wanted to run Airflow as an additional service on each Kubernetes cluster. Another wish was to make Airflow task execution parallel and scalable, which is especially important for backfills, when Airflow needs to process a lot of data.
To Pod Operator or not to Pod Operator.
There are a few ways to achieve scalability and parallel execution for Airflow; however, since the technology is pretty new, there is no single right way to do it. One option we looked at was the Kubernetes Pod Operator along with the Kubernetes Executor. Essentially, this approach creates a pod for each task, which scales nicely, but it’s hard to develop and test in a local environment.
Another option, the one we are using right now, is more popular and better tested by the community: the Python Operator along with the Celery executor. This architecture is easier to handle locally, and it scales by increasing the number of pods running Celery workers. We found that a couple of workers can easily execute several dozen concurrent Airflow tasks.
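The relevant knobs live in airflow.cfg; a rough sketch of what a Celery-executor setup looks like (all values and hostnames below are illustrative, not our production settings):

```ini
[core]
executor = CeleryExecutor
parallelism = 32          ; max tasks running across the whole installation
dag_concurrency = 16      ; max tasks running per DAG

[celery]
worker_concurrency = 16   ; tasks one Celery worker pod will run at once
broker_url = redis://airflow-redis:6379/0
result_backend = db+postgresql://airflow:password@airflow-db/airflow
```

Scaling out is then a matter of increasing the replica count of the worker deployment in Kubernetes.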
Managing Airflow reporting database connections.
Each Airflow task requires a SQL connection to Airflow’s internal database so it can update its execution status. We are using PostgreSQL, and it runs out of connections quickly when a lot of parallel tasks are running on the Airflow pods. The solution to that issue is to set up PgBouncer. In a nutshell, this service keeps a pool of open connections, so Airflow tasks reuse them instead of opening a new one every time a task is spawned.
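A minimal sketch of the PgBouncer side (hostnames, pool sizes, and file paths are placeholders; the right numbers depend on your worker count):

```ini
[databases]
airflow = host=airflow-db port=5432 dbname=airflow

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
pool_mode = transaction      ; return the connection to the pool after each transaction
max_client_conn = 500        ; many short-lived connections from Airflow tasks
default_pool_size = 20       ; few real connections held open to PostgreSQL
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
```

Airflow’s sql_alchemy_conn is then pointed at port 6432 instead of PostgreSQL directly.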
There are three main clients of our Redshift data warehouse: Ada’s dashboard, people digging into the data manually or with a BI tool, and automated data exports. We also use Redshift resources for data transformation.
Each of these clients has different requirements, so we defined two WLM queues. The fast queue, with 10 concurrent slots and 30% of the memory, serves Ada’s dashboard analytics; its queries are usually fast and small (aggregations by date with filters). The second queue serves everything else: heavy human analytics queries, BI tools, and transformations. It has 70% of the memory and 5 concurrent slots.
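In a Redshift parameter group this maps to a WLM configuration roughly like the following (the query group name is made up; an unnamed last queue acts as the default):

```json
[
  {
    "query_group": ["dashboard"],
    "query_concurrency": 10,
    "memory_percent_to_use": 30
  },
  {
    "query_concurrency": 5,
    "memory_percent_to_use": 70
  }
]
```

The dashboard backend then runs `SET query_group TO 'dashboard';` before its queries so they are routed to the fast queue.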
We use 4 dc2.large nodes, which cost us around $750/month; all our data takes under 100 GB at the time of writing. CPU usage rarely reaches 100%, but when it does, it is usually our own fault for setting up the cluster the wrong way.
On that note, if you are considering switching to Redshift keep in mind that it is a distributed database that uses columnar storage and massively parallel processing. This means Redshift is not an easy tool you can set up with a few clicks. I would highly recommend researching how it actually works (data distribution, caching, WLM queues, query optimization, query compilation, etc.) before rolling it out to the production environment.
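As a taste of what that research buys you: distribution and sort keys have to be chosen per table, and a poor choice forces Redshift to shuffle data between nodes on every join. A hypothetical table definition (the schema is made up for illustration):

```sql
-- Collocate rows for the same bot on one node slice and keep them
-- sorted by date, so dashboard aggregations scan a minimal range.
CREATE TABLE analytics.conversations (
    conversation_id VARCHAR(24),
    bot_id          VARCHAR(24),
    started_at      TIMESTAMP,
    device_type     VARCHAR(16),
    language        VARCHAR(8)
)
DISTSTYLE KEY
DISTKEY (bot_id)
SORTKEY (started_at);
```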
Two main points of failure we care about are AWS Redshift and Airflow.
- To make sure the ELT infrastructure is alive (Airflow pods are alive and well), we use a DAG whose only task is to ping Dead Man’s Snitch every 10 minutes. PagerDuty notifies us if Dead Man’s Snitch stops seeing activity.
- To see notifications about failed DAGs in Slack, we use Prometheus along with an Airflow plugin that exposes task statuses on a separate endpoint that Prometheus scrapes as a target.
- On the backend side, we log all SQLAlchemy exceptions to LogDNA, which can send an alert whenever an exception appears in the logs.
All alerts are aggregated in a dedicated Slack channel.
Redshift is (obviously) not the best choice for the backend database.
Redshift does an awesome job at data transformation and data digging; however, pointing the user-facing part of an application at it should only be a temporary solution.
The main issue is that query compilation takes time and significantly slows down the first execution of a query. So queries that execute often are lightning-fast, while rarely used ones take some time to compile.
Since queries can be long, synchronous execution blocks endpoints from doing any other work, which decreases the overall throughput of the system. A better approach would be running queries asynchronously or using PostgreSQL instead of Redshift.
ETL is too slow, do ELT instead.
When I started working on this project, I took the naive approach: I created a Python script for each analytics table and transformed objects from MongoDB into something useful inside an Airflow DAG.
This turned out to be completely useless. A backfill for one month of data was taking a few hours. We missed our development deadline and threw out this code altogether. Luckily, a more experienced data engineer had joined our team by that time and pointed out that, with some SQL magic, we could transform the data inside Redshift. However, ETL is still a valid solution if you use the right tool for it, like Apache Spark, but that was clearly out of scope for our project.
Airflow is cool.
Airflow showed itself to be a fairly simple and reliable tool, something you do not usually expect from a young open-source project.
If you have a few different clusters, you will need to set up a single point of access.
Many web services have separate clusters for each region they operate in, and Ada is no exception. Having one database per cluster works well for the backend, but for data exploration, your data analyst will probably want to see all the data in one place (or at least access it from one place). One way to accomplish that is to set up a federated database. We are currently in the process of making a PostgreSQL instance our federated database.
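With PostgreSQL, the federation layer can be sketched with the postgres_fdw extension; hostnames, credentials, and column lists below are placeholders, and the exact options depend on what the regional clusters speak:

```sql
CREATE EXTENSION IF NOT EXISTS postgres_fdw;

-- One foreign server per regional cluster.
CREATE SERVER region_us FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host 'analytics-us.internal', port '5432', dbname 'analytics');

CREATE USER MAPPING FOR analyst SERVER region_us
    OPTIONS (user 'readonly', password 'secret');

-- Expose a remote table locally so analysts query everything in one place.
CREATE FOREIGN TABLE conversations_us (
    conversation_id text,
    started_at      timestamp,
    language        text
) SERVER region_us OPTIONS (schema_name 'analytics', table_name 'conversations');
```

Analysts then point their BI tool at the single federated instance instead of juggling one connection per region.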
Data Infrastructure will take some time to build.
This was the most important non-technical lesson I learned as the Development Lead of this project.
Be realistic in your estimates. Be bold in conveying these estimates and the scope to stakeholders. My teammates and I spent some time grasping all the new concepts and technologies before building a production-grade solution, because we aimed for a high-quality and reliable product. All the changes described above took about 4 months for a team of 1 to 4 people to implement.
I would order the parts of the project by the amount of time we spent on them like this:
- Frontend part of Ada’s dashboard. That part was completely rebuilt; one person worked on it, and for half of that time it was a co-op student.
- Airflow setup. We were new to the technology, so it took some time to set up, but in the end it definitely paid off.
- ETL development. Our first attempt at figuring out how to scrape data from the production database was lengthy and taught us a good lesson.
- Project research. The first month of this project was dedicated to various POCs, technology research, and requirements clarification. I believe this preparation helped us figure out the right scope and set correct expectations for everyone in the company.
- ELT development. Once we decided we needed some SQL magic, it took some time to actually implement it.
Although few of us had relevant experience in data processing and organization, the effort we put into researching, planning, and managing the requirements paid off. As a result, the backend SQL data layer, the Redshift setup, the backend and frontend integration, and even bug fixes in both data processing and dashboard functionality took less time, which contributed to the success of the project.
One thing we noticed is that it is hard to provide the same digging capabilities within Ada’s dashboard as BI tools do. Well, that’s probably because we are not aiming to build a BI tool. Our clients still want batch data exports, and we still use custom Python scripts with the scheduler to produce them. The next step will probably be a data API, which will allow exports straight from our newly built data warehouse.
This whole Analytics project would be impossible without my brilliant teammates. Thank you, dudes! 🤗