Platform for success: The Telegraph’s big data transformation
At The Telegraph, the Data Team is a cross-functional group of engineers that manages the core big data platform. It has the responsibility to orchestrate hundreds of batch and real-time data pipelines in order to ensure that the process of data ingestion and transformation is performed on time and is compliant with the highest data quality standards.
This allows analysts to provide reports and insights and the business to make decisions based on reliable figures. Also, The Telegraph’s big data platform is used to support the Data Science Team in delivering predictive models, and the Data Team has an important role to automate the training process and deploy the models in production.
When I joined The Telegraph in 2015, the journey to become a data-driven company was in its infancy. The business was eager to have more data and easier access to information but the infrastructure in place was not yet ready to support decisions in the proper way. Most of the data processing that the data team was doing at that time was done through a single Hadoop cluster used to run few data pipelines on daily bases. But even managing that single cluster was an arduous task due to the lack of direct control on the infrastructure and all the dependencies involved in every minimum change needed.
A major problem at the time was also the waste of computational resources; some pipelines were very demanding but only for a short period of time and our infrastructure was not scaling as it should, which always meant that some nodes were left idle. It became clear that some changes were necessary in order to support the growing digital business with a secure, resilient and auditable data platform able to scale, drive decisions in multiple areas and allow the kind of flexibility that is always paramount in Agile environments. After evaluating all the challenges that the current infrastructure was carrying with it we decided to move away from Hadoop and AWS and as a data team gain more control over our platform.
At the end of 2015, after a few discussions and demos, we were really impressed by all the possibilities that Google Cloud Platform was providing. Bigquery, the data warehouse solution proposed by Google, immediately caught our attention. No infrastructure was needed, there was an affordable pricing model, low response times and high scalability — it seemed exactly what we were looking for! So at the beginning of 2016, we unanimously decided to adopt GCP and the team started to build what is now known as the TMG Datalake.
The first challenge we had was to redesign in a future-proof way our orchestration mechanism and data pipelines, to avoid any waste of resources and allow scalability. Instead of using a cluster of workers we decided to take a different approach and focus on creating single workers on demand and from them push all the computational effort to cloud services when possible. We picked Azkaban as a scheduler and orchestrator and we adopted a very simple paradigm: each time a data pipeline runs, a new worker is created to perform the ETL task. Instead of being restricted to the computational power of a single worker and processing the data in the virtual machine, all the work (when possible) is pushed to BigQuery and the data transformation and aggregation is performed there. After all the SQL queries are executed with success, the result is persisted in our data warehouse and the worker is terminated.
GCP’s Learning curve was less steep than expected and after a couple of months, our first data pipelines to ingest hitlog data from our website and mobile application were ready to be tested. The running time of both processes turned out to be much shorter compared to the results obtained with Hadoop and the data was easily accessible to anyone with a Google account and the right level of permissions. This first iteration proved that the new stack was more performant and easier to manage than the old one and gave us a solid case to continue building the platform.
Before going in further development we decided to take a step back and invest our energies in designing what would become the new Telegraph Datalake. The main goal was to democratize data access across the entire company and provide both insights layer and the possibility to perform discovery analysis on raw data. Also, it was really important to establish a basic data governance before progress further. We decided to avoid any monolithic approach and build our Datalake in multiple iterations reflecting the Agile philosophy that the company was starting to follow during that period.
After a few discussions we opted for the following structure:
- Raw data layer: Contains all the data that are ingested both in batch and real-time without any type of transformation. All the data sources ingestion processes are designed to be independent when possible in order to allow modularity.
- Shared facts and dimensions layer: Contains facts and dimensions that can be used across multiple projects with the purpose to provide a guidance on the usage of some specific metrics and dimensions.
- Serving/Reporting layer: Contains aggregated data and is built on the top of the previous two. Every project is designed in a separate silo and might use a different way to serve the data to the end user. Two projects don’t interact with each other and they are only sharing the two-layer below. They can be developed independently and in parallel even from different product teams.
Following this approach, our data platform was growing at a fast pace and for more than one year the team focused on ingesting all the available data sources, define data lineage for the main projects and orchestrate multiple pipelines to serve reports on time. We also started to build dashboards and microservices in order to serve a different type of consumers.
After a few months of work TMG Datalake was already storing terabytes of data, but it was not driving any action directly. The need to use data to improve how we interact with our users was clear since the beginning, but it was only when the platform reached the right level of maturity that we started to see possibilities to drive user experience on Telegraph website and mobile application from data. For this purpose, we developed specific data pipelines with the goal to improve user segmentation and use data in our CRM to provide a better experience to our users. This is still an ongoing process and what we can do with Adobe Audience Manager and Adobe Target is having a more significant impact with every day that passes.
Since the beginning of our journey we designed the Datalake to support real-time data processing, but we started to explore how to work with streams for more than a year afterwards. The platform was not mature enough yet and we were looking for use-cases able to bring real value to the business.
Then, in 2017, we started to think which area might benefit more from ingesting and processing real-time data and we moved our focus on the Telegraph newsroom. We began to run a few experiments using Google Cloud Dataflow, with the idea to consume website hitlogs in real-time. It was our first attempt to deploy in production real-time data pipeline and support with that a product that was supposed to be in front of everyone in the company.
A few months later in the newsroom our first real-time dashboard was live. The dashboard (below) displays, with a minimum delay, which type of content leads to new registrations and subscriptions in order to assess which articles and editorial channels are performing better than others.
All the numbers displayed here are for illustrative purposes only.
At the end of 2017, the data platform was still growing and reached more than 200 data pipelines to orchestrate and run, some of them in real-time, others on a daily or weekly basis. Even the stack of technologies (below), which in the beginning was restricted to a few tools, was growing to cover all the use-cases and satisfy different areas of the business.
It was time to introduce some changes, refine our work and reconsider some decisions that we took at the beginning of our journey. We had always worked trying to follow certain standards, but no one was forcing us on the right path if not for common sense and QAs. For this reason, we adopted Giter8 as a templating tool. This allowed us to automatically generate the skeletons of our pipelines and microservices in order to standardize as much as possible every product.
We also started to observe with interest other areas and held discussions with different engineers at the company, to find new ways to improve our day-to-day job. Continuous delivery and continuous integration was already a hot topic at that time and we started to think how to apply what normally is used for microservices to data pipelines in order to deploy in production with the minimum effort an high-quality product.
In 2018, our Jenkins CD pipeline was ready to be tested and we were able to minimize dependencies from scripts, local machines and deploy artifacts in production with a controlled and standardized process.
A few months later we also decided to reconsider the usage of virtual machines. We were already using GKE (Kubernetes) to deploy microservices and it seemed natural to move in the direction to dockerize our data pipelines and orchestrate them as Kubernetes batch jobs.
For this reason we spent most of the summer to refine and improve our templates and prepare our infrastructure to manage docker images. In September 2018 we finally closed the loop deploying in production the first data pipeline on Kubernetes.
It is now almost the end of 2018 and we recently finished defining the direction that we want to take for the next year and where to focus our efforts in order to improve our platform.
The plan for 2019 is to focus on:
- Improving our Python ETL library in order to increase development speed, code reusability, and stability of our pipelines. Once ready the library will become open source and shared with the entire community. All the tools that we use to interact with cloud resources in GCP and AWS will be available for everyone. In this way, we hope to receive feedbacks, contributions and make the same journey, that we started a few years ago, easier for who is approaching the cloud now.
- Decommission Azkaban in favor of Google Cloud Composer (Airflow) in order to remove a single point of failure from our solution and migrate to a better-supported technology.
- Release templates for Apache Beam pipelines and standardize the deployment process.
- Improve monitoring logging and alerting with a closer integration with Slack, Stackdriver, and Jira to automatically raise trackable tickets.
- Start to provide ETL tools and infrastructure to other teams at the Telegraph in order to help them automate their day to day job in a secure and resilient environment.
During these years we built our data platform like a puzzle, every day adding a new piece and sometimes removing one that was not fitting properly. This puzzle is not finished yet and it will never be. There is always space to improve what has been done in past, new ideas to follow, new technologies to explore. But in the end, what it really matters is to see people excited when a new product is live or they can start to play with a new piece of information and make better decisions based on it. This is the best reward for all our efforts.
Stefano Solimito is a Principal Data Engineer at The Telegraph
Connect on Linkedin