The Financial Times has always relied on facts and data to deliver the highest-quality journalism to our readers. A data-driven culture has always been part of the company’s values, so how we manage our data internally is very important to us.
Fundamental to this is having a dedicated central platform for telemetry and data management as part of our FT Core technology group. FT Core is part of FT Product & Technology and owns three of the central technology platforms, which power our customer-facing products spanning content publishing, content metadata, the paywall, and analytics data.
My team is responsible for the Data Platform, the platform for telemetry and analytics data. Our mission is to deliver reliable, high-quality data in a timely manner to internal users and teams at the FT, enabling decision making and new product development.
We have a large and diverse group of direct users and indirect consumers using and benefiting from the valuable data our platform collects and stores:
- Financial Times board members — for strategic and tactical decision making.
- Marketing teams — for campaign design and planning to acquire and retain subscribers.
- Editorial teams — for monitoring the performance and the readership for the articles and content they produce.
- Advertising teams — for identifying subscriber groups to target with different products.
- Our Product teams — to design better products for FT readers and drive the personalization that helps acquire and retain them.
- Analytics, Business Intelligence, Contact Strategy, and Data Science teams are among the main direct users of the Data Platform, using the data to conduct analysis, build dashboards and reports, and train models that are then widely used across the FT.
Let’s review several use cases for the data delivered by the FT Core Data Platform.
Power up Analytics
At the Financial Times we use a variety of business metrics to better understand our impact and the opportunities for future growth. Some of them measure engagement: it is absolutely essential to understand how engaged our readers are, and we do that using a metric called RFV (Recency, Frequency, and Volume).
For every single reader, we look at:
- When did they last come to our site? — this is recency
- How often do they come in any given time period? — this is frequency
- How much content do our customers read when they come to our site? — this is volume
Based on RFV we can determine an engagement score for every single reader.
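To make this concrete, here is a minimal sketch of how recency, frequency, and volume could be folded into a single score. The scalings and weights are illustrative stand-ins, not the FT’s actual RFV formula.

```python
from datetime import date

def rfv_score(last_visit: date, visits_per_week: float,
              articles_per_visit: float, today: date,
              weights=(0.5, 0.3, 0.2)) -> float:
    """Combine recency, frequency, and volume into one engagement score (0-100).

    The component scalings and weights below are illustrative, not the
    FT's actual RFV formula.
    """
    days_since = (today - last_visit).days
    recency = max(0.0, 1.0 - days_since / 30.0)    # decays to 0 over 30 days
    frequency = min(visits_per_week / 7.0, 1.0)    # capped at one visit per day
    volume = min(articles_per_visit / 5.0, 1.0)    # capped at 5 articles per visit
    w_r, w_f, w_v = weights
    return round(100 * (w_r * recency + w_f * frequency + w_v * volume), 1)

# A reader who visited yesterday, ~5 times a week, reading 2 articles per visit
score = rfv_score(date(2021, 6, 14), 5.0, 2.0, today=date(2021, 6, 15))
```

Whatever the exact formula, the point is that three simple behavioural signals collapse into one comparable number per reader.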
Power up Retention
As well as a user’s current level of engagement, it is useful to know who may be about to become engaged or disengaged. The Data Science team at the FT is developing and training RFV predictive models to identify individual and corporate subscribers that are likely to move from engaged to disengaged over the next four weeks, and vice versa. Based on the results of these propensity models, our Customer Care team is able to contact the readers who are likely to disengage and, in many cases, successfully retain them.
Envoy is an internal decision engine used across our customer-facing product line to help build smarter products. It uses the results from the same models to consistently target subscribers predicted to disengage, offering them personalized newsletters and predicting the next best action for them. This is another example of the data-driven culture at the FT, and the raw data comes entirely from the Data Platform.
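As a rough illustration of what a propensity model produces, here is a hand-rolled logistic scorer over a reader’s recent weekly engagement scores. The features and coefficients are made up for illustration; the FT’s real models are trained by the Data Science team.

```python
import math

def disengagement_propensity(weekly_scores: list) -> float:
    """Probability-like score that a reader disengages in the coming weeks.

    A hand-rolled logistic model over the engagement trend; the features
    and coefficients are illustrative stand-ins for the FT's trained models.
    """
    current = weekly_scores[-1]
    trend = weekly_scores[-1] - weekly_scores[0]   # negative when declining
    z = -0.04 * current - 0.08 * trend + 1.0       # illustrative coefficients
    return 1.0 / (1.0 + math.exp(-z))

at_risk = disengagement_propensity([80, 70, 55, 40])   # declining engagement
healthy = disengagement_propensity([75, 78, 80, 82])   # stable engagement
```

A reader whose scores are sliding comes out near the top of the call list for Customer Care, while a stable reader scores low.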
Power up New Product development
When new readers subscribe to FT.com, they provide information about the industry they work in. We store it as part of our arrangement models within the Abstraction Layer, where it is used to enable a variety of convenient features, such as personalized topic recommendations in the subscriber’s personal myFT feed page. This saves our readers time by surfacing relevant content based on their personal preferences.
Power up Internal Tools
Lantern is an internal monitoring tool powered by data from the Data Platform. It is an editorial-focussed tool that provides analytics for the content the Financial Times publishes, and is used by Editorial teams to monitor the performance of the content they produce. The main metric is Quality Reads, which is calculated within the Streams Layer of the Data Platform and served to Lantern with minimal latency.
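As an illustration of the kind of metric involved, the sketch below counts a page view as a quality read when the reader spends at least half of the article’s expected reading time on it. The reading speed and threshold are assumptions; the actual Quality Reads definition is internal to the FT.

```python
def is_quality_read(attention_seconds: float, word_count: int,
                    words_per_minute: float = 250.0,
                    threshold: float = 0.5) -> bool:
    """Illustrative Quality Read check: did the reader spend at least
    `threshold` of the article's expected reading time on the page?
    The real definition used in Lantern is internal to the FT.
    """
    expected_seconds = word_count / words_per_minute * 60.0
    return attention_seconds >= threshold * expected_seconds

# An 800-word article expects ~192s of reading; 120s clears the 50% bar
engaged = is_quality_read(attention_seconds=120, word_count=800)
bounced = is_quality_read(attention_seconds=30, word_count=800)
```

Computing this per event in the Streams Layer is what lets Lantern show editors near-live feedback rather than next-day reports.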
How do we support all these different use cases?
We ingest clickstream data tracking usage of the websites and mobile apps for the digital versions of the Financial Times titles. That data is streamed in near real-time via the Data Platform Streams Layer for further processing. In addition to FT product usage data, we also ingest data from internal and external data vendors: for example, our internal platform managing subscriber memberships and the platform for the content our journalists publish, with its metadata, as well as external systems like Zuora for payments and Salesforce for corporate subscriber contracts.
All the data is stored with minimal latency within the Data Platform Data Lake and further used for building the Abstraction Layer where we generate valuable data models providing insights to the business departments. Clear and well-established data contracts provide those insights to the Analytics Layer where multiple audited metrics are calculated and ready to be used by the internal business departments. The same insights feed the generation of a variety of data science models.
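A data contract at its simplest is a schema both producer and consumer agree on. Here is a toy validator for a clickstream event; the field names are hypothetical, and the FT’s real contracts are far richer.

```python
# A toy page-view contract; field names are hypothetical examples.
REQUIRED_FIELDS = {
    "user_id": str,
    "content_id": str,
    "timestamp": str,
    "device": str,
}

def validate_event(event: dict) -> list:
    """Return the list of contract violations for an event (empty = valid)."""
    errors = []
    for field, ftype in REQUIRED_FIELDS.items():
        if field not in event:
            errors.append(f"missing field: {field}")
        elif not isinstance(event[field], ftype):
            errors.append(f"wrong type for {field}: expected {ftype.__name__}")
    return errors

ok = validate_event({"user_id": "u1", "content_id": "c1",
                     "timestamp": "2021-06-15T10:00:00Z", "device": "web"})
bad = validate_event({"user_id": "u1", "content_id": "c1",
                      "timestamp": "2021-06-15T10:00:00Z"})
```

Enforcing checks like this at each layer boundary is what keeps downstream audited metrics trustworthy.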
Democratizing the data
All the way from the Streams Layer through the Data Lake and Abstraction Layer to the Analytics Layer, the Data Platform ensures data is highly secure, validated, and of the highest quality. We ensure that the insights we generate do not reveal any personally identifiable information, and are thus ready to be democratized and widely used for decision making and driving growth.
To ensure data democratization, we provide tools for better visibility and understanding of the data, along with self-service capabilities for easy data access and for building new data workflows. We are constantly extending the platform to meet the rising demand for data, generate ever more valuable insights, and power machine learning and the dashboards used for business intelligence.
How are we building the Data Platform?
The FT Data Platform lives entirely in the Cloud. We build it using AWS managed services by preference, as they significantly reduce our operational cost.
Our big project now is to consolidate all our Streams Layer services to ingest new data via AWS MSK, with Spark jobs for processing and Apache Airflow for workflow orchestration. Currently we are using Kinesis streams, SNS, and SQS, but our research shows that the proposed approach will give us better scalability and more effective cost management.
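The kind of dependency ordering an Airflow DAG expresses can be sketched with the standard library alone. The task names below are illustrative; in production each step would be an Airflow task, typically running a Spark job against MSK topics.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# A toy ingestion workflow; each key depends on the tasks in its set.
# Task names are illustrative, not our real pipeline steps.
workflow = {
    "validate": {"consume_msk"},
    "enrich": {"validate"},
    "land_in_lake": {"enrich"},
    "update_abstraction": {"land_in_lake"},
}

# Resolve a valid execution order for the whole workflow.
run_order = list(TopologicalSorter(workflow).static_order())
```

An orchestrator like Airflow adds scheduling, retries, and backfills on top, but the core contract is exactly this: tasks run only after their upstream dependencies succeed.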
Our data lands in and is stored in the Data Lake. We are now working on standardizing formats on Parquet in S3, where the data will be available for reading via Redshift Spectrum. Our Abstraction and Analytics Layers reside in the main Enterprise Data Warehouse, built on Redshift. For virtualizing the variety of underlying formats and data stores we plan to use Presto. As an initial evaluation, we plan to run our own deployment of Presto within AWS (aka “vanilla” Presto); long term, if the concept proves itself, we may migrate to one of the managed Presto services.
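For reference, exposing Parquet files in S3 to Redshift Spectrum amounts to declaring an external table over the S3 prefix. The schema, table, column, and bucket names below are all hypothetical examples, not our real ones.

```python
# Hypothetical names throughout (schema, table, columns, bucket).
# The DDL shape is standard Redshift Spectrum external-table syntax.
spectrum_ddl = """
CREATE EXTERNAL TABLE spectrum.page_views (
    user_id     VARCHAR(64),
    content_id  VARCHAR(64),
    event_time  TIMESTAMP
)
PARTITIONED BY (event_date DATE)
STORED AS PARQUET
LOCATION 's3://example-data-lake/page_views/';
"""
```

Partitioning by date means Spectrum scans only the S3 prefixes a query actually needs, which keeps scan costs down.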
As the foundation of the platform, we are now moving from EC2 and ECS to EKS. We anticipate this migration will significantly increase the platform’s scalability and operational cost-effectiveness. For data lineage, we are planning to try Apache Atlas. For monitoring and observing the system’s health, we will continue relying on Grafana, Splunk, and our internal monitoring platforms BizOps and Heimdall. For monitoring data quality and automating this process across internal FT systems, we built and integrated into the platform a homegrown Data Quality Metrics Checks framework.
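To give a flavour of what such a framework automates, here is a minimal data-quality check in the same spirit: compute a metric over a batch of rows and compare it to a threshold. The metric, column names, and threshold are illustrative, not the framework’s real API.

```python
def null_rate(rows: list, column: str) -> float:
    """Fraction of rows where `column` is missing or None."""
    return sum(1 for r in rows if r.get(column) is None) / len(rows)

def check_null_rate(rows: list, column: str, max_rate: float) -> dict:
    """One data-quality metric check, in the spirit of our internal
    framework (names and thresholds here are illustrative)."""
    rate = null_rate(rows, column)
    return {"check": f"null_rate({column})",
            "value": rate,
            "passed": rate <= max_rate}

rows = [{"user_id": "a"}, {"user_id": None},
        {"user_id": "b"}, {"user_id": "c"}]
result = check_null_rate(rows, "user_id", max_rate=0.3)
```

Running dozens of checks like this on every load, and alerting on failures, is what turns data quality from an ad-hoc audit into an automated process.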
How are we enabling the data to be consumed?
To enable our tech teams to easily build new workflows, and our product teams to do product discovery and development more effectively, we recently developed a new capability. We named it E2 internally, which stands for Execution Environment but also relates nicely to Euler’s number. It is a processing engine for running different algorithms, such as data science models and machine learning, on top of the data in our Data Lake. Our ambition is to support scalable workflows written in a variety of programming technologies, like R, Python, Java, and Spark.
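The multi-technology dispatch idea can be sketched as a small runner registry: workflows declare which technology they need, and the engine routes the job to the matching runner. E2’s real API is internal to the FT; everything below is an illustrative toy.

```python
# Toy sketch of an execution environment: register a runner per technology,
# then dispatch jobs to the right one. Names and shapes are illustrative.
RUNNERS = {}

def runner(technology: str):
    """Decorator registering a runner for one workflow technology."""
    def register(fn):
        RUNNERS[technology] = fn
        return fn
    return register

@runner("python")
def run_python(job, data):
    # In a real engine this would provision resources and stream
    # Data Lake inputs; here the job is just a callable over the data.
    return job(data)

def execute(technology: str, job, data):
    if technology not in RUNNERS:
        raise ValueError(f"no runner for {technology!r}")
    return RUNNERS[technology](job, data)

# A trivial "model": the mean of a numeric column extract
result = execute("python", lambda rows: sum(rows) / len(rows), [3, 4, 5])
```

Adding R or Spark support then becomes a matter of registering another runner, rather than changing every workflow.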
We will continue sharing our adventures and challenges in building our core platforms using the latest technology trends. Stay tuned, and if you recognize our mission as yours, consider joining us on this exciting journey!