A Parallelized Approach to Audience Creation in Facebook using Kafka

Nick Nathan
unified-engineering
Sep 3, 2019 · 5 min read

Brand marketers and marketing agencies want greater insight into who their customers are and how they can more effectively target various customer segments. Unified helps solve this problem by using a brand’s anonymized customer data to create audiences on social media platforms. Information from the social platform about these audiences can then be combined with third-party data and analyzed to better understand a brand’s customer profile. To make this possible we must ingest and store millions of customer records from our marketing clients and then upload those records to the social platforms. This presents a complex technical challenge, not only because of the data scale but also because of a heavy dependency on third-party platforms and their APIs, which may or may not be reliable or performant.

This post focuses on our effort to build a performant solution to the problem of audience creation in Facebook. Our first attempt was straightforward from an architectural perspective and represents the type of solution that comes from the need to deliver value as quickly as possible, before the full scope of a problem is understood. First, we ingested our marketing clients’ customer data into our data warehousing solution, Amazon Redshift. Next, we built a Python application, which we named the “Audience Generator,” that queried customer data via another Python-based API and then loaded those customer records directly to Facebook.
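To make the sequential design concrete, here is a rough sketch of that first version. The function and client names are placeholders rather than the actual codebase; the shape of the loop is the point: a single process walks through every audience, end to end, one at a time.

```python
# A rough sketch of the first, sequential version of the Audience Generator.
# The api_client and upload_to_facebook callables are placeholders for the
# internal Python API (backed by Redshift) and the Facebook upload step.
def run_audience_generator(client_id, api_client, upload_to_facebook):
    for audience_definition in api_client.list_audiences(client_id):
        # Pull every customer record for this audience from the warehouse...
        records = api_client.fetch_customer_records(audience_definition)
        # ...then push them straight to Facebook before moving to the next one.
        upload_to_facebook(audience_definition, records)
```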

This initial solution was sufficient for creating audiences for our early customers; however, as the business grew and the data scale increased, the application became increasingly fragile and runtimes ballooned from approximately 8 hours to more than 60 hours. Furthermore, when the application crashed, the only option was to restart the entire process from the beginning. A failure at hour 45, for example, was devastating because there was no good way to salvage the work already completed. It quickly became clear that another solution was needed to ensure stability and reliability and to improve performance.

The success criteria defined by the engineering team for an alternative approach consisted of two things:

  1. Improve the runtime for the audience generator application so that it would complete in under 24 hours for our largest clients.
  2. Improve the application stability so that in the event of failure it could resume work from where it had failed.

To achieve the first objective, the team had to completely re-imagine the application’s architecture. We identified that the primary performance issue was that each audience was created sequentially. Because of constraints imposed by Facebook’s API, only 5,000 customer records could reliably be uploaded to an audience at a time. Many audiences, however, contained millions of customer records, meaning it could take several minutes to create a single audience. Even if each audience took on average only one minute to create, 3,000 audiences would take approximately 50 hours.
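The batching constraint looked roughly like the sketch below. The upload_batch callable is a stand-in for the actual Facebook Marketing API call, but it shows why a multi-million-record audience turns into hundreds of sequential requests.

```python
# Illustrative only: records are split into batches because roughly 5,000
# customer records can reliably be uploaded to a Facebook audience per request.
# A 1,000,000-record audience therefore becomes ~200 sequential uploads.
def chunk(records, batch_size=5000):
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def upload_audience(facebook_audience_id, records, upload_batch):
    # upload_batch is a placeholder for the real Facebook API call.
    for batch in chunk(records):
        upload_batch(facebook_audience_id, batch)
```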

Because there was little the team could do to reduce the time to create a single audience, given our dependency on the Facebook API, we decided to parallelize the entire process. To make this possible, the engineering team refactored the Audience Generator to treat the creation of a single audience as an isolated unit of work represented by a JSON object. These audience creation tasks could be constructed from predefined configuration as well as the customer data itself, then loaded onto a Kafka queue for processing. Once in Kafka, a group of generic consumer processes would read tasks off the queue and execute them independently of one another. As a result, instead of relying on a single worker, the team could deploy an arbitrary number of workers acting simultaneously. Finally, to address the massive increase in data scale, the team migrated the customer data ingestion pipeline into a data lake. This not only reduced the cost of data storage for existing clients but also gave the team the ability to scale the application to new customers without worrying about performance degradation.
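A minimal sketch of that producer/consumer split, using the open source kafka-python client, is shown below. The topic names, task fields, and the process_task helper are illustrative assumptions rather than Unified’s actual implementation.

```python
import json
from kafka import KafkaProducer, KafkaConsumer

# Producer side: each audience becomes one self-contained JSON task on the queue.
producer = KafkaProducer(
    bootstrap_servers="kafka:9092",
    value_serializer=lambda task: json.dumps(task).encode("utf-8"),
)

def enqueue_audience_task(audience_id, client_id, record_query):
    task = {
        "audience_id": audience_id,
        "client_id": client_id,
        "record_query": record_query,
    }
    producer.send("audience-creation-tasks", value=task)

# Consumer side: any number of these generic workers can run in parallel;
# Kafka balances the topic's partitions across the consumer group.
def run_worker(process_task):
    consumer = KafkaConsumer(
        "audience-creation-tasks",
        bootstrap_servers="kafka:9092",
        group_id="audience-generator-workers",
        value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    )
    for message in consumer:
        process_task(message.value)  # create this one audience in Facebook
```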

To achieve the second objective, the team needed a way to track the progress of the application and preserve any work done in the event of a failure. The initial iteration of the application stored the results of each audience creation task, e.g. metadata about audience sizes and the unique identifiers returned by Facebook, in a flat file on the application server, which it would then copy to an S3 bucket upon completion. Once loaded to S3, the flat file was ingested by a data quality script that verified all audiences had been created successfully using the correct customer data. This was a problem because, in the event of a failure, the application had no way to read the flat file, identify the missing work, and restart the generation process for only those missing audiences. Furthermore, there was no way to identify problems that did NOT crash the application, because the team had to wait for the entire process to finish before any data quality work could begin.

The solution was to replace the flat file with a database table and record the status of each audience as it moved through the creation workflow. This way, in the event of a failure, the team could query the table, identify any incomplete audiences, and create new messages to load back onto the Kafka queue. Using a table instead of a flat file also gave us greater visibility into the workflow and set the stage for implementing a status tracking UI. Finally, the team built a secondary Kafka queue to hold messages used to check the quality of audiences as they are generated. As soon as an audience was created, a data quality check message was loaded onto this secondary queue, so that issues could surface early in the workflow if a high number of audiences were failing their checks, and the entire process could complete faster.
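Below is a minimal sketch of the resume and data quality logic, assuming a relational status table and the same JSON-serializing kafka-python producer as above. The table name, columns, and topic names are hypothetical.

```python
# Assumes an audience_status table with one row per audience and a status
# column, plus the JSON-serializing KafkaProducer configured earlier.
def requeue_incomplete_audiences(db_connection, producer):
    cursor = db_connection.cursor()
    cursor.execute(
        "SELECT audience_id, client_id FROM audience_status "
        "WHERE status != 'complete'"
    )
    for audience_id, client_id in cursor.fetchall():
        # Rebuild the JSON task and put it back on the primary topic, so a
        # failure at hour 45 only costs the audiences that never finished.
        producer.send(
            "audience-creation-tasks",
            value={"audience_id": audience_id, "client_id": client_id},
        )

def publish_data_quality_check(producer, audience_id, facebook_audience_id):
    # Sent as soon as an audience is created, so quality problems surface
    # mid-run instead of only after the whole workflow finishes.
    producer.send(
        "audience-quality-checks",
        value={
            "audience_id": audience_id,
            "facebook_audience_id": facebook_audience_id,
        },
    )
```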

Once complete, the total workflow runtime came down to less than 16 hours. In addition, the application is far more resilient to failures, both internal and from Facebook. At the highest level, this was an interesting exploration into building a parallelized, scalable application architecture. If you’re a developer looking to join a team working on interesting problems at the intersection of technology and marketing, be sure to check out Unified at https://unified.com/about/careers-and-culture!
