Designing & Deploying a Web SDK — Part One

Published in AppsFlyer Engineering, Apr 28, 2020

Intro

I’ve been a developer on AppsFlyer’s PBA (People-Based Attribution) team for the last couple of years. What we are trying to achieve is simple to explain but much harder to deliver. AppsFlyer is a big data company: we receive a hundred billion HTTP events daily, parse and analyze them, and present the results in dashboards. Pretty basic.

The PBA product’s goal is to make multi-platform attribution possible, essentially by attributing actions and conversions from all mobile applications and web platforms, regardless of the user’s platform (mobile, desktop, tablet, smart fridge or even a spaceship), to their respective marketing campaign or channel, if one such exists.

The mobile SDK has been AppsFlyer’s core product since the company’s first commit. With this ambitious new goal in mind, we needed web equivalents of our mobile attribution products: a web SDK, a way to intelligently and efficiently combine both data sources and analyze them into a meaningful dashboard, and raw data reports for our clients to use.

In this blog post, we will outline the journey of designing and building our web SDK, the challenges we encountered, and how we overcame them. This blog post has two parts, each with its own purpose.

Part One: how we overcame the challenges of writing an SDK that works and delivers the goals we set out to achieve. This part discusses how to design an SDK project and the pros and cons of each design option.
Part Two: building an automated deployment process with multiple dependencies and safety mechanisms to prevent deployment conflicts. This part discusses the CI/CD deployment process that we created for the project.

Getting Started

As we set out on our task, we soon found out that building an SDK from scratch is not as easy as one would think. Our top-level goal was to create a fully operational SDK project that each client (web application owner) can implement in their own web application to start sending events to AppsFlyer. On top of the business logic layer, we needed a project that supports our development needs and allows developers across teams to push new features to production safely.

Our main considerations in the design were:

Safety (the most important consideration): The SDK must not add latency to the page load or cause the page to break in any way. It should load asynchronously and send events only when the resources it needs are available.
Static snippet: The client-side snippet must be static and shouldn’t change. Making “in-house” changes to your services is one thing, but asking clients to change something on their end is another.
Fetching an SDK from the web application page demands some action from our clients. If we fail during the design phase, the new project will probably need a new snippet to support its features, which requires a change on the client side. When you have a huge number of clients that need to update their pages, it’s going to hurt.
Infrastructure: We should create an infrastructure that can support our traffic growth over the next couple of years, and every building block we use should be able to scale with it.
Browser support: We need to write an SDK that is as vanilla as possible, to make sure we support as many browsers as we can. Different browsers support different versions of JavaScript (ES versions), so we need to support as many of them as possible in order to receive data from a wide range of browsers.

With these goals in mind, once we rolled up our sleeves and got to work on the project, we quickly encountered quite a few challenges on this journey.

Lack of browser standards: If you have some experience as a web developer, you’re probably aware that there are so many browsers, versions, and features to support that building an SDK is a HARD task. Cookie generation and maintenance (e.g. SameSite and the Chrome 80 security updates), local storage usage, different ES versions to support: nothing behaves the same across all browsers.
Runtime and exceptions: As an SDK developer you must understand the philosophy of the project you are a part of: “You are not the owner of the page, you are a guest on the page”, and that changes everything. No delay to page load time or performance is acceptable and, even more important, you cannot cause the page to break in any way. This is not your home court, where you can make mistakes and take the blame yourself. Everything needs to run smoothly, leaving no room for error.
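
One way to honor the “guest on the page” rule is to wrap every SDK entry point so that an exception never reaches the host page. The following is only a minimal sketch: the guard helper and the track function are hypothetical names used for illustration, not part of AppsFlyer’s SDK.

```javascript
// Hypothetical helper: wraps SDK code so an exception never reaches the host page.
function guard(fn, onError) {
  return function guarded() {
    try {
      return fn.apply(this, arguments);
    } catch (err) {
      // Report internally if a reporter was supplied; never rethrow.
      if (typeof onError === "function") {
        try { onError(err); } catch (_) { /* reporting must not throw either */ }
      }
      return undefined;
    }
  };
}

// Example: a tracker that fails on bad input but cannot break the page.
const errors = [];
const track = guard(function (event) {
  if (!event || !event.name) throw new Error("invalid event");
  return "sent:" + event.name;
}, function (err) { errors.push(err.message); });

const ok = track({ name: "page_view" }); // normal path
const failed = track(null);              // error is swallowed and recorded
```

The trade-off is that failures become silent for the page owner, so an internal error reporter like the one passed in above is worth having.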

If all this isn’t enough, let’s add more spice or rather complexity.

At AppsFlyer, we have multiple products that demand a web SDK implementation that will work with HTTPS events coming from web browsers. One of them is our smart banners product (maintained by the engagement team). Each team has its own tasks, and each product has its own business logic to execute on the client’s web application.

The Building Blocks

Let’s get to know the building blocks that we need to assemble together to make the SDK work.

Snippet

The client-side snippet is a piece of code that we provide to our clients to set in their page header. It’s basically a small snippet of JS (JavaScript) code that, at page load time, calls our CDN (explained below), which serves the actual SDK code.

The snippet doesn’t do much, only the bare minimum, because it is embedded in the client’s page and you don’t want to change it too often.

Such a snippet will probably look like this:
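
As a minimal sketch of an async loader snippet: the CDN URL, the global name MY_SDK, and the installSnippet function below are assumptions made for illustration, not AppsFlyer’s actual snippet. It queues calls made before the SDK arrives and injects a script tag that does not block page parsing.

```javascript
// Hypothetical loader: the URL and the global name "MY_SDK" are illustrative
// assumptions, not AppsFlyer's real values.
function installSnippet(win, doc, sdkUrl) {
  // Queue calls made before the SDK has loaded so none are lost.
  win.MY_SDK = win.MY_SDK || function () {
    (win.MY_SDK.q = win.MY_SDK.q || []).push(arguments);
  };
  var script = doc.createElement("script");
  script.async = true; // never block page parsing
  script.src = sdkUrl;
  // Insert before the first script tag, which always exists on a page
  // that is running this snippet.
  var first = doc.getElementsByTagName("script")[0];
  first.parentNode.insertBefore(script, first);
  return script;
}

// In a browser, the snippet would run once in the page header.
if (typeof window !== "undefined" && typeof document !== "undefined") {
  installSnippet(window, document, "https://cdn.example.com/sdk.js");
}
```

Keeping the snippet this small is what allows it to stay static: everything that may change lives behind the URL.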

CDN — Content Delivery Network

We use Akamai’s CDN services that receive HTTPS calls as input and return our SDK as the response to our clients’ web applications.

Cloud storage bucket

It doesn’t matter whether it’s an AWS bucket, an Azure bucket, or any other cloud storage service: you must have a publicly hosted bucket to serve the SDK files, and the CDN must have access to those files.

SDK

The SDK essentially encapsulates the business logic layer, or how the application should behave as the client interacts with it. This can be anything from an API call that fetches and parses some data to applying CSS files after detecting the device type from the user agent.

Web event handler

The SDK ultimately triggers HTTP/HTTPS events to carry out its actions. You need a web handler that receives the events, parses them, and finally writes them to a database or a queue mechanism. Now you have the events and all the data you need to later analyze and calculate your business needs (such as calculating ROI or marketing spend).
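
The receive, parse, and write steps can be sketched as a single function. This is an illustration under stated assumptions: the event shape (name, payload), the status codes, and the array standing in for a queue are all hypothetical, not AppsFlyer’s actual handler.

```javascript
// Hypothetical event shape and queue; field names are illustrative assumptions.
function handleEvent(rawBody, queue, now) {
  let event;
  try {
    event = JSON.parse(rawBody);
  } catch (err) {
    return { status: 400, error: "malformed JSON" };
  }
  // Reject anything that is not an object with a non-empty event name.
  if (!event || typeof event.name !== "string" || event.name === "") {
    return { status: 400, error: "missing event name" };
  }
  // Write a normalized record to the queue for later analysis.
  queue.push({ name: event.name, payload: event.payload || {}, receivedAt: now });
  return { status: 202 };
}

const queue = [];
const accepted = handleEvent('{"name":"page_view","payload":{"path":"/"}}', queue, 1588032000000);
const rejected = handleEvent("not json", queue, 1588032000000);
```

Keeping the handler a pure function of its inputs makes it easy to put behind any HTTP server or queue client.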

Knowledgebase

Good documentation creates an excellent experience for the user. To ensure that your product is as usable as it is well engineered, make it a practice to document your product well and to update the documentation with every release. This makes it possible for users to troubleshoot common issues themselves, and takes these questions off already heavily burdened support staff. At AppsFlyer we take this responsibility seriously and have an entire awesome team dedicated to writing and updating our documentation frequently. Find it here.

Designing an SDK Project

Designing this kind of project is not easy. Each infrastructure change involves some contract interaction (through the legal department) or a client-side implementation (a snippet change). Ensuring that all of this comes together and works at 10X and ultimately 100X scale requires design thinking and investing time in good design practices from the very beginning.

Initial implementation

Our first “go at it” involved creating a stand-alone project for each team (if you recall, we have multiple in-house SDKs that communicate with each other), so that each team was responsible for its own SDK. This required each team to write and handle everything from scratch, from the business logic through the operational overhead (routes, ELB, cloud storage, etc.).

Eventually each project consisted of:

Stand alone applications
Fully isolated projects
A-Z responsibility
End-to-end operational responsibility

Design diagram:

This design worked for us for a while. It enabled greater autonomy and development velocity, and in a short period of time we were able to push new features to production. There were no dependencies on other teams, or concerns about breaking someone else’s service.

However, this approach didn’t fit our goals:

Different projects, different snippets: The snippet was not static. Each SDK needed its own snippet, while the browser’s resources are limited. The client might have to fetch multiple SDKs or packages, where each call is an HTTPS request that takes resources to execute and parse, increasing latency and load time.
Operational overhead: Each team had to manage its own dedicated resources: CDN, routes, HTTPS, and the code snippet on the client’s page that loads the SDK. All of these could have been served from shared infrastructure and resources, reducing operational overhead. This approach simply doesn’t support our growth.

Aside from the noted downsides of working in silos, this design eventually revealed a major flaw: it created numerous race conditions that undermined our safety design goal, which crossed a red line for us.

Consider the following scenario: two SDKs running on the same website, “SDK team blue” and “SDK team yellow”. Both SDKs have only one function: to set a cookie in the browser (as you might know, a cookie is saved under a domain and cannot be accessed from other domains). Each team needs to write its own team color as the value of the same cookie.

In the simplest scenario possible, a race condition was created.

If the yellow team’s SDK loads first, the setCookie function will create a cookie with a “yellow” value. If the blue team’s SDK loads next and writes the “blue” value, the cookie ends up with “blue” as its value.
Of course the opposite is also true: if team blue loads first and team yellow loads after, the end result will be different. Either way there will be a single record, whose value depends on the order in which the SDKs load.
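
The scenario can be reproduced in a few lines. In this sketch a Map stands in for the browser’s cookie store for one domain, and the cookie name "team" is an assumption chosen for illustration.

```javascript
// A Map stands in for the browser's cookie jar for one domain.
function makeSdk(color) {
  // Each SDK's only job: write its color under the shared cookie name.
  return function load(cookieJar) { cookieJar.set("team", color); };
}

// Loads the SDKs in the given order and returns the surviving cookie value.
function finalValue(loadOrder) {
  const cookieJar = new Map();
  loadOrder.forEach(function (sdk) { sdk(cookieJar); });
  return cookieJar.get("team");
}

const yellow = makeSdk("yellow");
const blue = makeSdk("blue");

const yellowFirst = finalValue([yellow, blue]); // blue overwrites yellow
const blueFirst = finalValue([blue, yellow]);   // yellow overwrites blue
```

Because script load order on a real page depends on network timing, neither team can know which value will survive.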

New project. New design.

At this point, the complexities we encountered and the downsides of the previous approach drove us to design a more modular system. This led us to a design built on a “base SDK”, shared by all teams, whose purpose is to manage the utils, the infrastructure, and any other shared resources that might arise.

The new design:

Base SDK with plugins
Each team develops their own plugins (for the business logic layer)
Shared infrastructure (routes, ELB and more)

The guiding principle is that the base SDK provides the safety measures that prevent the conflicts we encountered with siloed SDKs, while each team provides the relevant plugins with its business logic so that all scenarios are accounted for.
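
A base SDK with plugins might be sketched as follows. The API shape (createBaseSdk, registerPlugin) and the per-plugin key namespacing are assumptions for illustration: namespacing is just one possible safety measure, and the real base SDK may handle conflicts differently.

```javascript
// Hypothetical base SDK: names and API shape are illustrative assumptions.
function createBaseSdk(cookieJar) {
  const plugins = new Map();
  return {
    registerPlugin: function (name, init) {
      if (plugins.has(name)) throw new Error("plugin already registered: " + name);
      // Each plugin gets storage scoped to its own name, so two plugins
      // can never overwrite each other's values.
      const storage = {
        set: function (key, value) { cookieJar.set(name + "." + key, value); },
        get: function (key) { return cookieJar.get(name + "." + key); }
      };
      plugins.set(name, init(storage));
      return plugins.get(name);
    }
  };
}

const jar = new Map(); // stands in for the browser's cookie store
const sdk = createBaseSdk(jar);
sdk.registerPlugin("blue", function (storage) { storage.set("color", "blue"); });
sdk.registerPlugin("yellow", function (storage) { storage.set("color", "yellow"); });
```

With this arrangement both team values survive regardless of load order, because the shared utility owns the write path.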

The new design looked like this:

Benefits of the new design using one shared base SDK:

Safety: All teams now use the same utility functions to write, manage, update and delete the browser’s resources. The safe handling is implemented once, and every team uses these shared utils.
Browser resources: The number of HTTPS calls from our clients’ web applications was reduced to one, significantly reducing the load on the browser’s resources.
Operational overhead: The route from the client’s page will always be the same and will never change. If the client wants to add a plugin to enable new functionality, all they need to do is add a query param to the call.
Permutations: The design of a base SDK plus plugins for the teams’ business logic created the need for SDK permutation files that cover the different combinations, so that each client can get all the products they need in one HTTP call. For example, if there is a base SDK and two additional plugins, the following would be created in order to support all possible scenarios:
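
With a base SDK and two plugins there are four combinations, assuming the base SDK can also be served on its own: base alone, base with each plugin, and base with both. The sketch below enumerates them and maps a query param to the matching file. The plugin names, the "plugins" query param, and the file naming scheme are hypothetical.

```javascript
// Hypothetical naming scheme: "base" plus sorted plugin names joined by "+".
function permutationFiles(pluginNames) {
  const sorted = pluginNames.slice().sort();
  const files = [];
  // Each bitmask selects one subset of plugins (0 = base SDK only).
  for (let mask = 0; mask < (1 << sorted.length); mask++) {
    const chosen = sorted.filter(function (_, i) { return (mask & (1 << i)) !== 0; });
    files.push(["base"].concat(chosen).join("+") + ".js");
  }
  return files;
}

// Maps the plugins requested in a query param (e.g. "?plugins=pba,banners")
// to the single file that contains exactly that combination.
function fileForRequest(pluginsParam) {
  const requested = pluginsParam ? pluginsParam.split(",") : [];
  return ["base"].concat(requested.sort()).join("+") + ".js";
}

const files = permutationFiles(["banners", "pba"]);
const requestedFile = fileForRequest("pba,banners");
```

Sorting the plugin names gives every combination one canonical file name, so the order of names in the query param does not matter. The number of files doubles with each new plugin, which is why generating them automatically matters.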

The New SDK Implications

Now that we had reworked the design, we had most of our bases covered. However, one question remained unanswered: What about deployments (CI/CD)?

CI/CD is critical to the development process: by automating processes that could otherwise introduce conflicts between multiple users, it gives the development teams both freedom and safety. On top of that, it ensures that formerly manual operations are performed the same way every time, significantly reducing human error.

Our goal was to create a CI/CD flow with the following considerations in mind:

Create all the possible permutations automatically: Since the SDK is composed of a base SDK layer and plugins, all possible permutations need to be created, so the build process is triggered for all layers and plugins.
Cloud storage: We needed to create a pipeline that pushes all the permutations to our storage bucket.
Purge CDN cache: The CDN contains many caching layers that need to be refreshed in order to serve the latest SDK version. This requires a Docker command that runs via the CI/CD process, or it can be done manually through the UI.

In the next part of this blog post, I’ll describe how building a robust CI/CD process enabled us to create new business flows, work seamlessly with QA for testing, and make sure the master is never corrupted.

Conclusion

In summary, what was conceptually simple eventually became a complex project to design and engineer. Although we initially set out on this journey with standalone projects, we quickly learned that working together and sharing code with other teams improved our processes, reduced operational overhead, and provided a better experience for our users through improved speed and less code.

It also served to remind us that you shouldn’t be afraid to try, iterate, or fail fast. Many times the greatest learning is in the parts that didn’t work, which teach us how to do things better the next time around.