Towards On-device Offline Training Part 1

Abhishek
Published in Holler Developers
8 min read · Feb 2, 2022

AI Research at Holler Technologies

Introduction

What do we do?

At Holler, we aim to enrich your conversations. We do this by enhancing the text our users write with stickers and GIFs. To do this the right way, we need to suggest content in appropriate situations, with the intent that users share it during their chat. This leads us to an array of algorithms that make up our in-house Recommendation Decision System.

These algorithms can be divided into two major blocks: a Generalization block and a Personalization block. While Generalization aims at suggesting a stream of content appropriate for the situation, it doesn't account for user preferences. Personalization, on the other hand, aims at ranking the general stream and curating it for the individual user.

We aim to achieve this personalization on-device, which lets us produce results live and offline. There are a few building blocks that need to be understood before diving deep into the on-device models.

Next, we talk about the various sources of sticker content, the streams of content, two on-device storage data structures, and Interleaving (the first block of the algorithm that resides on the device and aims at personalization).

Sources

Branded Content

This is content created for specific campaigns with partnering brands. Although this content is directly tied to a revenue stream, we at Holler do not bombard users with so-called “ads”. Instead, branded content is interleaved with organic content to avoid brand saturation for the user.

Organic Content — Historical

This is organic content that the system has published before. We have a lot of user interaction data on this content, so server-side machine learning models can be applied to it, and it serves as data for the Generalization block.

Organic Content — Newly On-boarded

The creative team comes up with new content on a daily basis. The aim is to get this content to users as fast as possible and to garner as many reactions (shares) as possible. This helps in making an informed decision about what users want and how receptive they are to new content.

Streams

We define a list of content as a stream. There are two kinds of streams with respect to each device.

Incoming Stream

The stream of content suggested to the user after a situation is triggered is called an Incoming Stream. A very basic version of a stream that stores only the source of data looks something like this, where A == Branded and B == Organic:

A B A B A B A B A B

After the user has interacted with the Incoming Stream, a corresponding stream that goes out of the device is called the Outgoing Stream. Along with the source of content, this stream also stores information about the user interaction, namely view or share, and looks something like this:

Av Bv Av Bv Av Bv Av Bs Av Bv

Here the s associated with B suggests that organic content was shared during this interaction.
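A toy model of the two streams above can be sketched in a few lines; each item is a source letter plus an interaction flag, “v” for view and “s” for share. This representation is illustrative, not Holler's actual implementation:

```python
# Incoming stream: alternating branded (A) and organic (B) content.
incoming = ["A", "B"] * 5                     # A B A B A B A B A B

# The user viewed everything except the 8th item (a B), which was shared.
interactions = ["v"] * 7 + ["s"] + ["v"] * 2

# Outgoing stream: pair each source with its interaction flag.
outgoing = [src + act for src, act in zip(incoming, interactions)]
print(" ".join(outgoing))
# → Av Bv Av Bv Av Bv Av Bs Av Bv
```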

Storage

The next aim is to temporarily store user interactions on the device so that they can be used to make decisions instantaneously, without leaving the device. This means that this data storage must be memory-efficient (it should not strain the memory resources on the device), time-efficient (it should be readily available for instant recommendations), and should handle the cold-start problem.

Some streams worth storing could be:

  1. The source of content, using Count-Min sketch counters and bit vectors
  2. The content itself, using Bloom filters
  3. The message, using locality-sensitive hashing

Let's focus on the first point and shed more light on the two data structures, namely the bit vector and the Count-Min sketch.

Bit vector

We want to store information like the one we saw above, i.e., which source of content the user shared during their last interaction.

A bit vector is an array data structure that stores bits in a compact fashion. It is effective at determining the absence or presence of a particular entity. What we want is to toggle the values of the vector between 0 and 1 based on whether the user shared content from a particular source during their last interaction.

example of a bit-vector
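A minimal sketch of such an STM bit vector, one bit per content source (here index 0 = A/branded, index 1 = B/organic). The class and its method names are illustrative assumptions, not Holler's implementation:

```python
class BitVector:
    """Compact per-source flags packed into a single integer."""

    def __init__(self, size: int):
        self.size = size
        self.bits = 0                  # all bits start cleared

    def set(self, i: int) -> None:     # mark source i as shared
        self.bits |= 1 << i

    def clear_all(self) -> None:       # reset everything (a "wash-out")
        self.bits = 0

    def get(self, i: int) -> int:      # 1 if source i was shared last time
        return (self.bits >> i) & 1

# Outgoing stream "Av Bv Av Bv Av Bv Av Bs Av Bv": only B was shared.
stm = BitVector(2)
stm.set(1)                             # index 1 = B
print(stm.get(0), stm.get(1))          # → 0 1
```

Because only the most recent interaction matters, the whole structure fits in a single machine word, which is about as memory-efficient as on-device storage gets.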

Count-Min Sketch

What we want is the number of times a user shares content from a particular source. We could use a hash table as a simple counter and get away with it when we have only 3 sources of content for a short stream, but as this number increases, so will the space complexity of our hash table.

Keeping this in mind we implement a Count Min-sketch, which is a probabilistic data structure that stores a frequency table of events in a stream of data.

There are 3 steps involved in its creation and use,

  1. Initialization: a simple matrix (d × w) is initialized to 0, where w is the width, which can correspond to the number of content sources, while d is the depth and is equal to the number of hash functions.
  2. Increment: as the stream comes in, each element goes through the d hash functions, and the corresponding locations are incremented by one. This goes on till the end of the stream.
  3. Retrieval: we retrieve the count of each source, which can be used to inform our model and give a snapshot to the algorithm that uses it.
example of a Count Min-Sketch
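The three steps above can be sketched as follows. The salted-hash scheme and the parameters are illustrative; a real deployment would tune width and depth to its error bounds:

```python
import hashlib

class CountMinSketch:
    """Approximate frequency counts over a stream; estimates never under-count."""

    def __init__(self, width: int, depth: int):
        self.width, self.depth = width, depth
        # Step 1, Initialization: a d x w matrix of zeros.
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item: str):
        # One pseudo-independent hash per row, derived by salting with the row index.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode()).hexdigest()
            yield row, int(digest, 16) % self.width

    def increment(self, item: str) -> None:
        # Step 2, Increment: bump one cell in every row.
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def count(self, item: str) -> int:
        # Step 3, Retrieval: the minimum across rows, since collisions
        # can only ever inflate a cell, never deflate it.
        return min(self.table[row][col] for row, col in self._cells(item))

cms = CountMinSketch(width=16, depth=4)
for source in ["A", "B", "A", "A"]:
    cms.increment(source)
print(cms.count("A"))  # at least 3; exact unless every row collides
```

The key trade-off: the matrix size is fixed up front, so memory stays constant no matter how many sources or events arrive, at the cost of occasional over-counts.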

Interleaving

To interleave is to present results from various sources such that each source is given an equal opportunity to perform in front of users. For instance, sticker content at Holler comes from three major sources: branded (A), organic historical (B), and organic newly on-boarded (C). A very basic interleaving would present the results in equal fashion, like:

A B C A B C A B C

To improve upon this basic approach, there are two models we try to implement around the concept of interleaving.

Static Interleaving

This approach interleaves a ranked list of Historical content with fresh, newly onboarded content. It is static because the interleaving takes place in a fixed ratio of 2:1 and is given to all users. This ensures a chance for the newly on-boarded content to show up to the users instead of being lost at the end of the list.

The benefits of this approach are certain but limited. It works well because the count of newly on-boarded content is usually low compared to historical content.

For 2 newly on-boarded content items, static interleaving produces the stream below.

B B C B B C B B B
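The 2:1 static interleave can be sketched as below; the function name and fallback behavior (plain historical order once fresh content runs out) are illustrative assumptions:

```python
def static_interleave(historical: list, fresh: list, ratio: int = 2) -> list:
    """Slot one fresh item after every `ratio` historical items."""
    stream, remaining = [], list(fresh)
    for i, item in enumerate(historical, start=1):
        stream.append(item)
        if i % ratio == 0 and remaining:      # after every `ratio` B's,
            stream.append(remaining.pop(0))   # insert one C if any remain
    return stream

# 7 historical items, 2 newly on-boarded items, fixed 2:1 ratio.
print(" ".join(static_interleave(["B"] * 7, ["C"] * 2)))
# → B B C B B C B B B
```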

Once we back interleaving with storage and help it learn, we get a smarter version known as Dynamic Interleaving.

Dynamic Interleaving

We take the concept of Interleaving a step further by making it work on-device, making it personalized for the device user, and helping it learn.

Now, the task of Dynamic Interleaving is two-fold.

First, we want to shuffle the order of the sources while keeping the ratios intact, based on the very last response of the user. The logic is to give the user a sense of personalization, in that the suggestions immediately adapt to their preferences.

For instance, if we start with the first stream a particular user sees

A B A B A B A B A B

And if the user shares B, the outgoing stream looks like:

Av Bv Av Bv Av Bv Av Bs Av Bv

The shuffling component uses this information, and for the next incoming stream we switch the order, so it looks like:

B A B A B A B A B A B A

This favors what the user prefers, albeit temporarily.
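The shuffling step can be sketched as a rotation of the source order so the last-shared source (read from the STM bit vector) leads the next stream. Function names here are illustrative:

```python
def shuffle_sources(order: list, last_shared: str) -> list:
    """Rotate the source order so `last_shared` comes first; ratios are untouched."""
    i = order.index(last_shared)
    return order[i:] + order[:i]

def build_stream(order: list, length: int) -> list:
    """Repeat the source cycle to fill a stream of the given length."""
    return [order[i % len(order)] for i in range(length)]

# STM says the user last shared from source B.
order = shuffle_sources(["A", "B"], last_shared="B")
print(" ".join(build_stream(order, 12)))
# → B A B A B A B A B A B A
```

A rotation (rather than a full re-sort) is enough here, since only one bit of signal, the last share, drives the change.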

Second, we want to change the ratio of content from the sources based on the count of the user's short-term responses.

For the incoming stream

A B A B A B A B A B A B

If A is consistently preferred (for instance, the last 10 interactions show a 9:1 or 8:2 split in favor of A), then the next time the suggestion becomes

A A B A A B A A B A A B

and so on, with source flipping also taken into account.
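The ratio change can be sketched as follows. The threshold value and the simple doubling rule are assumptions for illustration; the SCM counts would in practice come from the Count-Min sketch:

```python
def weighted_stream(counts: dict, length: int, threshold: int = 5) -> list:
    """Double a source's share of the stream once its SCM count crosses the threshold."""
    weights = {src: (2 if c >= threshold else 1) for src, c in counts.items()}
    # Build one cycle of the stream, e.g. {"A": 2, "B": 1} -> A A B.
    cycle = [src for src in counts for _ in range(weights[src])]
    return [cycle[i % len(cycle)] for i in range(length)]

# SCM snapshot: A shared 9 times, B once; A crosses the threshold.
print(" ".join(weighted_stream({"A": 9, "B": 1}, 12)))
# → A A B A A B A A B A A B
```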

Dynamic Interleaving for Sources of Content.

Let us dive into the 4-step process:

  1. STM (short-term memory) storage, using a bit vector
  2. Shuffling component
  3. SCM (short count memory) storage, using a Count-Min sketch
  4. Preference weightage

Example : For three sources A, B, C

(This is a snapshot of the results stored in the two data structures, not the data structures themselves.)

STM == A:0,B:0,C:0

SCM == A:0,B:0,C:0

IN_stream_1 (this is what the user is given when they ask for a recommendation.)

A,B,C,A,B,C,A,B,C,A,B,C

OUT_stream_1: this is what the STREAMS component on-device receives after the user has interacted with IN_stream_1.

Av, Bv, Cv, Av, Bs, Cv, Av, Bv, Cv, Av, Bv, Cv

STM == A:0,B:1,C:0

SCM == A:0,B:1,C:0

IN_stream_2:

A,B,C,A,B,C,A,B,C,A,B,C

It checks the STM and SCM:

  1. The shuffling component activates and reshuffles the stream:

B, A, C, B, A, C, B, A, C, B, A, C. The STM is then washed out.

  2. Preference weightage remains inactive, since it waits for a certain threshold to be reached before making a decision.
  3. Assuming the threshold for the SCM is 5, once a single source hits 5 (say A), the ratio of that source is doubled.

For STM == A:1,B:0,C:0 and SCM == A:5,B:1,C:1, the stream would look like

A, A, B, C, A, A, B, C, B, C, B, C

Wash-out: both STM and SCM run the risk of being erased or wiped out for one reason or another. STM is agnostic to this behavior, since only the last interaction matters. The SCM, however, can use a call-back: at the end of the day each device reports its summed values, which can be refreshed and resent to the device in the event of an SCM wash-out.

Thus, with the help of the shuffling component and preference weightage, a unique combination of order and ratio is achieved for each device. This personalization is achieved with all data residing on the device, creating self-sufficient on-device systems.

