Building the future of knowledge sharing — A closer look at Lunyr’s advertising system

Note: This information is outdated. Much of the system has been redesigned and improved.

Time is moving forward and so is the progress on Lunyr’s components. Today we’ll take a closer look at the Lunyr advertising system.

The Lunyr ad system uses powerful technologies like IPFS and Deep Learning. Like BitTorrent, IPFS uses a Distributed Hash Table as the underlying technology for decentralized data storage. Deep Learning is a branch of machine learning that uses neural networks with multiple layers (like your neocortex) to learn hierarchical representations of data. You can do lots of cool stuff with deep learning, from image recognition, to simulating hallucinations, to magically applying an art style from a painting onto your photos. See this video for a presentation on some of the more philosophical aspects of neural networks.

So let’s get down to the design

What does it look like behind-the-scenes when advertisers participate?

Advertisers spend LUN to purchase impressions on pages that are matched with ads for relevance. They submit a quadruplet (A,K,B,G) where

  • A is a textual ad
  • K is a list of keywords they’d like to be associated with that don’t appear in A
  • B is the maximum amount they’d bid, in LUN per 1000 impressions.
  • G is the total budget for ads in LUN

Keep in mind that LUN are divisible up to 18 decimal places, so you can send 0.123456789123456789 LUN if you want. Advertisers call the LUN Pool contract on the blockchain, giving the hash of the ad+keywords as well as their bid, and they transfer their budget G of LUN to the LUN Pool. (Advertisers must purchase LUN to advertise on the Lunyr platform.) Advertisers also send (A,K,B,G) to the ad auction module, which cross-references the blockchain and then computes an ad rank.
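To make the submission step a little more concrete, here is a minimal Python sketch of the two things an advertiser would need to compute: the budget expressed in the smallest LUN unit, and a hash of the ad plus keywords. The post does not say which hash function or serialization is used, so SHA-256 over a JSON encoding is purely an assumption, and the names to_base_units and ad_commitment are made up for illustration.

```python
import hashlib
import json
from decimal import Decimal, localcontext

LUN_DECIMALS = 18  # from the post: LUN is divisible up to 18 decimal places


def to_base_units(amount_lun: str) -> int:
    """Convert a decimal LUN amount (passed as a string) into whole base units."""
    with localcontext() as ctx:
        ctx.prec = 60  # enough digits that large budgets are not rounded
        scaled = Decimal(amount_lun).scaleb(LUN_DECIMALS)
        if scaled != scaled.to_integral_value():
            raise ValueError("LUN supports at most 18 decimal places")
        return int(scaled)


def ad_commitment(ad_text: str, keywords: list[str]) -> str:
    """Hash of ad + keywords (ASSUMPTION: SHA-256 over canonical JSON)."""
    # Keywords are sorted so the same set always hashes the same way;
    # this ordering rule is also an assumption, not something the post states.
    payload = json.dumps(
        {"ad": ad_text, "keywords": sorted(keywords)},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Passing amounts as strings rather than floats avoids losing precision in the 18th decimal place, which a float cannot represent exactly.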

Using our word embedding model, we can associate each document (collection of words) with a vector whose distance from other document-vectors represents its semantic similarity to those documents, so we can define a function relevance(doc1, doc2) that returns a number indicating the relevance. We can use this function on A concatenated with K (A | K) and each web page to get a relevance score. We then combine this relevance score with B to determine the rank of the ad.
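As a rough illustration of the relevance function, the sketch below works on vectors that have already been produced by the embedding model. Cosine similarity is one common choice for comparing such vectors, but the post does not name the metric, so treat it as an assumption. The post also does not say how relevance and the bid B are combined, so the multiplication in ad_rank is a hypothetical stand-in.

```python
import math


def relevance(vec1, vec2):
    """Cosine similarity between two document vectors (assumed metric)."""
    dot = sum(a * b for a, b in zip(vec1, vec2))
    norm = math.sqrt(sum(a * a for a in vec1)) * math.sqrt(sum(b * b for b in vec2))
    return dot / norm if norm else 0.0


def ad_rank(bid, relevance_score):
    """Combine bid and relevance (ASSUMPTION: a simple product)."""
    # Negative similarity is clamped to zero so an irrelevant ad cannot
    # end up with a negative rank.
    return bid * max(relevance_score, 0.0)
```

Here vec1 would be the vector for A | K and vec2 the vector for a web page; an ad that points in the same direction as the page scores close to 1, and an unrelated ad scores close to 0.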

Impression price = (the amount your nearest lower competitor pays / your quality score) + a small number.
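Reading the formula as a division followed by a small increment, it can be sketched in a few lines of Python. The size of the "small number" is not given in the post, so the epsilon default below is a placeholder, and the function and parameter names are hypothetical.

```python
def impression_price(competitor_price, quality_score, epsilon=0.01):
    """Price = (nearest lower competitor's price / quality score) + a small number.

    epsilon stands in for the unspecified "small number" (assumed value).
    """
    if quality_score <= 0:
        raise ValueError("quality score must be positive")
    return competitor_price / quality_score + epsilon
```

One consequence of this shape is that a higher quality score lowers the price: with the same competitor, an ad with quality score 1.0 pays less than an ad with quality score 0.5.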

The content pages will have JavaScript that pulls the latest ad for every impression. The JavaScript will also call out to the ad layer to report clicks on ads. The ad layer will be running analytics that advertisers can view to understand their ad performance. When the advertiser’s budget for impressions has been exhausted, the ad is no longer served. Ad-ranks are re-evaluated regularly.

How do we update content?

IPFS has a name service, IPNS, similar to DNS. Basically it inserts a layer of indirection between DNS and IPFS, allowing us to make a permanent DNS record pointing to our IPFS node’s id.

And then we can have an association between our node’s id and the latest content, which we can update very easily:

ipfs name publish <new content hash>

When anybody requests our node’s id, IPFS will automatically search for the content hash associated with that name that has the largest sequence number (i.e. the latest).
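The "largest sequence number wins" rule can be pictured with a toy model. This is not how IPFS is implemented internally; it is only a sketch of the selection logic, and the NameRecord type and the placeholder hashes are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NameRecord:
    sequence: int       # increases each time the name is republished
    content_hash: str   # the content the name pointed to at that sequence


def resolve_latest(records):
    """Return the content hash of the record with the largest sequence number."""
    if not records:
        raise LookupError("no records published for this name")
    return max(records, key=lambda record: record.sequence).content_hash
```

So if a resolver has seen records with sequence numbers 1, 3 and 2 for the same name, it picks the one with sequence number 3, regardless of the order in which they arrived.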

What is word embedding and why do we want to use it?

The idea behind word embedding is that words that are used in similar contexts probably have similar meaning, so if we train a neural network to recognize when words are in context and out of context, then that network will encode a lot of semantic information. The reason this works is that the notion of context is really flexible, and simply represents what the geometry of the media-vector-space *should* look like locally. So if we have known matches of context (via peer review), and it includes images relating to text, then we could train another neural network to associate vectors with images that are close to the word vectors for words that are in context and far from word vectors that are out of context. We can do this with any media. We can even do it with other languages by taking known pairs of synonymous words and treating that as the notion of in-context.

In a nutshell, this technique is both state-of-the-art and very flexible. It was developed at Google as a marriage of old Natural Language Processing (NLP) ideas with new neural network ideas. It has been demonstrated that doc2vec, which extends word embedding from single words to whole documents, does very well at identifying duplicate questions in Q&A forums. This is essentially what we want to do: determine the similarity between bodies of text.

How do we use word embeddings?

We have a database of each document, its IPFS hash, its last hash (documents are edited), its latest vector, and the model version that produced the vector (the model may be retrained). Additionally, we have an R-tree (or something like it), which is a way of storing a large number of vectors in a hierarchy of rectangles in such a way that it makes it fast to look up the nearest neighbors to a given vector. We compute the vector corresponding to the text ad, and then use the R-tree to look up the nearest N neighbors to that vector. We then look up the document hashes corresponding to those vectors in the database. This gives us a list of the N most relevant pages for that ad, which we can sort by distance from the ad vector. The quality of an (ad, document) pair is essentially how close together their vectors are in this vector space. This then gets combined with the bid amount for that ad and the bids for other nearby ads to determine the ad-rank. We store the word embedding model on IPFS so anyone who wants to audit the process may do so. We periodically recompute the vectors for the documents to account for changing content.
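The lookup step can be sketched without any spatial index at all. The version below swaps the R-tree for a plain brute-force scan, which gives the same answer but is only practical for small collections; the point of the R-tree is to avoid visiting every vector. The dictionary of document hashes to vectors stands in for the database, and all names are hypothetical.

```python
import heapq
import math


def nearest_documents(ad_vector, doc_vectors, n):
    """Return the n closest (distance, document_hash) pairs, nearest first.

    doc_vectors maps a document's IPFS hash to its latest vector.
    Brute force: every document is compared against the ad vector.
    """
    scored = (
        (math.dist(ad_vector, vector), doc_hash)
        for doc_hash, vector in doc_vectors.items()
    )
    return heapq.nsmallest(n, scored)
```

The returned distances are what the text calls the quality of each (ad, document) pair: the smaller the distance, the better the match.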

How do the ads get served?

Each content page has some JavaScript that calls out to the ad layer, which tells it which ads to serve. The JavaScript sends the ad layer the IPFS hash of the content being served, and the ad layer matches this with the latest results from the Ad Repository. If the ad is clicked, the JavaScript reports this back to the Ad Performance Module. It also reports that the page was served, so the Ad Performance Module can track click-through rate (clicks / impressions) and present this to advertisers.

What is the Ad Repository?

When the ad-rank is determined, several quadruplets of the form (ad-text, document hash, ad-rank, timestamp) are created. These are stored in a database, and when a document is served, its JavaScript pulls the latest N ads for that document and displays them.
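A small in-memory stand-in shows the shape of the data and the "latest N ads" query. A real repository would sit on a database; the class and method names here are hypothetical, and "latest" is interpreted as "most recent timestamp", which is an assumption.

```python
from typing import NamedTuple


class AdEntry(NamedTuple):
    ad_text: str
    document_hash: str
    ad_rank: float
    timestamp: float


class AdRepository:
    """Toy in-memory version of the Ad Repository."""

    def __init__(self):
        self._entries = []

    def add(self, entry):
        self._entries.append(entry)

    def latest_ads(self, document_hash, n):
        """Return up to n entries for a document, newest timestamp first."""
        matching = [e for e in self._entries if e.document_hash == document_hash]
        matching.sort(key=lambda e: e.timestamp, reverse=True)
        return matching[:n]
```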

What is the Ad Performance Module?

The ad performance module records impressions and clicks for every ad, so that advertisers can view the performance of their ads.
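The bookkeeping involved is simple enough to sketch with two counters. This is only an illustration of the click-through rate calculation (clicks divided by impressions); the class name, method names and ad identifiers are invented.

```python
from collections import Counter


class AdPerformance:
    """Toy tracker for impressions and clicks per ad."""

    def __init__(self):
        self.impressions = Counter()
        self.clicks = Counter()

    def record_impression(self, ad_id):
        self.impressions[ad_id] += 1

    def record_click(self, ad_id):
        self.clicks[ad_id] += 1

    def click_through_rate(self, ad_id):
        """Clicks / impressions, or 0.0 if the ad was never shown."""
        shown = self.impressions[ad_id]
        return self.clicks[ad_id] / shown if shown else 0.0
```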

What is the Ad Auction?

This is the engine that determines ad rank, so it interfaces with IPFS and can view the ads and bids that advertisers submit.

What is the LUN Pool?

The LUN Pool stores all the LUN that advertisers pay, along with newly created LUN. These tokens are distributed at the end of every pay period to Lunyr and the contributors in proportion to the CBN they earn.
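A proportional payout can be sketched as follows. Amounts are whole base units (LUN has 18 decimal places), and integer division is used so that no more than the pool is ever paid out. How rounding leftovers are handled and exactly how Lunyr's own share is determined are not described in the post, so here every recipient simply appears in the CBN mapping and any remainder is assumed to stay in the pool.

```python
def distribute(pool_units: int, cbn_earned: dict[str, int]) -> dict[str, int]:
    """Split pool_units among recipients in proportion to the CBN they earned."""
    total_cbn = sum(cbn_earned.values())
    if total_cbn == 0:
        return {name: 0 for name in cbn_earned}
    # Floor division: shares never add up to more than the pool.
    # ASSUMPTION: whatever is left over after rounding stays in the pool.
    return {
        name: pool_units * amount // total_cbn
        for name, amount in cbn_earned.items()
    }
```

For instance, with a pool of 1000 units and two recipients holding 3 and 1 CBN, the first would receive three quarters of the pool and the second one quarter.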

The overall system design will be revealed soon. Stay tuned for more details.
