Solution to the Scoble Problem — A Study in Stream Evolution

saipanyam

Jul 12, 2011

I read Rocky Agrawal’s post on the Scoble Problem and the rejoinder to it, “It’s not Scoble’s fault”. It is and will be an ongoing problem on any ‘stream’ (Facebook, Twitter, G+ etc.). Before laying out a possible solution, I would like to give a bit of background. Please indulge me.

I have some experience in dealing with this problem. Until recently I was an architect on the stream team at MySpace. What was lost in all the noise of MySpace’s fall from grace is the cool stuff that was done by the engineers there. Quite early in the design phase of ‘futura’ (the redesign of MySpace to my___), we were tasked with solving two major problems: Relevance and Stream clogging (now known as the Scoble effect!). These are still unsolved in the sense that there is no universal or elegant solution. There are solutions which achieve the results to varying degrees. It is still an open problem.

Before I lay out a proposed solution, let me talk a bit about Stream Evolution. Each evolutionary phase builds on top of the previous one. As in evolution, each phase was born out of a need to survive the changing landscape of user needs and adoption. The frustration of users is valid in the sense that we want our machines to anticipate and mimic our real life. Our online world should reflect our offline world. Circles on G+, Groups on Facebook and the follow model of Twitter are all different facets of human behavior (among others). From my perspective (Origin of Streams! :) ) every stream product follows an evolutionary path:

Chronological Stream (CS)

This is the classic stream solution, where we show the activities in reverse chronological order, newest first. As activities occur, we store them. Then, on a user pull, we display a paged view of the activities. This worked for a while, as the amount of data was manageable. We could fit all the information in a few ‘pages’ which could be traversed easily. This was a single-dimension solution: the only axis was time.
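As a minimal sketch of the pull-and-page idea described above (the `Activity` fields, the `chronological_page` helper and the sample data are illustrative assumptions, not anything taken from a real stream product):

```python
from dataclasses import dataclass


@dataclass
class Activity:
    user: str
    text: str
    timestamp: int  # illustrative: seconds since some epoch


def chronological_page(activities, page, page_size=2):
    """Return one page of activities, newest first (hypothetical helper)."""
    ordered = sorted(activities, key=lambda a: a.timestamp, reverse=True)
    start = page * page_size
    return ordered[start:start + page_size]


activities = [
    Activity("alice", "posted a photo", 100),
    Activity("bob", "shared a link", 300),
    Activity("carol", "updated status", 200),
]

# The user "pulls" page 0 and gets the two newest activities.
print([a.user for a in chronological_page(activities, page=0)])
```

Time is the only sort key here, which is what makes this a single-dimension stream.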

Categorized Chronological Stream (CCS)

CCS came about when the number of activities increased by an order of magnitude. Since humans have a limited capacity to consume at one go, CCS attempted to divide the stream into smaller pieces. It was an answer to the question: “How do you eat an elephant?” Answer: in small pieces. But the need was for the stream to be categorized rather than merely partitioned by date and time. So a solution was devised to categorize based on an extra dimension of ‘friend lists/groups/circles’. The user is given an option to filter the stream by groups of friends. This gave them the ability to consume only those parts which interested them at that particular moment in time.

Bucketed Chronological Stream (BCS)

BCS was built on top of the previous two by providing the ability to consume by ‘type’. It evolved to satisfy the need to consume in three dimensions: Date, Groups and Buckets (in this context, buckets refer to media types). For example, a user might want to see all videos shared by friends (all or a select few), or see only photos, events etc. The need and the forcing function here was that the user knew what type of content to consume, but was not sure when it was shared or who shared it in the first place. It enabled discovery for the first time.
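The two extra dimensions added by CCS and BCS (friend group and media type) can be sketched as optional filters on top of the chronological sort. This is only an illustration; `Activity`, `filter_stream` and the sample data are assumed names:

```python
from dataclasses import dataclass


@dataclass
class Activity:
    user: str
    media_type: str  # the "bucket": photo, video, event, ...
    timestamp: int


def filter_stream(activities, group=None, media_type=None):
    """Filter by friend group and/or bucket, then sort newest first.

    group: a set of user names, or None for "all friends".
    media_type: a bucket name, or None for "all types".
    """
    selected = [
        a for a in activities
        if (group is None or a.user in group)
        and (media_type is None or a.media_type == media_type)
    ]
    return sorted(selected, key=lambda a: a.timestamp, reverse=True)


activities = [
    Activity("alice", "photo", 100),
    Activity("bob", "video", 300),
    Activity("carol", "video", 200),
    Activity("alice", "video", 400),
]
close_friends = {"alice", "carol"}

# Videos from close friends only, newest first.
videos = filter_stream(activities, group=close_friends, media_type="video")
print([(a.user, a.timestamp) for a in videos])
```

Leaving both filters as `None` falls back to the plain chronological stream, which matches the idea that each phase builds on the previous one.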

Real Time Stream (RTS)

RTS was the answer to the problem of immediacy. Users wanted their data now! It was a push-based model instead of a pull-based one. Yes, this evolutionary phase incorporated all the advantages of the previous phases. Now the stream was a fast flowing stream, a flood! The innovation here was only in how to scale the infrastructure. Each solution in this phase was a variation on scaling for more and more data. Twitter’s scaling problems come to mind. But more importantly, users were not satisfied. They had this nagging feeling that they might have missed something interesting. So the solution joined the advantages of all the above phases.

Aggregated Stream (AS)

AS was an answer to RTS’s inherent weakness of missing something important. The solution was to aggregate activities by type so that we can fit more data in a small space. The evolution of thumbnails and profile pics is a manifestation of this. This enabled one to increase the amount of data per screen item. The solution was limited only by the amount of data that we could aggregate in a specific time window. If a user uploaded 20 photos, we don’t need to show 20 single items; we can show them as one item. The item itself contains jump-off points for the more interested consumer. Aggregation is not trivial, and it is often confused with collection. A collection is a special case of aggregation. Aggregation can span multiple dimensions, for example Multiple friends + Same type, Same user + Multiple types etc. Adding the dimension of date adds more complexity to it.
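A rough sketch of the simplest case above (same user + same type inside one time window) might look like this. The fixed-size window buckets, the `aggregate` function and the tuple layout are simplifying assumptions for illustration; a production system would likely use sliding windows and more dimensions:

```python
from collections import defaultdict


def aggregate(activities, window=3600):
    """Collapse activities into one item per (user, type, time window).

    activities: list of (user, media_type, timestamp) tuples.
    window: bucket size in seconds (assumed fixed, non-overlapping buckets).
    """
    groups = defaultdict(list)
    for user, media_type, ts in activities:
        groups[(user, media_type, ts // window)].append(ts)

    items = [
        {"user": user, "type": media_type,
         "count": len(stamps), "latest": max(stamps)}
        for (user, media_type, _), stamps in groups.items()
    ]
    # Newest aggregated item first.
    return sorted(items, key=lambda item: item["latest"], reverse=True)


# Alice uploads 20 photos within an hour; Bob shares one video.
activities = [("alice", "photo", ts) for ts in range(20)]
activities.append(("bob", "video", 50))

for item in aggregate(activities):
    print(item)
```

The 20 uploads collapse into a single stream item with a count of 20, which could then act as the jump-off point to the individual photos.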

Asynchronous Interaction Activated Stream (AIAS)

AIAS can be considered a parallel evolution. AIAS solutions are based on the premise that any interaction that occurs on a piece of content in the stream should resurface that content to the top, as it is fresh. The more interaction a piece of content gets, the more it can be thought of as new content. Here the content that is added is the interaction event itself. It is asynchronous, as interaction can occur out of band. Different users will see the same piece of content at different positions in the sort order. The premise is also that if a piece of content is interacted with the most, then it must be relevant to all users. AIAS works only when there is a fast flowing stream; otherwise you would see duplication in the same view.
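One simple way to picture the resurfacing described above is to sort by the time of the latest activity on an item rather than by its creation time. The sketch below assumes that interpretation; `resurfaced_order` and the sample identifiers are hypothetical:

```python
def resurfaced_order(items, interactions):
    """Order item ids by most recent activity (creation or interaction).

    items: dict mapping item id -> creation timestamp.
    interactions: list of (item_id, timestamp) events such as comments
                  or likes, arriving out of band.
    """
    last_active = dict(items)
    for item_id, ts in interactions:
        if item_id in last_active:
            last_active[item_id] = max(last_active[item_id], ts)
    return sorted(last_active, key=last_active.get, reverse=True)


items = {"old_photo": 10, "new_post": 50}
interactions = [("old_photo", 60)]  # a late comment on the old photo

print(resurfaced_order(items, interactions))
```

Without the comment, `new_post` would lead the stream; the interaction event bumps `old_photo` back to the top.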

So far we have seen how we moved from the Caveman phase (Chronological) to the Agricultural phase (Categorized, Bucketed), to the Industrial phase (Real Time), to the Information phase (Aggregated, Asynchronous Interaction). We now examine the Digital phase of Relevant Stream and Diversity Infused Relevant Stream.

Relevant Stream (RS)

Relevant stream was an answer to the exponential data rates that we are encountering. None of the previous evolutionary adaptations could handle this. Even with better infrastructure and more hardware, it was realized that they had no chance of survival. As more and more companies were competing, the need arose to differentiate. My contribution was to come up with a relevance algorithm for real time streams. Since it is patented (Inventor: Me, Owned By: MySpace), I can only discuss the high level heuristics of the solutions that belong to this family.

Essentially any relevance solution consists of tracking a bunch of signals, normalizing them and folding the resulting values into a relevance metric. This relevance measure can then be used to sort the list of activities. Then we take the top n of the list and display it to the user. We can combine the features of the previous evolutionary phases to improve the throughput. There are as many implementations of this approach as there are companies. One more key heuristic is the use of serendipity. We introduce a random element to promote discovery. Otherwise, we will have the case of “the rich getting richer and the poor getting poorer”. We would never discover new items. The same old people keep showing up. The Scoble effect is a prime example of this. Virality comes into play because of serendipity. In fact, nature by and large is a successful system because it includes serendipity as a fundamental design principle. Relevance at first blush is a great solution. You would find an initial improvement in user satisfaction, but as more prolific/enterprising users ‘discover’ or ‘uncover’ the signals/algorithms, it becomes prone to gaming. If we can determine the signals, we can simulate behavior that games the relevance algorithm. We see examples of this in nature too: insects having ‘big eyes’ on their backs to scare potential predators, camouflage etc. So no system is impenetrable or opaque. At best we can have a first mover’s advantage. A system that doesn’t evolve will soon be extinct.
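To make the signal-folding and serendipity heuristics above concrete, here is a generic sketch. It is not the patented algorithm; the signal names (`likes`, `comments`), the min-max normalization, the linear weighting and the single random "serendipity slot" are all assumptions chosen for illustration:

```python
import random


def normalize(values):
    """Min-max normalize a list of numbers into the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def rank(items, weights, n, serendipity_slots=1, rng=None):
    """Return n item ids: top scorers plus a few random picks.

    items: list of dicts with an 'id' and one numeric value per signal.
    weights: dict mapping signal name -> weight (hypothetical signals).
    """
    rng = rng or random.Random()
    scores = [0.0] * len(items)
    for signal, weight in weights.items():
        normalized = normalize([item[signal] for item in items])
        for i, value in enumerate(normalized):
            scores[i] += weight * value

    pairs = sorted(zip(scores, items), key=lambda p: p[0], reverse=True)
    ordered = [item["id"] for _, item in pairs]

    cut = max(n - serendipity_slots, 0)
    top, rest = ordered[:cut], ordered[cut:]
    # Serendipity: fill the remaining slots at random from lower-ranked items.
    lucky = rng.sample(rest, min(n - cut, len(rest)))
    return top + lucky


items = [
    {"id": "a", "likes": 10, "comments": 5},
    {"id": "b", "likes": 0, "comments": 0},
    {"id": "c", "likes": 5, "comments": 1},
    {"id": "d", "likes": 8, "comments": 4},
]
weights = {"likes": 0.5, "comments": 0.5}

print(rank(items, weights, n=3, rng=random.Random(0)))
```

The random slot is what gives a low-scoring item a chance to be seen; without it, the ordering is fully determined by the signals and the same items keep winning.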

Diversity Infused Relevant Stream (DIRS)

DIRS came about to make relevance and serendipity more effective. I might want to see at least some of the activities of my less frequent friends, rather than more from a prolific friend. I would trade the excess of something for what I have less of. In the case of the Scoble effect (Stream clogging), I neither want to turn Scoble off nor keep him fully on. I want just enough! Now, just enough is not something that computers understand. We need to map something that is fuzzy to binary. There are of course many variations on this theme. There could be a facility to mute or a volume control. But it soon becomes a pain to manage. We don’t have any visual cues to help us. It is always trial and error, and that makes a lot of people uncomfortable. I went with a simpler guiding principle. As with the relevance patent, I cannot discuss the entire solution, but I can share the guiding principle behind diversity algorithms, which is in the public domain.

Definitions:

Let ui and uj be two users connected to user C. Define Candidate(U, p) as the number of items from user U in pool p. Define Count(U) as the number of items from user U displayed in C’s stream.

Guiding Principle:

Stream Diversity: if Candidate(ui, p) > 0 and Count(ui) = 0, then Count(uj) ≤ k until Count(ui) ≥ 1.

Outcome:

If there is an item from ui in the pool, then we should show no more than k items from uj unless we also show at least one item from ui.
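The outcome above can be sketched as a greedy selection over a relevance-sorted pool. This is only one possible reading of the principle, not the full solution; `diversify`, its parameters and the sample data are hypothetical:

```python
def diversify(ranked, n, k):
    """Pick up to n items while honoring the stream diversity principle.

    ranked: list of (user, item) pairs, most relevant first (the pool p).
    k: maximum items shown from one user while any other user who has
       candidates in the pool still has nothing shown.
    """
    if k < 1:
        raise ValueError("k must be at least 1")

    remaining = list(ranked)
    shown = {user: 0 for user, _ in ranked}  # Count(U) for every user in the pool
    stream = []

    while remaining and len(stream) < n:
        # Is there a user with Candidate > 0 and Count == 0?
        unseen_exists = any(count == 0 for count in shown.values())
        for index, (user, item) in enumerate(remaining):
            # Take the best candidate that does not break the cap.
            if shown[user] < k or not unseen_exists:
                stream.append((user, item))
                shown[user] += 1
                del remaining[index]
                break
    return stream


ranked = [
    ("scoble", "s1"), ("scoble", "s2"), ("scoble", "s3"),
    ("scoble", "s4"), ("amy", "a1"),
]

print(diversify(ranked, n=4, k=2))
```

With k = 2, the prolific user gets two slots, then the quieter friend is shown once, and only after that is the cap lifted. Scoble is neither turned off nor left fully on.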

This diversity has been shown to increase ‘perceived’ relevance. Diversity underpins system level robustness, allowing for multiple responses to external shocks and internal adaptations; it provides the seeds for large events by creating outliers that fuel tipping points; it drives novelty and innovation.

We use an algorithmic way to simulate the real world. Like all simulations, it is only an approximation of reality. If we look hard we can perceive cracks where reality seeps in. The famous Red Pill or Blue Pill scene from The Matrix is a good metaphor.

Improvements in these areas are possible by increasing diversity. Here too we would reach an asymptote soon. We also need to watch out for diversity turning into randomness and chaos. The good news is we haven’t reached the plateau yet. More can be done in this area. My small contribution notwithstanding, I believe we will see more innovation in this direction.

Situation as it stands today

Most, if not all, streams are at this point in evolution. But I could be wrong. I am not privy to the inner workings of companies. My observation is based on what I see, from which I guesstimate the evolutionary phase. So what else can we do to trigger the next evolutionary phase? Is there a theoretical limit to what can be achieved? Can a truly relevant/diverse system be designed? Can the holy grail of absolute relevance be achieved?

These are the questions that I struggled with. Then inspiration hit me! Nature abounds with solutions to hard problems. How can small insects like ants and bees be such successful species? They perform complex tasks which are out of proportion to their brain capacity. Of course this is nothing new. There is ongoing research in Nature Inspired Computing, Artificial Intelligence, Computational Intelligence, Biomimicry, Neural Nets etc.

What we need is to apply the insights from these fields to human consumption behavior. The nearest system I can think of is the foraging behavior of ants. If we feel too uncomfortable with that, let me say we are like bees: social, intelligent, and we actually do a dance to communicate (the waggle dance)!

If you want a taste of these algorithms, you can start reading my blog, Clever Algorithms in Python.

Though we have not seen the next evolutionary phase, we can at least describe the shape of what it would be. I would hasten to add that, as with biological evolution, extrapolation would be wildly inaccurate. For example, if we as observers went back in time to the age of the dinosaurs, would we predict the evolution of humans, or of more grotesque creatures straight out of an Alien movie?

All is not lost though. I can safely venture to say that we can predict the next nearest evolutionary step with a greater degree of accuracy than the shape of evolution many steps away.

So let us start. Please note this is only my opinion and like I said could be totally wrong.

Context Based Adaptive Stream (Contextual Stream)(CBAS, COS)

I would like to propose that the next evolutionary step would be the rise of Contextual Streams. Some key characteristics would be natural serendipity, redundancy, scale invariance, adaptivity and sense making. The idea is to provide a non-deterministic view to every user. Each user sees a highly customized view specific to them. Even more, this view is not deterministically customized by the user, but organically grown. Think snowflakes rather than manufactured toasters. Each action of the user simultaneously determines the future as well as the ‘past’. You might ask, how can the past change? Keep in mind, the ‘past’ is only a representation made by us. Individual facts cannot be contravened, but the way we combine them could produce new meaning in light of the present facts!

Without letting this degenerate into an Isaac Asimov narrative, I would like to conclude that a lot of emergent systems theory, complex adaptive systems and nature inspired algorithms will likely be a part of the solution in some form.