The Internet is HUGE, and search engines and social media show you only a tiny fraction of it.
Picture Kevin. He’s a small-town fellow who spent all of his 18 years on a ranch in rural California. He recently came up with an extremely delicious omelette recipe that’s been the talk of the town. He thinks the whole world will benefit from the recipe if he puts it online, and he hopes it will reach as many people as possible. Despite having come a long way in the last 3 decades, the Internet has yet to fulfill its hype of being an effective medium for sharing one’s ideas with the rest of the world.
Anyways, Kevin decides to publish his recipe on his personal blog, which is indexed by Google but gets only a few hundred visitors a month if he’s lucky. When people search for “omelette recipes” on Google, his recipe comes up only on page 9, and virtually everyone who makes the search picks another recipe higher up in the results.
All’s not lost yet — Kevin decides to share his recipe on Facebook. Even though Kevin has 200 friends there, only a small fraction of them see his post, and most of those are friends from his small town who already know about his recipe. As a result, his recipe fails to go viral.
Still persistent, Kevin posts a link to his recipe on Twitter. With only 100 followers, the result is similar — only about 20 of his followers see his post and only 2 of them click on it, and neither bothers to reshare it.
Kevin’s quite a tenacious fellow, and as a last resort he posts a link to his recipe on Reddit. He finds several subreddits that might receive his recipe well, like /r/recipes, and submits and crossposts the recipe to all of them. In most of them, his post gets modestly upvoted, usually between 1 and 10 points. In a few, it gets downvoted into obscurity. One even bans him for self-promotion. Still, he manages to get 300 visitors to the blog post with his omelette recipe over the course of 5 days before this surge of traffic dies down completely.
Still refusing to give up, Kevin explores his remaining options. He optimizes his blog post for SEO and manages to climb from page 9 to page 3 on Google, which brings him 1–2 extra visitors every day. He considers advertising to bring more visitors to his recipe but decides against it since, as an 18-year-old, he doesn’t have much money to spend, especially on a blog post that isn’t even monetized.
Around the same time, Cathy in New York is searching for new omelette recipes to feed her family. She has tried nearly all of the recipes on the first page of Google, wasn’t impressed with any of them, and has decided to start cracking the 2nd page. It would be quite some time before she even reaches Kevin’s recipe sitting on the 3rd page, assuming it maintains its ranking.
Also around this time, Sanjay in India is searching for new recipes on the subreddit /r/recipes and comes across Kevin’s post. He finds neither the title of the post nor the 5 meager upvotes it has garnered very attractive and doesn’t bother clicking to see what it’s about.
Cathy and Sanjay represent the frustrations that so many Internet users face: searching for information on the Internet, coming across only mediocre results on the 1st and 2nd pages of Google or Reddit, and seeing the same old links shared on social media. Both of them would have been delighted to come across Kevin’s omelette recipe. And Kevin’s frustrations represent what most content creators face when trying to get the word out, only to be drowned out by countless mediocre results on search engines and social media.
Don’t get me wrong — search engines and social media do have their place when it comes to content discovery on the web. But despite their ever-growing reach, the Internet clearly needs a new paradigm for connecting content creators with content consumers efficiently and effectively. Gimmeserendipity aims to fill this gap that the search and social media worlds have overlooked.
Limitations of Search Engines
Search engines, despite their imperfections, are probably your best option for finding specific information on the Internet and have been the de facto standard for navigating the web since its early days. In theory, you enter a word or phrase, and the engine spits out a list of what it deems the best results. “Best” is very subjective and depends on the search engine’s complex page-ranking algorithms. Usually the results are served in increments of about 10 per page. Most people will only look at the top few results and rarely venture past the first page.
So if you search for something rather broad like omelette recipes, there are probably far more than 10 pages in the world worthy of your attention, but the search engine can only fit 10 onto the first page of its results, and everyone who searches for omelette recipes will likely see the same 10. As a consequence, the top 10 results, determined by the search engine’s relatively arbitrary relevance algorithm, receive a disproportionate share of the search traffic while the rest lie in obscurity, even though many of them deserve the searcher’s attention. In other words, despite the vastness of the Internet, most of the world interested in a particular search phrase will only ever see the top 10 results and never bother to go past them. The 1st result might receive 1000 times as much traffic as the 11th result (which sits on the 2nd page) even though it probably isn’t 1000 times better.
What’s wrong with (most) social media?
Social media can be useful (and entertaining) at times, but using it as a way to find information is a bit backwards. Ever hear the saying that “who you know is far more important than what you know”? On social media, this maxim is truer than ever. Back in the Dark Ages, most people believed only what their friends, relatives, and others in their community believed. Finding information on social media, especially Facebook, means relying on social proof to find and judge the merits of content online, which is basically the 21st-century equivalent of how people researched information back in the Dark Ages.
Reddit is an exception, though, since it functions more like an anonymous forum. However, Reddit suffers from a similar issue as search engines: the submissions that reach its front page receive a disproportionate share of the traffic, even though plenty of submissions that never make the front page merit more attention.
I propose a Serendipitous Discovery Platform
No content discovery platform is perfect, not even the one I’m proposing, but I would like to contemplate a solution that addresses some of the common pitfalls plaguing popular search engines and social media. In other words, this platform should be resistant to all of the following drawbacks:
- Prone to clickbait: it should not rely on the user to “judge the book by its cover” and pick and click the entry with the catchiest title.
- Induce decision fatigue: similar to the above, the user shouldn’t have to take the extra step of picking a link to click on (in both search results and social media feeds.) Instead, he/she should be directly taken to the result that is deemed to be the best fit for him/her.
- Benefit the well-connected: Someone with 0 friends and followers should have as good a shot at promoting content on this platform as someone with millions of followers. In fact, I propose that the social media component of adding friends/followers not even be implemented.
Here’s how a simple Serendipitous Discovery Platform should operate:
- Every “round”, the user is presented with content and is given a chance to vote on whether he/she likes it.
- After casting a rating (or requesting a new piece of content without rating), the user is presented with another piece of content the next round.
- Over time, the machine learning algorithms behind the scenes figure out the user’s preferences and presents content every round that it predicts will receive a high rating from the user. The more ratings cast by the user, the better the algorithms become at serving content that the user likes.
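The round-based loop above can be sketched in a few lines of Python. Everything here is hypothetical: the topic catalog, the explore/exploit rule, and the `Recommender` class are simplified stand-ins for the real machine learning models.

```python
import random

class Recommender:
    """A toy stand-in for the platform's recommendation models.

    It tracks the user's 1-5 star ratings per topic and mostly serves
    the topic with the best average rating, with occasional exploration.
    """

    def __init__(self, catalog):
        self.catalog = catalog                  # topic -> list of items
        self.scores = {t: [] for t in catalog}  # topic -> ratings so far

    def serve_next(self):
        """Pick the content for this round."""
        unseen = [t for t in self.catalog if not self.scores[t]]
        if unseen or random.random() < 0.1:
            # Explore: try a topic the user hasn't rated yet (or a random one).
            topic = random.choice(unseen or list(self.catalog))
        else:
            # Exploit: serve the topic the user has rated best so far.
            topic = max(self.scores,
                        key=lambda t: sum(self.scores[t]) / len(self.scores[t]))
        return topic, random.choice(self.catalog[topic])

    def rate(self, topic, stars):
        """Record the user's rating; skipping a round just omits this call."""
        self.scores[topic].append(stars)

# Simulate a user who loves cooking content and dislikes everything else.
random.seed(0)
rec = Recommender({"cooking": ["omelette recipe"],
                   "finance": ["index fund primer"],
                   "travel":  ["packing tips"]})
for _ in range(100):
    topic, item = rec.serve_next()
    rec.rate(topic, 5 if topic == "cooking" else 1)

print({t: len(s) for t, s in rec.scores.items()})
```

After a handful of rounds, the loop converges on serving mostly cooking content. The real platform would replace the average-rating rule with trained models, but the round structure (serve, rate, update) is the same.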
Such a discovery platform isn’t anything new or revolutionary. Here are examples of services that use a platform similar to my Serendipitous Discovery Platform:
Pandora is a music discovery service that works by having you rate songs, albums, and artists. It plays a new song every “round” and offers the user the chance to rate it (thumbs up or down) or to skip it without casting a rating. The more songs, albums, and artists the user rates, the better its predictions become.
Youtube is a video hosting service that has a built-in search engine and social media features (where you can subscribe to a video creator.) I’m not talking about these here but instead, I’m focusing on the feature that kicks in every time a video finishes playing: at this point, Youtube will pick another video to automatically load based on the user’s likes and dislikes of previous videos he/she has seen.
I’m not a Tinder user since I was already happily married by the time I learned about Tinder, so what I say here is merely based on my limited impression of how Tinder works. You’re presented with the profile of a potential date every round and have to swipe to indicate whether you like or dislike that particular person before being presented with a new potential date the next round. The machine learning algorithms get better at figuring out your dating preferences the more potential dates you rate.
Stumbleupon was a website discovery engine that functioned much like Pandora except that it recommended new websites to the user every round rather than new songs. It has since been replaced by Mix, which functions more like a search engine and social media site (although Stumbleupon did have various social media features, like following other users to get updates on their activities.)
Of all of these discovery engines, Stumbleupon is closest to what I propose, although there are a few key differences:
- I go with a 1–5 star rating system, while Stumbleupon used a binary system (like or dislike.)
- I focus on a content-based recommendation engine while Stumbleupon focused more on a collaborative-filtering type of recommendation engine.
- I use separate engines for image submissions and website submissions, while Stumbleupon treated them all the same. Unlike with collaborative filtering, it isn’t technically feasible to treat images and websites the same when using a content-based filtering engine.
- I require the user to scroll all the way down before rating the website rather than leaving the rating UI at the top, so that the rating doesn’t merely reflect the top part of the website. The website appears in its own iframe and may have its own scrollbars if it’s really long, so this isn’t foolproof either, but at least it requires the user to look at more than just the top of the site before rating.
Now we’re getting old-school. Webrings were a popular way to promote websites when I first discovered the Internet back in the 1990s. A webring usually had a single theme like gaming, science, travel, or fitness, and all of its sites were linked together in a circular structure. Every site in the webring was required to display the webring’s UI somewhere on its homepage, with “previous”, “next”, and “random” buttons taking visitors to the previous or next site in the ring or to a site picked at random from within the ring.
Although webrings only supported a single category and didn’t use machine learning to provide recommendations, their mere existence shows that serendipitous discovery of content on the Internet has been a problem for decades.
Why Serendipitous Discovery?
Imagine entering a candy store as a kid and the owner presenting you with a piece of candy. You give that candy a try and tell him whether you liked it or not and he gives you another piece to try. Each subsequent piece of candy you receive becomes even more aligned with your tastes. However, if you already know the particular candy you’re looking for before arriving, this approach is less than ideal. But if you don’t know what’s out there in that store, this approach makes much more sense.
There are Known Knowns
To be honest, I was never a big fan of politics or of the Iraq War back in the early 2000s, but this Donald Rumsfeld quote has profoundly altered the way I understand the limitations of my knowledge:
Reports that say that something hasn’t happened are always interesting to me, because as we know, there are known knowns; there are things we know we know. We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns — the ones we don’t know we don’t know. And if one looks throughout the history of our country and other free countries, it is the latter category that tend to be the difficult ones.
Known knowns are truths that you know to be true (like there being 24 hours in a day.) Known unknowns are gaps in knowledge you’re aware of (for example, you might be aware that you know nothing about quantum physics.) Unknown knowns, although not mentioned in the quote, are things you think you “know” that are actually false, like believing the world is flat in the Dark Ages (an alternative interpretation is the things you instinctively know but aren’t aware that you know.) Finally, unknown unknowns are things you don’t know that you aren’t even aware of not knowing about.
Search engines are ideal for converting your known unknowns into known knowns, assuming there is an answer to the question you’re wondering about. For unknown unknowns and unknown knowns, a different approach is needed since you won’t be consciously searching for these answers. At first, social media might sound like it could fill this void, but upon deeper examination, it’s still less than ideal. First, the types of information and links you’ll find in your social media feed are often influenced by the preferences of the people you’re friends with or following. Second, one must actually click on the links in one’s feed, and we all know that familiar or sensational link titles receive more clicks.
Almost by definition, there is no obvious way for a user to discover information that falls in their domain of the unknown known or unknown unknown. However, there is a very easily implemented but flawed solution that I haven’t mentioned yet: a system that simply keeps a large database of website URLs and gives the user a URL at random each time. This, too, has obvious drawbacks since it fails to take into account an individual user’s preferences for different subjects (imagine being served 5 websites on French ballet when you’re really interested in racing cars.) So Gimmeserendipity will rely on the user’s repeated feedback in the form of ratings to determine his/her reading preferences.
Gimmeserendipity doesn’t use Collaborative Filtering (as of 9/2019)
Collaborative filtering is a commonly used algorithm for recommending new items (like books, videos, potential dating partners, etc.) by exploiting the correlations in how different users rated the same items: users who rated existing items similarly will likely rate new items similarly too. This is in contrast with content-based filtering, which recommends items whose features or content resemble those of the items the user rated highly. Most recommendation systems in practice use a combination of both but lean more towards collaborative filtering.
To understand what collaborative filtering accomplishes at a high level, picture yourself running a recommendation system and opening the master spreadsheet with all of the users on one axis and all of the items being recommended on the other, where each cell contains how a particular user rated a particular item. Of course, not every user has rated every item, so some cells will be blank. Collaborative filtering works to fill in estimates for all of the blank cells. Notice how, unlike with content-based filtering, I don’t need any domain knowledge of the problem at hand, be it recommending books, music, or pictures. Unfortunately, this won’t work too well for a new recommendation service with very limited users and rating data (known as the cold-start problem.)
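That spreadsheet-filling exercise can be illustrated with a toy matrix factorization, one common way to implement collaborative filtering. The ratings below are made up for this sketch, and zeros stand for the blank cells:

```python
import numpy as np

# Rows = users, columns = items; 0 marks a blank cell (no rating yet).
ratings = np.array([
    [5, 3, 0, 1],
    [4, 0, 0, 1],
    [1, 1, 0, 5],
    [0, 1, 5, 4],
], dtype=float)
observed = ratings > 0

rng = np.random.default_rng(0)
k = 2  # number of latent factors per user/item
U = rng.normal(scale=0.1, size=(ratings.shape[0], k))  # user factors
V = rng.normal(scale=0.1, size=(ratings.shape[1], k))  # item factors

# Regularized gradient descent on the observed cells only.
lr, reg = 0.01, 0.05
for _ in range(10000):
    err = np.where(observed, ratings - U @ V.T, 0.0)
    U += lr * (err @ V - reg * U)
    V += lr * (err.T @ U - reg * V)

# Every cell, including the formerly blank ones, now holds an estimate.
print(np.round(U @ V.T, 1))
```

Note how nothing in the code knows whether the items are books, songs, or omelette recipes; the estimates come purely from rating correlations, which is exactly why the method needs plenty of users and ratings to work.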
Gimmeserendipity currently relies on a content-based approach for making user recommendations, mainly to work around the cold-start problem. It has separate engines for recommending pictures and websites, which extract completely different features to feed into its machine learning models. These models are also periodically retrained every time the user has rated several new websites or pictures. I had considered adding collaborative filtering to the recommendation engine after receiving a decent number of ratings from the users who have signed up in recent months, but I’ve decided to hold off on it for now for one major reason: feedback loops.
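For contrast with the collaborative approach, here’s a minimal sketch of content-based filtering on websites. The bag-of-words features and the similarity-weighted rating below are drastically simplified stand-ins for what a real engine would extract; the example texts and ratings are invented.

```python
from collections import Counter
import math

def features(text):
    """Turn a page's text into a length-normalized bag-of-words vector."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {w: c / norm for w, c in counts.items()}

def similarity(a, b):
    """Cosine similarity between two sparse word vectors."""
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def predict_rating(candidate, rated):
    """Predict a 1-5 star rating for a new page as a similarity-weighted
    average of the user's past ratings."""
    cand = features(candidate)
    sims = [(similarity(cand, features(text)), stars) for text, stars in rated]
    total = sum(s for s, _ in sims)
    if total == 0:
        return 3.0  # no overlapping content: fall back to a neutral rating
    return sum(s * stars for s, stars in sims) / total

# The user loved a cooking page and disliked a finance page:
rated = [("fluffy omelette recipe with eggs butter and cheese", 5),
         ("stock market index funds and retirement finance", 1)]
print(predict_rating("easy cheese omelette recipe for breakfast", rated))
```

Because the prediction depends only on this user’s own ratings and the page’s content, it works from the very first rating, which is how a content-based engine sidesteps the cold-start problem.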
A Word about Feedback Loops, Echo Chambers, and Conspiracy Theories
The ugly side of collaborative filtering started making waves in the media in the late 2010s as services like Facebook, Twitter, Youtube, etc. were accused of fueling conspiracy theories through their collaborative-filtering-based recommendations. There are numerous theories on how collaborative filtering creates echo chambers through feedback loops, which, in the worst case, eventually leave one’s feed cluttered with conspiracy theories and other extremist content. Content-based filtering isn’t immune to this effect either, although it hasn’t been shown to exhibit it to the same degree. Creating an echo chamber is far from ideal, especially since it would also defeat my mission of making users aware of their unknown knowns and unknown unknowns. Feedback loops also tend to funnel a disproportionate share of traffic and attention to already popular and highly rated content, making it difficult for newly submitted content to reach a wide audience (hence Kevin won’t get much traffic to his newly submitted omelette recipe unless he’s really lucky, since the popular submissions are sucking up all of the visitors.) Until further research on the link between collaborative filtering and feedback loops becomes available, collaborative filtering will not be considered.
If you post something to the Internet, the whole world should be able to see it, right? But in practice, only a tiny fraction of the world actually sees it due to the limited and imperfect ways of finding and discovering content on the web. This quote from Kenny Daniel, cofounder of the AI startup Algorithmia, pretty much sums up this plight of the web (if you replace “research paper” with “website”):
The future is already invented; it just happens to be stuck in a research paper somewhere.
Likewise, I believe that ever since the Web’s creation in 1989, lots of magnificent ideas with tremendous potential to change the world for the better have been shared on the Internet but have failed to reach escape velocity out of their little corners. Gimmeserendipity’s mission is to uncover these ideas through a more meritocratic means of serving content to the world.