How to improve search, get better answers, and do less work.
Note: This was written in 2016 but never released. Recent updates to tab behavior in Google Chrome reminded me of this article, so I decided to publish it anyway. Enjoy!
I’d like to propose something a little out there. Got 5 minutes? I want you to volunteer your time and your computer’s processor to help make the internet a better place. With your help we can have more meaningful search results which will lead to finding the answers you want, faster.
Search is a very basic problem. You begin with a stack of data and need to find the piece that answers your question: the proverbial needle in a haystack. No matter your search domain, the issue is a classic quantity-versus-quality dilemma. The more options you have, the longer it takes to go through them. The more scattershot the articles are (let alone misleading or outright false), the longer it takes. That’s a noise problem. A smaller search domain doesn’t help either. Good luck finding the cause of a “runny nose” in pre-WWII medical documents, where the lingua franca was mostly German. Today, you can type the most malformed query imaginable and get a reasonable answer in a short time-frame.
The giant that is Google is built on solving this problem. And they are really good at it. And it is really expensive. Hundreds of thousands of servers across the world employ a software army, armed with algorithms, to visit myriad websites, take notes, follow references, and build a giant index that helps us harness the immense stack of data we call the internet.
A few years ago, it was estimated that roughly 140,000 new websites appeared every day. That number only goes up.
The robot army is not a bad solution. Search companies look at websites, figure out ways to make sense of the data, and then structure the programs (algorithms) to process the page to various levels of comprehension. Then they move on.
What if the protein army could help? I mean people, no one cares about your opinion, gym-bro. What if we harnessed the power of the crowd and what we know about network effects to add something to the magic sauce?
The web of data is connected on multiple levels. Most of that has to do with explicit links, where article A points to article B. The more important your website seems — as in, the more people pointing to your website, or the more credible the sources pointing to it (.edu, .gov, .org domains) — the greater the importance placed on your little node of the network. Simplified: if Harvard points to your tax advising business, you’re probably trustworthy. How do we know Harvard is trustworthy? Because the internet says it is.
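The link-importance idea described above is essentially PageRank. Here’s a minimal sketch in Python over a hypothetical three-site toy graph — the site names and the `pagerank` helper are invented for illustration, not anyone’s production algorithm:

```python
# Toy sketch of link-based importance (the idea behind PageRank).
# The graph is hypothetical: keys are sites, values are the sites they link to.
links = {
    "harvard.edu": ["tax-advisor.com"],
    "blog-a.com": ["tax-advisor.com", "harvard.edu"],
    "tax-advisor.com": ["harvard.edu"],
}

def pagerank(links, damping=0.85, iterations=50):
    # Collect every node that appears as a source or a target.
    nodes = set(links) | {t for targets in links.values() for t in targets}
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Everyone keeps a small baseline; the rest flows along links.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for source, targets in links.items():
            share = damping * rank[source] / len(targets)
            for t in targets:
                new[t] += share
        rank = new
    return rank

rank = pagerank(links)
```

After a few dozen iterations, the two sites that link to each other accumulate most of the rank, while the site nobody links to stays near the baseline floor — which is exactly the “Harvard vouches for you” effect in miniature.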
There’s a lot more that goes into determining the importance of a website which is why we have a whole industry around the topic of search engine optimization. Google is notorious for not sharing the recipe. The holy grail is that accurate and open content floats to the top when you search for life’s more challenging questions.
The rise of artificial intelligence promises a solution to this “validity” question. It’s able to make sense of the article and ascribe meaning in context of the vast amount of data out there. It doesn’t need thousands of people saying the website is valid. It can figure that out for itself.
The problem is, if indexing the web was expensive, running machine learning on every website is vastly more so. You can’t even ballpark a Big-O estimate, because the practical limit of artificial intelligence is just how far you’re willing to constrain the graph of knowledge it uses as a reference.
That’s where the protein army comes in.
On any given day, millions of people fire up their browsers chasing some topic they’re trying to find the answer to. If they’re like most people, they’ll enter their search request and then click on the first five or six links, with one very significant change in behavior compared to searches of yore: they’ll open each in a tab. Then they’ll rephrase their question a couple of different ways, look for promising titles, and repeat until they have a browser with 10–20 tabs that SHOULD have the answer.
Over time, they’ll end up with multiple browsers filled with tabs from various topics, representing the interests and questions they’ve come across over a few days or weeks. At the moment, this is a terrible practice, if a very common one. This author is as guilty as anyone else. But what if it weren’t a terrible idea?
So, here’s the crazy idea. What if your browser could read the articles for you? What if you gave your browser permission to look across the pages you were “tabbing” and find the pieces the documents have in common? Together with the original query and the power of the web, your browser could distill a much deeper level of understanding into a single-page document that prioritizes the information from the most repeated to the least. That’s the power of artificial intelligence. Suddenly, 20+ tabs of data become a single page with the most relevant data in one document for you to consume.
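As a rough sketch of what “most repeated to least” could mean in practice, here’s a toy frequency-based summarizer in Python. The tab text is made up, and real cross-document understanding would take far more than word counts — this only illustrates the ranking idea:

```python
# Hedged sketch: rank sentences from several open tabs by how often their
# words recur across all tabs, surfacing the most-repeated information first.
# The "tabs" list is hypothetical stand-in text, not real scraped pages.
from collections import Counter
import re

tabs = [
    "Intermittent fasting restricts eating to a daily window. A common window is eight hours.",
    "Most protocols for intermittent fasting use an eight hour eating window each day.",
    "Some people report better focus. Intermittent fasting is not suitable for everyone.",
]

def summarize(tabs, top_n=3):
    # Split every tab into sentences, then count word frequency across all tabs.
    sentences = [s.strip() for doc in tabs
                 for s in re.split(r"(?<=[.!?])\s+", doc) if s.strip()]
    words = Counter(w for s in sentences for w in re.findall(r"[a-z]+", s.lower()))

    def score(sentence):
        # Average corpus-wide frequency of the sentence's words.
        tokens = re.findall(r"[a-z]+", sentence.lower())
        return sum(words[t] for t in tokens) / max(len(tokens), 1)

    return sorted(sentences, key=score, reverse=True)[:top_n]

summary = summarize(tabs)
```

Sentences whose vocabulary repeats across many tabs float to the top, while one-off claims sink — a crude stand-in for the “relevant pieces between the documents” the browser would extract.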
From a product perspective, this data would let you tweak the results as needed, and the analysis would then be sent back to the server to provide an altogether deeper understanding of the relationship between pages of data and the intent that brought those documents together. Think of it as micro-webs of data with search queries as the nucleus. No longer a flat web of data you comb for the answer, but a kind of knowledge organism composed of clusters of these search atoms.
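What a “search atom” would actually look like on the wire is an open question; purely as illustration, here’s one hypothetical Python shape (every field name here is invented):

```python
# Hypothetical shape of the "search atom" a client might send back to the
# server: the query at the nucleus, the tabs it pulled together, and the
# user-tweaked relevance ordering. Field names are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class SearchAtom:
    query: str                                    # the question that started the session
    urls: list                                    # the tabs opened while researching it
    ranking: list = field(default_factory=list)   # user-adjusted relevance order

atom = SearchAtom(
    query="intermittent fasting eating window",
    urls=["https://example.org/if-basics", "https://example.net/if-protocols"],
)
# The user reorders the results before the atom is uploaded.
atom.ranking = list(reversed(atom.urls))
```

A cluster of such atoms sharing URLs is the “micro-web”: the server can see which pages keep being pulled together, and by which intents.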
Now, wasn’t this whole AI thing supposedly too expensive to run at scale? Yes, but when it runs on private computers across the globe, that’s no longer the search company’s concern. The workload is offloaded to the client parsing the data. I can load up 30 tabs on intermittent fasting, hit “go”, and go to bed while my processor plays scorched earth with my office desk. Or I can have a real-time tab that updates with new results as I add to my search domain, working async with my personal RA. When I’m done, I have the results outlined and ready to go, and the search company has a far deeper understanding of those 30 pages, which kinds of people would find them most relevant, and which pages are potentially just copying everyone else.
That’s it. The elevator-got-stuck pitch about harnessing the lost utility of willing participants in the data-parsing game. There are a lot of execution details to be figured out, including user expectations around battery consumption, how to keep bad actors from uploading skewed results, and a rather intricate user flow from a UX perspective. Maybe more on those details later. Leave your thoughts below or feed my ego and follow me on Twitter @motleydev — either way, thanks for reading!