A world in text

Patrick Skjennum
Published in Strise
4 min read · May 13, 2020

Behind Strise’s mission to make the world’s information useful across the enterprise sits our technological platform. This is how we built it.

The world never sleeps, and neither does the constant stream of data it sends into our system. Combined with our global ambition of empowering individuals to always have the information that matters most at their fingertips, this results in some very interesting technical challenges.

Not only does the platform need to be up 100% of the time; our model of the world, the Strise Knowledge Graph, also needs to be continuously updated, enriched, and monitored, so that we are always serving our users fresh information about the things they care about. And it doesn’t hurt if the system is economically viable, too!

Luckily, as the past four years have proven, Google Cloud provides the perfect platform for building such a system; put together correctly, it scales fluidly with the business requirements of our company without sacrificing speed, quality, or developer sanity.

Infrastructure

Our take on this was to build a system of services running solely on preemptible nodes in Kubernetes, assisted by a framework of ephemeral Dataproc clusters for heavy-duty data analysis jobs. Everything is glued together with Cloud Pub/Sub for internal message passing and Stackdriver for real-time monitoring and logging, with important events conveniently routed directly to our company Slack.
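
As a rough illustration (not Strise’s actual configuration), spinning up such an ephemeral cluster with a recent google-cloud-dataproc Python client might look like the sketch below; the project, region, machine types, and idle timeout are all placeholder assumptions:

```python
from google.cloud import dataproc_v1

# Placeholder values; Strise's real setup is not described in the post.
PROJECT, REGION = "my-project", "europe-north1"

client = dataproc_v1.ClusterControllerClient(
    client_options={"api_endpoint": f"{REGION}-dataproc.googleapis.com:443"}
)

cluster = {
    "project_id": PROJECT,
    "cluster_name": "ephemeral-analysis",
    "config": {
        "master_config": {"num_instances": 1, "machine_type_uri": "n1-standard-4"},
        "worker_config": {"num_instances": 2, "machine_type_uri": "n1-standard-8"},
        # Cheap, revocable capacity for the heavy lifting.
        "secondary_worker_config": {
            "num_instances": 8,
            "preemptibility": dataproc_v1.InstanceGroupConfig.Preemptibility.PREEMPTIBLE,
        },
        # The cluster deletes itself after ten idle minutes, keeping it ephemeral.
        "lifecycle_config": {"idle_delete_ttl": {"seconds": 600}},
    },
}

operation = client.create_cluster(
    request={"project_id": PROJECT, "region": REGION, "cluster": cluster}
)
operation.result()  # block until the cluster is ready
```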

Content

From a data standpoint, Strise processes roughly three million incoming documents in several languages every day, continuously monitoring more than 40 million entities, ranging from companies and competitors to fine-grained industry trends. These events and entities are ultimately what constitute the Strise Knowledge Graph, atop which all our services are built.

Semantic Processing Pipeline

One of these services is our semantic text processing pipeline, which is logically split into the following three parts:

  • Content ingestion
  • Semantic processing
  • Knowledge enrichment

[Figure: Strise combines the continuous flow of external data with user interactions to create a personalized experience through the Knowledge Graph]

Content ingestion

At the time of writing, Strise monitors more than 160k sources of information, which enter our system through numerous protocols, in a variety of formats, and with different degrees of importance. Some are viral Tweets from bloggers seeking exposure, others are foreign-language news reports signaling adverse behavior by a local company, and some are simply junk.

Content ingesters are lightweight workers that clean and normalize all of this into a common format, then publish it, depending on the content, to different Pub/Sub topics for further processing.
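
A minimal sketch of that publish step using the google-cloud-pubsub Python client; the project name, topic name, and document schema here are hypothetical, since the post doesn’t describe the actual topic layout:

```python
import json
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
# Hypothetical project and topic; the real layout is not public.
topic_path = publisher.topic_path("my-project", "ingested-news")

def publish_document(doc: dict) -> None:
    """Publish one cleaned, normalized document for downstream processing."""
    data = json.dumps(doc).encode("utf-8")
    # Attributes let subscribers filter without parsing the payload.
    future = publisher.publish(topic_path, data, source=doc["source"])
    future.result()  # block until Pub/Sub has accepted the message

publish_document({"source": "twitter", "lang": "en",
                  "text": "Tesla founder going crazy on Twitter"})
```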

Semantic processing

For a human, reading a piece of text and understanding its contents is a pretty trivial task, but for computers it’s a whole other story. When a human reads “Tesla founder going crazy on Twitter”, they naturally assume that “Elon Musk is at it again”, automatically deciphering both that Elon Musk is the founder of Tesla and that he has a history of making headlines with his Tweets. A computer, on the other hand, would merely see a bunch of letters. Even if the computer knew that Tesla should be an entity, how would it know that it wasn’t referring to the father of magnetic flux?

This is where our semantic processing pipeline comes into play. Now that all the data is in a shared format, it is dissected into its most basic parts and reassembled in the form of entities (e.g. companies, locations, trends, …) and relations (e.g. “Elon Musk” “isFounderOf” “Tesla”). Using the Strise Knowledge Graph as a blueprint of the world, the pipeline assembles a graph of entities in the most probable configuration, with human-level precision, for each and every document.
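
The post doesn’t reveal the actual model, but the core idea of graph-based disambiguation can be sketched as scoring each candidate entity by a popularity prior plus its overlap with the rest of the document’s entities. All identifiers and numbers below are purely illustrative:

```python
from dataclasses import dataclass
from typing import List, Set

@dataclass
class Candidate:
    entity_id: str     # node in the knowledge graph
    prior: float       # how often this surface form means this entity
    related: Set[str]  # graph neighbours (founders, products, topics, ...)

def disambiguate(context: Set[str], candidates: List[Candidate]) -> str:
    """Pick the candidate whose graph neighbourhood best matches the document."""
    def score(c: Candidate) -> float:
        return c.prior + len(c.related & context)  # toy linear combination
    return max(candidates, key=score).entity_id

# "Tesla" in a document that also mentions Elon Musk and Twitter:
candidates = [
    Candidate("tesla_inc", 0.6, {"elon_musk", "twitter", "model_3"}),
    Candidate("nikola_tesla", 0.4, {"magnetic_flux", "alternating_current"}),
]
print(disambiguate({"elon_musk", "twitter"}, candidates))  # -> tesla_inc
```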

As a result, the system learns what we all knew all along: that the Tweet was indeed referring to Elon Musk.

Knowledge enrichment

But the work is not done yet. For each new document that enters our system, there’s also the potential to learn something new about the world. Did the Tweet affect Tesla’s stock price? Were there any news outlets reporting the incident? Did Tesla lose a contractor in the aftermath? While easy for a human to answer given access to the right material, these questions are extraordinarily complex to infer digitally in the general case.

The knowledge enrichment step is where we take everything we learned from the incoming document and put it back into the Knowledge Graph, which in turn updates its view of the world.
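
In its simplest form, that write-back is just adding new edges to the graph. The in-memory structure below is a stand-in for whatever store the real Knowledge Graph uses, and the triples are invented for illustration:

```python
from collections import defaultdict
from typing import Iterable, Tuple

# Toy in-memory stand-in for the Knowledge Graph's storage backend.
graph = defaultdict(set)

def enrich(triples: Iterable[Tuple[str, str, str]]) -> None:
    """Fold facts learned from a new document back into the graph."""
    for subject, predicate, obj in triples:
        graph[(subject, predicate)].add(obj)

# Hypothetical facts inferred from the example Tweet:
enrich([
    ("elon_musk", "isFounderOf", "tesla_inc"),
    ("tesla_inc", "mentionedIn", "tweet_123"),
])
```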

Staying ahead

Combining these three stages, the Strise platform is capable of learning from itself with minimal human intervention, currently counting 6.5 billion pieces of information. At Strise we believe this knowledge is the key ingredient in building a modern platform for staying ahead of the infinite stream of new information, and ultimately, change.
