Building An Aggregation Tool To Collect & Organize Data

Ahmed Hashim
CBI Engineering
Feb 17, 2020

There’s a wealth of data sources out there that could be useful when building a feed. Aggregating them in a way that makes sense for a user is key.

Let’s say you want to build an information feed that aggregates data from different sources.

Displaying this feed in chronological order (from newest to oldest) so all the content can be read in one place seems like table-stakes logic. Assuming you could store all this information in one data source (e.g., RDBMS), you could query and sort the data to suit your needs.

But as you can imagine, storing all of that in one place is impractical, if not impossible.

Let’s try something different.

The problem of relativity

Suppose these data feeds are social network APIs. Each individual API lets you fetch a page of posts sorted by date, as if each source were its own database you could query, sort, and paginate.

A naive solution might be to grab the first page from each of your sources, combine the results, and sort the aggregated result set, but there is no way to guarantee that the order stays consistent as the user scrolls down.

For example, Twitter’s second page of results can contain posts that should be displayed before the last item from Facebook’s first page.
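To make that concrete, here is a hypothetical set of post timestamps (purely illustrative):

```javascript
// Hypothetical post timestamps, newest first (illustrative data only).
const twitterPage1  = ['10:45', '10:40', '10:35'];
const twitterPage2  = ['10:30', '10:25']; // not fetched yet
const facebookPage1 = ['10:44', '10:20', '10:05'];

// Merging only the first pages gives:
//   10:45, 10:44, 10:40, 10:35, 10:20, 10:05
// But Twitter's second page (10:30, 10:25) belongs before Facebook's
// 10:20 and 10:05, so the order breaks once the next page is fetched.
```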

This is exactly the problem we had to solve when rebuilding the home feed here at CB Insights.

Queueing up: A simple and effective solution

In the previous example, it’s clear that the first Tweet should be displayed before the first Facebook post, because it’s the newer of the two.

So we can treat the posts from each social network as a queue: we pull only the newest post across all sources, and as the user scrolls down, we keep pulling one item at a time for as long as every queue still has items.

Once any queue is exhausted, we hold the unprocessed posts in memory, fetch more items for the depleted source, and repeat the process.
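A minimal sketch of that merge, assuming each queue arrives already sorted newest-first (the names and shapes here are illustrative):

```javascript
// Repeatedly take the newest head across all queues until the page is full
// or some queue runs dry; whatever is left stays buffered for the next pass.
function mergeNewest(queues, pageSize) {
  const remaining = queues.map((q) => [...q]); // don't mutate the inputs
  const page = [];

  while (page.length < pageSize && remaining.every((q) => q.length > 0)) {
    // Find the queue whose head is the newest post overall.
    let newest = 0;
    for (let i = 1; i < remaining.length; i++) {
      if (new Date(remaining[i][0].date) > new Date(remaining[newest][0].date)) {
        newest = i;
      }
    }
    page.push(remaining[newest].shift());
  }

  return { page, remaining };
}
```

Calling something like mergeNewest([twitterQueue, facebookQueue], 20) would return the next 20 posts to display, plus whatever stays buffered in each queue.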

Our main site is a single-page React/Redux application, so we can use the client’s browser memory to hold the items that can’t be displayed yet (because we don’t yet know their exact order).

But the same behavior could be achieved on the server side by using persistent storage for unprocessed items. This solution is also memory-efficient: at most it stores page size × (number of sources − 1) items; for example, with a page size of 20 and three sources, no more than 40 items wait in memory at any time.

The nitty-gritty: Implementation

We can leverage the Redux store to hold the state of the different queues in memory while we make requests to pull information and determine which articles to show.

Here’s an example of what this implementation might look like in the feed reducer slice of our store:

Initial reducer state
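In simplified form, that initial state might look something like this (the source names and fields are illustrative):

```javascript
// One slice per source, plus the merged list that actually gets rendered.
const initialState = {
  sources: {
    twitter:  { items: [], complete: false },
    facebook: { items: [], complete: false },
    news:     { items: [], complete: false },
  },
  itemsToRender: [], // the merged, ready-to-display feed
  loading: false,
};
```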

Upon load, each news source gets queried asynchronously to fetch articles. The results are then sorted by date descending, and finally tacked on to the “items” key in their respective slice of the store:

Storing articles from various feeds
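A simplified version of that reducer case might look like this (the action type and payload shape are illustrative):

```javascript
// Append a source's newly fetched articles, sorted newest-first, onto the
// "items" queue in that source's slice of the store.
function feedReducer(state = initialState, action) {
  switch (action.type) {
    case 'feed/FETCH_SOURCE_SUCCESS': {
      const { source, articles } = action.payload;
      const sorted = [...articles].sort(
        (a, b) => new Date(b.date) - new Date(a.date)
      );
      return {
        ...state,
        sources: {
          ...state.sources,
          [source]: {
            ...state.sources[source],
            items: [...state.sources[source].items, ...sorted],
          },
        },
      };
    }
    default:
      return state;
  }
}
```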

When all the requests are complete, our sorting algorithm grabs items from the head of each queue and merges them by date until either the page size limit has been reached, or one of the queues is exhausted. We then take this new list and place it in the itemsToRender key, which holds the final result displayed to the user:

After we sort & determine which articles to render
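Roughly, that step might look like the following, building on the mergeNewest sketch above (again, the names are illustrative):

```javascript
// Pull the next page out of the per-source queues and append it to
// itemsToRender; anything a queue didn't contribute stays buffered.
function applyMerge(state, pageSize) {
  // Sources that are complete and fully drained no longer participate.
  const names = Object.keys(state.sources).filter(
    (name) =>
      !(state.sources[name].complete && state.sources[name].items.length === 0)
  );

  const { page, remaining } = mergeNewest(
    names.map((name) => state.sources[name].items),
    pageSize
  );

  const sources = { ...state.sources };
  names.forEach((name, i) => {
    sources[name] = { ...sources[name], items: remaining[i] };
  });

  return { ...state, sources, itemsToRender: [...state.itemsToRender, ...page] };
}
```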

Once a source is exhausted (i.e., the request returns fewer articles than the page size), we mark it as complete and remove it as an input to the sorting algorithm:

First feed source is complete
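The check itself is simple; a helper along these lines captures it (illustrative):

```javascript
// A response shorter than a full page means the source has no more articles;
// once flagged (and drained), the merge step above stops considering it.
function markIfComplete(sourceState, fetchedCount, pageSize) {
  return fetchedCount < pageSize
    ? { ...sourceState, complete: true }
    : sourceState;
}
```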

An added benefit: new pages load instantly as the user scrolls down the infinite feed (as long as there are already articles in the store). We also avoid making requests for more items until the articles held in memory have been depleted.

Because the initial requests are non-deterministic, we handle them in our Redux action creators, using thunks to fetch the data, which we then pass into our feed reducer.

Passing data from a thunk into the reducer
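A pared-down thunk for a single source might look like this (the endpoint and action types are hypothetical):

```javascript
// Fetch one page from one source, then hand the results to the feed reducer.
function fetchSource(source, page, pageSize) {
  return async (dispatch) => {
    dispatch({ type: 'feed/FETCH_SOURCE_START', payload: { source } });
    try {
      const response = await fetch(
        `/api/feed/${source}?page=${page}&pageSize=${pageSize}` // hypothetical endpoint
      );
      const articles = await response.json();
      dispatch({
        type: 'feed/FETCH_SOURCE_SUCCESS',
        payload: { source, articles, pageSize },
      });
    } catch (error) {
      dispatch({ type: 'feed/FETCH_SOURCE_FAILURE', payload: { source, error } });
    }
  };
}
```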

Now we can have deterministic testing around our data store, as well as both algorithms (sorting the initial results, and later combining the articles across data sources). This technique allows us to asynchronously fetch articles and serve them consistently and efficiently when querying multiple data sources with parameters we may not have control over.
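For example, a test against the merge step needs no network at all (Jest-style, illustrative, using the mergeNewest sketch from earlier):

```javascript
// With the fetching mocked away, the merge is a pure function, so ordering
// assertions are fully deterministic.
describe('mergeNewest', () => {
  it('stops merging once any queue runs dry', () => {
    const twitter  = [{ id: 't1', date: '2020-02-17T10:45:00Z' }];
    const facebook = [{ id: 'f1', date: '2020-02-17T10:44:00Z' }];

    const { page, remaining } = mergeNewest([twitter, facebook], 2);

    // Only the newest item is rendered; the Facebook post stays buffered
    // because Twitter's queue is now empty.
    expect(page.map((p) => p.id)).toEqual(['t1']);
    expect(remaining[1].map((p) => p.id)).toEqual(['f1']);
  });
});
```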

The world of data, filtered through you

This process clearly isn’t just about social networks. There’s a wealth of data sources out there that could be useful in the feed you’re building. Aggregating them in a way that makes sense for the user (even when you don’t have control over all parameters) is key.

And doing it efficiently? Now we’re really talking!

If you think you’ve got what it takes to help us sort & analyze data efficiently, come and work with us.

Originally published at https://www.cbinsights.com.
