Data Engineering with Humans in the Loop

Wyatt Shapiro
Published in B6 Engineering
Oct 7, 2020

When we talk about data from a software engineering perspective, we often talk about how big it is: the number of events (volume), the rate at which it is generated (velocity), the different formats it comes in (variety). These terms are especially helpful for boasting about the power of our machines.

But more important than all of those things is the actual usefulness (value) of the data to the people consuming it. In other words, the power of humans to drive insight and take action as a result of the work the machines are doing.

Photo by Rishi from Unsplash

In our case at a commercial brokerage, we wrangle data from third-party and public sources that includes entities such as Transactions, Mortgages, Property, Ownership, and Lenders. One of our goals in ingesting these streams of data is to figure out exactly who owns a property and present that to our brokers. This is essential to our brokers because it’s impossible to reach out to, and eventually represent, an owner in a property sale or mortgage if you do not know who they are in the first place.

However, true ownership is harder to discern than it might appear. Oftentimes, LLCs are constructed for the purpose of a single transaction, or attorneys are the signatories on a sale deed. The data we can automatically provide to our brokers is limited to what is publicly available from the government or other third-party sources, and ultimately that isn’t granular enough to give our team a competitive edge.

So how do we make data like this more useful to brokers?

It all comes down to building tools that enable our brokers to enhance the data themselves. We need to design data systems that can help experts in their day-to-day work and that experts can help through their usage.

By trusting our brokers to actively work with their data, we can leverage their expertise to improve our data, which benefits property owners, other B6 brokers, internal data analysts, and even our machines.

Here are two processes that we have designed and implemented that enable our machines and human experts to work together:

1. Weekly Transfer Verification

Every week our brokers receive a personalized list of transactions that have occurred in their territory. Originally, we were interested in ingesting the transaction data and loading it directly into our applications and data warehouse to analyze. However, the public records of these transactions didn’t contain the breadth of data (e.g., cap rate) or depth of data (e.g., the actual buyer) that the brokers or analysts desired. To mitigate this issue, we created a manual verification step that occurs before data is propagated across our system. This is a chance for brokers to review data for errors, add additional data points, research shell companies to find the actual buyers/sellers, and even add context about why a transaction may have occurred.
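A rough sketch of how that gate can look is below; the field names and statuses (actual_buyer, cap_rate, etc.) are illustrative rather than our actual schema. Each ingested transaction carries a verification status, and only broker-verified records flow downstream:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class VerificationStatus(Enum):
    PENDING = "pending"      # freshly ingested, awaiting broker review
    VERIFIED = "verified"    # broker confirmed and/or enriched the record
    REJECTED = "rejected"    # broker flagged the record as erroneous


@dataclass
class TransactionRecord:
    transaction_id: str
    recorded_buyer: str                  # name from the public record, often an LLC
    sale_price: float
    status: VerificationStatus = VerificationStatus.PENDING
    actual_buyer: Optional[str] = None   # filled in after broker research
    cap_rate: Optional[float] = None     # not in public records; added by the broker
    notes: str = ""                      # broker context on why the deal happened


def propagate_verified(records: list[TransactionRecord]) -> list[TransactionRecord]:
    """Only broker-verified transactions flow on to the warehouse and apps."""
    return [r for r in records if r.status is VerificationStatus.VERIFIED]
```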

Not only does this give each broker an automated pipeline for understanding what is happening in their territory at a granular level, it also lets them identify and reach out to the active sellers or buyers they have uncovered who may be interested in doing more deals.

The downside of this system is that the data takes longer to reach a state where we are confident it can be analyzed downstream, because there is a lag in how quickly a broker or analyst can verify the transaction data flowing in. However, this seems like a worthwhile price to pay for high-quality data while simultaneously keeping brokers informed of the new activity they want to know about.

2. Entity Resolution and Merging Duplicates

As with many sales organizations, we use Salesforce as our CRM. One issue with managing contact and company data is the seemingly endless number of duplicates that get created. As the CRM fills up with duplicates, users are less likely to improve any one company or contact because their edits will not be propagated to the seven other instances of that company. It is easy to believe that an algorithm could be trained to quickly reduce the number of duplicates, but in practice we found it hard to rely on that process alone.

In training sets built from actual companies and contacts, we found many edge cases that were difficult to identify. Entity resolution became even more difficult when few data points had been collected for a given entity. For example, when we ingest mortgages, sometimes we only get the name of the lender. Without other data points such as an email domain or underlying contacts, it can be very difficult to identify duplicate lenders. How can we ensure an algorithm marks CIT Bank and Citi Bank as distinct while simultaneously flagging JP Morgan Chase and JP Morgan as duplicates?
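To make the difficulty concrete, here is a toy sketch (not our production matcher) of why raw string similarity alone gets those two cases exactly backwards:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Plain character-level similarity between two lowercased names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


pairs = [
    ("CIT Bank", "Citi Bank"),         # distinct lenders, nearly identical strings
    ("JP Morgan Chase", "JP Morgan"),  # same lender, noticeably different strings
]

for a, b in pairs:
    print(f"{a} vs {b}: {similarity(a, b):.2f}")

# Prints roughly:
#   CIT Bank vs Citi Bank: 0.94         <- a naive threshold would merge these
#   JP Morgan Chase vs JP Morgan: 0.75  <- ...while missing this true duplicate
```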

In general, we can deploy an algorithm that favors precision in order to reduce the number of false positives. But then we have not achieved our goal of significantly reducing duplicate contacts and companies!
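Continuing the toy sketch above, favoring precision can be as simple as a deliberately strict auto-merge threshold (the value here is purely illustrative), which by design leaves many true duplicates untouched:

```python
from difflib import SequenceMatcher

AUTO_MERGE_THRESHOLD = 0.97  # strict on purpose: prefer missed duplicates over bad merges


def should_auto_merge(a: str, b: str) -> bool:
    """Merge automatically only when nearly certain; leave the rest to broker review."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return score >= AUTO_MERGE_THRESHOLD
```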

This is why we also employ a tool that enables our brokers to merge duplicates themselves. As brokers merge duplicate companies or contacts that were not caught by our machines, we log the manually merged sets and continue to gather more training data to refine our duplicate-detection algorithms. In this way, the machines reduce duplicates as a first pass, our expert brokers further improve the data with a merging tool (we highly recommend Duplicate Check For Salesforce), and we collect those human merges so that the algorithm can be fine-tuned on real-world cases later.
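One lightweight way to capture those human merges as labeled training data is to append every broker-confirmed merge to a log of duplicate pairs. The sketch below uses hypothetical names (log_manual_merge, manual_merges.csv) rather than our actual Salesforce integration:

```python
import csv
from datetime import datetime, timezone


def log_manual_merge(kept_name: str, merged_name: str, broker_id: str,
                     path: str = "manual_merges.csv") -> None:
    """Append a broker-confirmed duplicate pair for later model retraining."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now(timezone.utc).isoformat(),
            broker_id,
            kept_name,    # the surviving record
            merged_name,  # the record the broker folded into it
            1,            # label: confirmed duplicate
        ])


# Hypothetical example: a broker merges two records of the same lender.
log_manual_merge("JP Morgan Chase", "JP Morgan", broker_id="broker-042")
```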

From our work in the commercial real estate space so far, it is clear that building analytical data systems that rely on machines alone is not enough. Instead, the engineering and product teams at B6 will continue to focus our efforts on delivering technology and processes in which machines can help humans and humans can help machines.
