Using the New Enigma Python SDK

Peter Henderson
Published in Enigma Engineering
Aug 22, 2018 · 8 min read

This week, Enigma announced the release of the Python SDK for its public data platform, Enigma Public. The SDK is designed to make it easier to locate and import public data into a Jupyter notebook or any other Python project using the Enigma Public API. I had a chance to try it out before its release — here are some of my findings.

A while ago I built an app, Enigma Public Power Search, before the Python SDK existed. The app is written in Python and makes heavy use of the Requests library to make HTTP calls to the Enigma Public API. There’s a lot of requesting, a lot of looping to handle pagination, and a lot of digging through long JSON responses. Since an SDK is supposed to wrap finicky HTTP requests, provide helper methods to perform common tasks, and let developers interact with native objects instead of navigating through complex JSON hierarchies, I figured the new Enigma Python SDK would simplify things. I wanted to see how much easier it would be to write the search code and whether it could actually improve my app’s performance.

The app’s core feature is a “proximity search” function that searches across all of Enigma Public for data records where the words you want are in the same or adjacent columns. Often people search Enigma Public for a specific person, company, address, etc. The standard search returns a lot of false hits and doesn’t offer a way to view the resulting rows without opening each dataset in turn, which can take a while. For example, you might be looking for information on “Michael Cohen” but you get hits on “Michael Smith” who lives on “Cohen Street” or “Stephen Cohen” who’s an officer of the same company as “Michael Ferrone.” Power Search greatly reduces the number of false hits and displays all matching rows across all datasets.

In this post, I’ll show you some of the original proximity search code alongside the new SDK-powered version. I’m no power Pythonista, so you’ll have to excuse my hacky code. However, I can offer some insights into the new SDK and why you might want to use it if you’re a Python programmer interested in exploring public data.

How proximity search works

Here’s an outline of how the proximity search function works:

  1. It searches Enigma Public to locate all datasets that include the specified search words. This will include all those “false hits” mentioned earlier.
  2. It then fetches the matching rows from those datasets. The function fetches them in batches because if the number of matches is large, the connection may time out or the API may refuse the request.
  3. For each batch, it goes through each row of each dataset to determine if the search words are in the same column (distance=1), adjacent columns (distance=2), or within three adjacent columns (distance=3), depending on the user’s selection.
  4. It gathers up all the proximity matched rows along with information about the dataset each group came from, and returns these to the app.

It’s a brute force algorithm that requires a lot of API interaction and fetching of data, but it works! Let’s go through each step in turn, comparing the pre-SDK code with the SDK-powered code.

Getting the dataset count

Step 1 is to locate all datasets that include the search words. Here’s how the pre-SDK function works:

  1. It sets up the required HTTP session object.
  2. It makes a HEAD request (which returns only the response headers and not the actual data) to the /datasets/ endpoint. The query parameter specifies the search words (for example, “michael cohen”). Setting match_metadata and include_serialids to false says we’re not interested in metadata hits or serial IDs.
  3. It reads the response headers and extracts the total number of matching datasets from the content-range header (the total appears after the “/” in that header, which is why the code below splits on it).

Pre-SDK code:

In [1]:

import requests

session = requests.Session()
headers = {'authorization': 'Bearer ' + '<YOUR_API_KEY>'}
session.headers.update(headers)

url = 'https://public.enigma.com/api/datasets/'
phrase = 'michael cohen'
params = {'query': phrase,
          'row_limit': 1000,
          'match_metadata': 'false',
          'include_serialids': 'false'
          }

response = session.head(url, headers=headers, params=params)
ds_count = int(response.headers.get('content-range').split("/")[1])
ds_count

Out[1]:

580

It’s reporting that there are 580 datasets on Enigma Public with at least one instance of “Michael” and “Cohen” in the same row.

The SDK simplifies development in several ways:

  • First, it handles the HTTP session and all HTTP requests for you. This includes automatic retries, connection pooling, and passing of API keys. You create an SDK client object, add your API key, and you’re done. If you’ve never managed HTTP requests before, it doesn’t matter.
  • Second, you don’t have to dig through response headers to find the number of matching datasets; the SDK handles this automatically.
  • Third (and this is a real bonus), the SDK’s datasets.list() method with the .all() operator returns a ResourceList object that represents the entire batch and supports all standard list operations (indexing, slicing, iterating over, etc.), but fetches data from Enigma Public automatically as needed. Nice! We’ll use it in the next step to get the data without making any further requests in the code.

SDK-powered code:

In [2]:

import enigma

public = enigma.Public()
public.set_auth(apikey='YOUR-API-KEY')

phrase = 'michael cohen'
datasets = public.datasets.list(
    query=phrase,
    row_limit=1000,
    match_metadata=False,
    include_serialids=False
).all()
len(datasets)

Out[2]:

580

We got the same result and the code is a lot simpler, but the real benefits come in the next step. In case you’re wondering, it took just a fraction longer to return the magic ResourceList than it did to execute the HEAD request in the pre-SDK code.
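To make that list-like behavior concrete, here’s a minimal sketch reusing the datasets ResourceList from the cell above (indexing and slicing are among the list operations the SDK supports, and the attribute names match those used later in this post):

# Treat the lazily fetched ResourceList like a regular list
print(datasets[0].display_name)    # first matching dataset

# Slice off the first few datasets; the SDK fetches them on demand
for dataset in datasets[:5]:
    print(dataset.display_name)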

Getting the matching rows in batches

Step 2 is to fetch the matching rows from each of the datasets located in step 1. The pre-SDK function makes a GET request to the /datasets/ endpoint. By default, the API returns results for the first 20 datasets. Although you can request more (up to 1,000) by including a Range header, there’s a good chance the API will refuse your request or time out if the quantity of data is too large. For this reason, you should request the data in chunks.

The pre-SDK code below requests the datasets in batches of 10 (an arbitrary choice, but it seems to work). The API returns the requested data as a JSON list, where each list item represents one dataset. For each dataset, the JSON includes:

  • The dataset metadata (display_name, id, etc.)
  • Data for the current snapshot, including the field names and up to 1,000 records (since we specified row_limit: 1000)

The code below iterates through the list of datasets and creates a list-of-lists with all the matching rows, as well as information about each dataset. The third-to-last line truncates any values with more than 100 characters, since some datasets (for example, the U.S. Patent and Trademark Office datasets) include extremely long field values.

Pre-SDK code

In [3]:

results = []
for i in range(0, ds_count, 10):
    headers['Range'] = 'resources={}-{}'.format(i, i + 9)
    session.headers.update(headers)
    response = session.get(url, headers=headers, params=params).json()
    for dataset in response:
        results.append([dataset['display_name']])
        results.append(dataset['current_snapshot']['table_rows']['fields'])
        rows = dataset['current_snapshot']['table_rows']['rows']
        for row in rows:
            row = [value[:100] if isinstance(value, str) else '' for value in row]
            results.append(row)
len(results)

Out[3]:

24432

Those 580 datasets yielded 24,432 matching rows.

Now let’s look at the SDK-powered version. Some things to note in the code below:

  • There are only two for loops, instead of the three in the pre-SDK version. The SDK handles pagination for us, so we don’t need to worry about batches.
  • We’re interacting with Python objects using the familiar object.attribute notation, rather than navigating complicated JSON trees.
  • We’re using the SDK’s nifty TableView class, which offers convenient ways to reference snapshot rows and their fields.

SDK-powered code

In [4]:

results = []
for dataset in datasets:
    tableview = dataset.current_snapshot.table_rows
    results.append([dataset.display_name] + [''] * 10)
    results.append([field.display_name for field in tableview.fields])
    for row in tableview:
        row = [value[:100] if isinstance(value, str) else '' for value in row]
        results.append(row)
len(results)

Out[4]:

24432

Finding proximity matches

Now that we have all the potential matches in memory, we’re done with the SDK — it’s pure Python from here on. In case you’re curious how a hacker does a proximity search, I’ll continue.

The code below uses the same nested for loop, but analyzes each row before appending it to the results. Since it’s possible a dataset has no rows that meet the proximity-matching criteria, we delay appending the metadata until we know there’s at least one matching row.

The matching algorithm requires that all cell values are strings. Since some values are read as Booleans (True/False), it converts these to strings. It also filters out None values, makes everything lowercase, and removes any punctuation. The inner for loop traverses each row using a “window” of size distance (set here to 2, but you could set it to 3, for example, if you think there might be a separate column for a middle initial). If all the search terms appear within the sliding window at any point during the traversal, the match flag is set to True and we add the row to the results.

In [5]:

import string

table = str.maketrans('', '', string.punctuation)

results = []
distance = 2
for dataset in datasets:
    tableview = dataset.current_snapshot.table_rows
    first_match = True
    for row in tableview:
        row = [str(value) if isinstance(value, bool) else value for value in row]
        match = False
        for j in range(0, len(row) - distance + 1):
            words = (' '.join(filter(None, row[j: j + distance]))).lower().translate(table)
            if all(s in words.split() for s in phrase.lower().split()):
                match = True
                break
        if match:
            if first_match:
                results.append([dataset.display_name] + [''] * 10)
                results.append([field.display_name for field in tableview.fields])
                first_match = False
            row = [value[:100] if isinstance(value, str) else '' for value in row]
            results.append(row)

To see the results, I’ll convert the results array into a pandas DataFrame so it displays nicely. The table is long, so I’ll display just the first few rows, but you could easily use the pandas to_csv() function to save the results and view them in a spreadsheet.

In [6]:

import pandas as pd

df = pd.DataFrame(results)
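For example, to preview the table inline and then export the whole thing for a spreadsheet (the file name here is just an illustration):

df.head(10)  # display the first few rows of the results table

# Save everything; skip the index and the default integer column names
df.to_csv('proximity_results.csv', index=False, header=False)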

Should I use the SDK or the API?

If you’re writing a Python app to explore Enigma Public, I strongly recommend the SDK. It offers all the functionality that’s available through the API, but makes things much simpler and your app will likely run faster. The SDK is only available for Python though, so if you want to write (say) a JavaScript app, you’ll need to use the API directly.

If you’re intrigued and would like to build an app using the Python SDK, there’s plenty of information available on the Enigma documentation site. And if you do build something, please let us know! We’d love to showcase it on our projects page.

__________________________________________________________________

We’re currently hosting a contest for projects — data science, data visualization, data journalism, and more — that use Enigma’s public data to explore the upcoming U.S. elections. Enter today!

Originally published at www.enigma.com on August 22, 2018.
