Hivemind Quick Start

Christian Gilson
Hivemind

--

Welcome to the first Hivemind blog post! 👋👋👋

In these blog posts, we’ll look at aspects of data science pertinent to building high-quality datasets, discussing the challenges involved and how Hivemind’s features can help you deal with them.

But first, what is Hivemind?

Hivemind provides data science tools to help you build, clean and enrich your research and business datasets. At Hivemind, we specialise in building structured datasets from unstructured sources such as text, images, and videos… or information which has yet to be expressed such as thoughts, opinions, and expertise.

Hivemind works by breaking down problems into bite-sized tasks. Conceptually, Hivemind is agnostic as to who or what completes each task: data processes are distributed to the most appropriate human or computational agent for the job.

To solve complicated problems, Hivemind co-ordinates chains of tasks, where data collected at each link is inherited by downstream links. The resulting final dataset at the end of the chain can therefore be as intricate and comprehensive as is required.

Tasks can be created and monitored via our Studio portal. Studio also allows users to download task results, as well as administer new Hivemind users and keep an eye on the quality of contributor output. Contributors work on tasks via our Workbench portal. Data scientists and developers can take a fully programmatic approach to building Hivemind workflows using our API.

An introductory example

This blog post will act as a jumping-off point for new Hivemind users looking to create their first tasks using the Hivemind UI. We’ll look at more programmatic methods to create and analyse tasks using Hivemind’s API in later posts.

Ahead of this blog’s (rather niche!) worked example, here’s some food for thought as to the kinds of datasets that Hivemind has recently built or cleaned:

  • Complicated corporate action events, such as mergers, built from unstructured primary sources such as regulatory filings, press releases, and newspapers;
  • Relationships between entities such as people, places, and companies from text documents;
  • Machine learning training data for both natural language processing and computer vision models;
  • Reference data mapping of company, person, product, and brand names to proprietary and third-party identifiers;
  • Comprehensive datasets of documents and reports, and subsequent collection of company specific metrics from those reports;
  • Tweet / newsfeed human sentiment analysis directed towards specific entities of interest.

Our worked example in this blog will show how Hivemind can be used to collect historical timeseries related to the Suez and Panama canals in the 19th and 20th centuries.

Hivemind basics: systematically collecting data from a document

Say we want to collect all of the timeseries values from this:

A century of operational statistics for the Suez and Panama canals (University of Florida Digital Collection)

You can see there are three timeseries (number of ships transiting, net tonnage, and gross revenue), for two canals (Suez and Panama), spanning a century (1870–1970).

The output dataset we’d like from this example has the following structure:
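Each row pairs a canal, a year, and a metric with a single collected value (the column names below are illustrative):

    canal  | year | metric                     | value
    Suez   | 1898 | number of ships transiting | …
    Panama | 1950 | net tonnage                | …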

Each timeseries value that we’d like to collect is attributed to a canal-year-metric tuple, providing us with a very natural unit of work that we can distribute as instances of the task. There will therefore be roughly 2 canals × 3 metrics × 100 years = 600 task instances in total.

Task creation

Hivemind’s Studio portal provides users with a UI to create tasks. The main aspects of task design are:

  1. Providing initial task configuration details;
  2. Writing instructions for the contributors completing the task;
  3. Designing what the output of a task looks like;
  4. Uploading any files that require processing as part of the task;
  5. Populating the task with instances.

(Everything done using the UI in this blog can also be done programmatically via Hivemind’s API.)

Initialising a task

Task creation starts by giving your task a name, and assigning it to a project (projects define the broad goal of the work).

You then select how many different contributors you want completing each instance of your task. Humans are prone to the occasional (and sometimes not-so-occasional…) mistake, so generating multiple independent answers for each instance is often essential for producing high-quality datasets.

Hivemind provides in-built agreement checking between different contributor answers, but we’ll keep that switched off for now: in the next blog we’ll explore how to reach a consensus from the crowd using Hivemind’s open-source software.

The final bit of configuration involves assigning qualifications to your task. With Hivemind, qualifications are assigned to both contributors and tasks; a task becomes visible to a contributor only when their qualifications intersect, giving you full control over who sees what.

Instructions and design

Systematic task instructions act as the “code” that tells a contributor how to operate when tackling a task, often including examples and heuristics for handling the edge cases that frequently arise when transforming unstructured information. In this example, as we’re transcribing well-defined structured content, the instructions are short and sweet:

You will be provided with a Canal-Year-Metric tuple.
Collect the timeseries value associated with that tuple via the provided document.

JSON Schema is used to design task output. A core philosophy at Hivemind when faced with a data problem is to first think about the ideal output data structure to solve that problem, and work backwards from there. JSON Schema provides the flexibility to design any data structure we’d like, as well as validation constraints to help notify contributors of likely data entry errors before task submission.

In our maritime example we essentially just want a single number as output, so our JSON Schema will be fairly basic. That said, there are often gaps in datasets describing events due to discontinuities in the real world; the eight-year closure of the Suez Canal at the start of the Six-Day War in 1967 illustrates the point in our case. So it’s often sensible to provide contributors with a set of ‘get-out-of-jail-free’ options should the unexpected happen.

Here’s the JSON schema used to create this task:

Hivemind task output structure is designed using JSON Schema
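As a rough sketch of what such a schema can look like, expressed here as a Python dict (the field names and fallback options are illustrative assumptions, not Hivemind’s exact schema):

    # Subschema for the timeseries value itself: a number, or null
    # when no value can be collected.
    value_subschema = {
        "type": ["number", "null"],
        "title": "Timeseries value",
        "description": "The value for this canal-year-metric tuple",
    }

    # Subschema giving contributors 'get-out-of-jail-free' options
    # for real-world gaps, such as the Suez closure from 1967.
    status_subschema = {
        "type": "string",
        "title": "Status",
        "enum": [
            "value collected",
            "canal not yet open",
            "canal closed",
            "value not in document",
        ],
    }

    # The main schema object that's displayed to the contributor.
    schema = {
        "type": "object",
        "properties": {
            "value": value_subschema,
            "status": status_subschema,
        },
        "required": ["status"],
    }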

In the code above, subschemas are first used to define the data types that will populate your output form design. Combinations of string, integer, number, null, and boolean definitions can be used to create arrays and objects, before finally being plugged into the main schema object that’s displayed to the contributor. Generating dynamic JSON Schema per instance is easily configured using Hivemind’s API, and will be the subject of a future post.

Contributor output

Contributors entering data via Hivemind Workbench are presented with a simple form generated from the task’s JSON Schema.

Populating your task with instances

Once the task skeleton has been defined, the next step is to upload any files we wish to provide to contributors to help with the task.

Files uploaded into Hivemind go into Amazon S3 buckets

We want our historical document to be accessible from each task instance defined by a canal-year-metric tuple. To achieve this, we upload the document to Hivemind by dragging and dropping the file into the UI, which stores the file in a task-specific S3 bucket on AWS.

We provide a link to the uploaded file using the following syntax:

![file, Annual_Figures_Suez_Panama.pdf]

in each of our task instances, along with the canal-year-metric tuple, a unique instance identifier, and a human-readable instance name. We feed this information into Hivemind much as we uploaded the files: by dragging and dropping the CSV below into the UI, where each row of the CSV generates a single task instance.

Instance generation using a CSV

Lastly, we specify which fields we wish to display to contributors in each instance. Hitting submit at this point will release the task to Workbench, where contributors can get to work!

Generating ~450 instances from the CSV (rows are omitted for years before the Panama Canal opened and while the Suez Canal was blocked)
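Rather than typing ~450 rows by hand, the CSV itself can be generated with a few lines of pandas. A minimal sketch, assuming illustrative column names and using 1914 (the Panama Canal’s opening) and 1967 (the start of the Suez closure) as the cut-offs:

    # Sketch: generate the instance CSV with pandas. Column names and
    # the instance naming scheme are illustrative, not a required format.
    import pandas as pd

    metrics = ["number of ships transiting", "net tonnage", "gross revenue"]
    years = {
        "Suez": range(1870, 1967),    # blocked from 1967 (Six-Day War)
        "Panama": range(1914, 1971),  # opened in 1914
    }

    rows = [
        {
            "instance_name": f"{canal}-{year}-{metric}",
            "canal": canal,
            "year": year,
            "metric": metric,
            "document": "![file, Annual_Figures_Suez_Panama.pdf]",
        }
        for canal, canal_years in years.items()
        for year in canal_years
        for metric in metrics
    ]

    pd.DataFrame(rows).to_csv("suez_panama_instances.csv", index=False)
    print(len(rows))  # 462 rows, in line with the ~450 quoted above

Each row of this CSV then becomes one task instance, with the document made accessible via the ![file, …] syntax shown earlier.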

Conclusion

In this blog post, we’ve taken a relatively code-free tour of Hivemind, using a simple data collection example as a quick reference guide to creating new tasks.

After having used Hivemind to solve hundreds of different data science challenges, I personally still opt for the UI when designing tasks, as it’s such a visual process. For everything else (instance generation, results analysis, qualification distribution, and so on), I tend to use Hivemind’s API to programmatically build and analyse ongoing workflows and large-scale data collection processes.

In the next technical post, we’ll delve into the raw results from our Suez and Panama example to produce a consensus dataset, illustrating how Hivemind works alongside other data science libraries like pandas, numpy, and matplotlib.
