How We Get Data Collected in the Field Ready for Use


At Spatial Networks, data collection is baked into our identity. Whether through Fulcrum, our SaaS mobile data collection solution, or through Foresight, our data-on-demand service, we are all about data collection.

It should come as no surprise that we lean heavily, though not exclusively, on Fulcrum for data collection in our Foresight service. Like many large Fulcrum users, we work with an extensive network of data collectors around the world, coordinating collection from our headquarters, performing QA, and using the Fulcrum infrastructure to drive data collection logistics.

Some data collection tools on the market use the same data structure throughout the data management process. This means that data structures designed for storage and analysis are pushed to field data collectors as well.

At Spatial Networks, we’ve long recognized that the behavior of data collection personnel on the ground is different from that of analysts or data managers elsewhere in an organization. Because of this, we optimize our Foresight data collection forms to ease the work of data collectors and then we process the data as it is synced back to Fulcrum and begins moving through our data pipeline.

We refer to this process as “data conditioning.” In many ways, data conditioning looks like traditional extract/transform/load (ETL) and, in fact, ETL tools make up the bulk of our process automation, but we bring other technologies to bear as well. Here are a few that we use:

  1. Fulcrum Desktop to sync data from Fulcrum into our PostgreSQL/PostGIS database and to move media files from Fulcrum to an S3 bucket.
  2. PostgreSQL with PostGIS for data storage, hosted in Amazon RDS.
  3. PostgreSQL full text indexing to enable search.
  4. Several custom PostgreSQL functions, triggered by Fulcrum Desktop, to perform initial conditioning routines, such as attaching ISO country codes, FIPS codes, and grid reference system identifiers.
  5. Safe Software FME to perform ETL to migrate data structures to conform to our production data schemata.
  6. Python. A good bit of Python. Many of our QA/QC routines are automated using Python.

Item number 4 is a good example of a place where data collection differs from data production. Information such as country codes is important for standardizing our data for production and analysis, but it does not need to be provided by collectors on the ground. Requiring it in the field would simply slow down collection unnecessarily.

Such items could also be attached in the field using Fulcrum data events or similar techniques, but this would require connectivity or the delivery of needlessly complex data models to mobile devices. Based on our years of experience performing field data collection, this type of reference information is best attached later in the pipeline, freeing collectors to perform their work efficiently, while also producing data that is useful for analysts.
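To make this concrete, here is a minimal sketch of that kind of downstream enrichment: a spatial join that stamps each newly synced record with an ISO country code once it lands in the staging database. The table and column names are hypothetical, and in our pipeline the equivalent logic lives in the custom PostgreSQL functions triggered by Fulcrum Desktop rather than a standalone script.

```python
# Hypothetical sketch: attach ISO country codes to newly synced records by
# spatially joining them against a reference country layer. Table and
# column names are illustrative only.
import psycopg2

ATTACH_COUNTRY_CODE_SQL = """
    UPDATE staging.survey_records AS r
    SET    iso_country_code = c.iso_a2
    FROM   reference.countries AS c
    WHERE  r.iso_country_code IS NULL
    AND    ST_Intersects(c.geom, r.geom);
"""

def attach_country_codes(dsn: str) -> int:
    """Run the enrichment and return the number of records updated."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(ATTACH_COUNTRY_CODE_SQL)
            return cur.rowcount  # connection commits on clean exit

if __name__ == "__main__":
    updated = attach_country_codes("dbname=foresight")
    print(f"attached country codes to {updated} records")
```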

As mentioned above, ETL is a big part of our conditioning process. We use FME, both Desktop and Server (cloud), to build, execute, and automate our ETL processes. The tasks we perform in FME range from routine to fairly complex. A few of our common ETL tasks are:

  • Database Sync — Field data collected with Fulcrum is synced weekly via Fulcrum Desktop to a staging PostGIS database, where the raw data enters the SNI data pipeline for a series of data conditioning, ETL, and data packaging processes. Any associated media files are synced to an S3 bucket for later use.
  • Database Updating — A primary FME workspace queries the staging database for any new, updated, or rejected survey records collected since the previous week's sync. These records are parsed, isolated, and used as input to the workspace, which performs the corresponding INSERT, UPDATE, or DELETE operations against the production Postgres database.
  • Data Conditioning — The same workspace also maps the raw attribute fields, as collected in the field, to the standardized and normalized production database tables. Some field names are renamed to common natural-language terms, and fields are arranged into a consistent order when displayed in a GIS attribute table.
  • Data Packaging — Another FME workspace runs SQL statements against the production database to select only certain datasets or areas of interest, based on spatial or attribute parameters defined by the end user. This greatly increases efficiency when performing customer-specific data packaging and delivery.
  • Media Packaging — Foresight data is often captured with accompanying media for each feature record. We built an FME workspace that parses the UUIDs from any media attribute fields and retrieves the corresponding media objects, such as photos or spatial video, from an AWS S3 bucket. The media is downloaded and packaged with the feature records and a data manifest as part of the deliverable (see the sketch after this list).
  • Automation — We standardize data storage and workflows by automating the FME workspaces above via FME Cloud Server. Workspaces are prototyped and tested in FME Desktop, with minimal programming, and once validated they are published to FME Server and scheduled to run without human intervention until a workspace needs to be revised with new data or workflow parameters.
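As a rough Python sketch of the media packaging step (the production version is an FME workspace), the snippet below reads the media UUIDs recorded on each exported feature and downloads the matching objects from S3 into a delivery folder. The bucket name, key layout, and field name are assumptions, as is the idea that photo fields arrive as comma-separated UUID lists in the exported records.

```python
# Hypothetical sketch of media packaging: pull the photo UUIDs stored on
# each exported feature record and download the matching objects from an
# S3 bucket into a delivery folder. Bucket, key layout, and field names
# are placeholders.
import csv
import pathlib

import boto3

BUCKET = "example-foresight-media"   # placeholder bucket name
PREFIX = "photos/"                   # placeholder key prefix

def package_media(records_csv: str, out_dir: str) -> None:
    s3 = boto3.client("s3")
    out = pathlib.Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with open(records_csv, newline="") as fh:
        for row in csv.DictReader(fh):
            # Assumes the photo field holds a comma-separated list of UUIDs.
            for uuid in row.get("photos", "").split(","):
                uuid = uuid.strip()
                if not uuid:
                    continue
                s3.download_file(BUCKET, f"{PREFIX}{uuid}.jpg",
                                 str(out / f"{uuid}.jpg"))

if __name__ == "__main__":
    package_media("features.csv", "delivery/media")
```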

Here is an example of one of our FME workbenches. For common, repeatable tasks, a workbench like this is prototyped on the desktop, then published to the server once it has been validated. On the server, its execution may be triggered on a schedule or called from another process, such as Fulcrum Desktop, using the FME Server REST API.
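As an illustration of that last point, a job can be submitted to a published workspace through the FME Server REST API with a few lines of Python. The host, repository, workspace, and token below are placeholders, and the exact endpoint path and parameter names may differ between FME Server versions, so treat this as a sketch rather than a drop-in call.

```python
# Hypothetical sketch: submit a published FME workspace for execution via
# the FME Server REST API. Host, repository, workspace, and token are
# placeholders; check your FME Server version's REST API documentation
# for the exact endpoint and parameter names.
import requests

FME_HOST = "https://fme.example.com"          # placeholder server URL
REPOSITORY = "Foresight"                      # placeholder repository
WORKSPACE = "weekly_conditioning.fmw"         # placeholder workspace
TOKEN = "YOUR_FME_TOKEN"                      # placeholder API token

def run_workspace(published_parameters: dict) -> dict:
    """Queue the workspace as a job and return the submission response."""
    url = (f"{FME_HOST}/fmerest/v3/transformations/submit/"
           f"{REPOSITORY}/{WORKSPACE}")
    payload = {"publishedParameters": [
        {"name": name, "value": value}
        for name, value in published_parameters.items()
    ]}
    resp = requests.post(
        url,
        json=payload,
        headers={"Authorization": f"fmetoken token={TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    print(run_workspace({"SOURCE_TABLE": "staging.survey_records"}))
```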

The conditioning and ETL of Fulcrum data into the data pipeline begins in a staging database, where the collected data is held until it enters the first phase of conditioning: standardizing the metadata and table structures. The data must meet this set of standards to be considered valid and authoritative for public consumption once it has passed through the pipeline. It is the first wicket on the data's journey toward a pristine state, and it helps maintain data provenance.

Because interpretation by field data collection personnel can be subjective, attribute data requires some attention to detail. Several quality assurance and quality control (QA/QC) methods are applied to ensure that attribute values are logical and consistent. One example is standardizing the value N/A (not applicable) so it is represented consistently: by the time the data reaches the production database, variations such as n/a, Na, or N \ A have been normalized to a single form. A workflow applying regular expressions (RegEx) achieves this particular conditioning standard, as shown in the sketch below. Other rules of this kind condition the data to a standard representation as it moves through our data pipeline.
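A minimal version of that rule, assuming the value arrives as a plain string, might look like this:

```python
# Minimal sketch of one conditioning rule: collapse the many field-entered
# variants of "N/A" (n/a, Na, N \ A, etc.) to a single standard value.
import re

# Case-insensitive match for "n" and "a" separated by optional spaces
# and an optional forward or back slash.
NA_PATTERN = re.compile(r"^\s*n\s*[/\\]?\s*a\s*$", re.IGNORECASE)

def normalize_na(value: str, standard: str = "N/A") -> str:
    """Return the standard token if the value is any variant of N/A."""
    return standard if NA_PATTERN.match(value) else value

assert normalize_na("n/a") == "N/A"
assert normalize_na("Na") == "N/A"
assert normalize_na(r"N \ A") == "N/A"
assert normalize_na("Not applicable") == "Not applicable"
```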

A mature data operation recognizes that there is no “one size fits all” data structure that is applicable across the entire data pipeline. The form of data must be matched to the function being performed. The physical realities of field data collection mean that data structures must be tailored to optimize the efficiency of data collection personnel. As a result, field-collected data must be conditioned for further downstream use.

Our Foresight operations team has learned a lot about data conditioning and the process described here is a constant work in progress. Fulcrum is optimized to make data collection better, rather than merely extending production data structures to the field in less-than-optimal ways. Our Foresight team takes full advantage of that optimization to collect, enrich, and deliver data and insight to our subscribers.

Todd J. Pollard, GISP, co-authored this post.
