Eight lessons for working with data

Data from the Forest 500 platform.

Denis Alder

Working with data — as many members of the Supply Chain Transparency Network do — can be challenging. A misplaced decimal, a misread cell in an Excel spreadsheet, a change in definitions, disorganised and undocumented workflows: any of these can lead to frustration and hours or even days of wasted work. Global Canopy’s Denis Alder explains how to avoid some of the headaches of working with data.

1. Preparation is key

Time spent in preparation and planning is never misspent. For a data collection exercise, this means reviewing and clarifying the objectives, the definitions of terms, the data sources, the methods of data collection, and the resources needed. Copyright, data protection, confidentiality and ethics must also be considered. The volume and type of data guide the choice of tools. A data pipeline should be sketched out covering acquisition, storage, data security, updating, cleaning, analysis, and the outputs of useful information.
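As a rough illustration (not drawn from the article itself), such a pipeline plan can be written down as a simple, documented structure before any code or forms are built. The stage names, tools and file names below are hypothetical; the point is that each stage, its output and its constraints are stated explicitly in advance.

    from dataclasses import dataclass

    @dataclass
    class PipelineStage:
        """One step in the data pipeline, documented up front."""
        name: str        # e.g. "acquisition", "cleaning"
        tool: str        # software or service used
        output: str      # what this stage produces
        notes: str = ""  # security, updating or licensing considerations

    # A hypothetical plan for a small data-collection exercise.
    pipeline_plan = [
        PipelineStage("acquisition", "web forms / CSV exports", "raw_data/*.csv",
                      "check copyright and data-protection terms"),
        PipelineStage("storage", "PostgreSQL", "raw tables, access-controlled",
                      "back up daily; restrict personal data"),
        PipelineStage("cleaning", "OpenRefine / Python scripts", "clean tables",
                      "log every correction"),
        PipelineStage("analysis", "Python notebooks", "summary tables, charts"),
        PipelineStage("outputs", "report + downloadable tables", "PDF, CSV"),
    ]

    for stage in pipeline_plan:
        print(f"{stage.name:12} -> {stage.output}")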

2. Understand the relation of your data to the whole

The data that is collected has to be related, via a data model, to the whole that is to be reported on: a mathematical description of the gaps to be accounted for, the expansion factors needed, the weights to be applied, and so on. The data model may include extrapolations, interpolations, sampling factors and distributions. It will often be refined and improved as a study progresses and the data sets are analysed, but the basic equations and principles must be devised from the beginning. Collecting data without a carefully worked out and explicit data model is a recipe for wasted time and money, and for meaningless results.
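As a minimal sketch (the figures and variable names are invented, not taken from the article), a data model might state explicitly how a sample of companies is scaled up to a population total, and how sampling error flows through that same calculation:

    import math

    # Hypothetical example: estimating a population total from a sample.
    sample_values = [12.0, 9.5, 14.2, 11.8, 10.6]   # measured units per sampled company
    n_sampled = len(sample_values)
    n_population = 500                               # companies the estimate must cover

    expansion_factor = n_population / n_sampled      # weight applied to each observation
    sample_mean = sum(sample_values) / n_sampled
    estimated_total = sample_mean * n_population

    # Sampling error propagates through the same model, giving a confidence interval.
    variance = sum((x - sample_mean) ** 2 for x in sample_values) / (n_sampled - 1)
    std_error_total = n_population * math.sqrt(variance / n_sampled)

    print(f"Estimated total: {estimated_total:.0f} +/- {1.96 * std_error_total:.0f} (95% CI)")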

3. Be aware of data types and structures, and how to handle them

Data can be obtained in myriad forms. They may be measurements; counts; names for specific attributes or qualities drawn from a pre-set list; lists of companies, places, or people's names; natural language texts; tagged text such as a web page; or tables in row and column format. Data may be collected as a one-off, periodically over weeks, months or years, or continuously from a feed or sensor, such as satellite data or weather instruments. Data can be spatial (GPS or map data), temporal (dates, times), or multidimensional, combining several types into one data object. All these types need different tools to collect, store, check and use them. Using the wrong tools results in backlogs, cut-offs, data loss, error and confusion.
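A small sketch of this point, assuming the pandas library is available and using invented column names: each column is given the type and handling it needs, rather than being left as undifferentiated text.

    import pandas as pd

    # Hypothetical records mixing several data types in one table.
    records = pd.DataFrame({
        "company":   ["Alpha Ltd", "Beta SA", "Gamma Inc"],          # names
        "commodity": ["palm oil", "soy", "timber"],                   # values from a pre-set list
        "volume_t":  ["1200", "850", "n/a"],                          # measurements, arrived as text
        "assessed":  ["2023-05-01", "2023-06-15", "2023-07-30"],      # dates
        "latitude":  [1.35, -23.55, 60.17],                           # spatial coordinate
    })

    # Each type gets the treatment it needs: categories, numbers, dates.
    records["commodity"] = records["commodity"].astype("category")
    records["volume_t"] = pd.to_numeric(records["volume_t"], errors="coerce")  # "n/a" -> NaN
    records["assessed"] = pd.to_datetime(records["assessed"])

    print(records.dtypes)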

4. Data entry and cleaning are crucial phases for both cost and accuracy

Data must be validated and checked as it is entered. This step is often skipped, resulting in large data sets full of errors that must be painstakingly cleaned up afterwards. Some errors will always arise, however, so a cleaning method is needed that detects errors logically and marks or corrects them automatically. Free tools such as OpenRefine (originally developed by Google) are very useful, but often bespoke apps will need to be developed. Manual corrections are slow and costly, and should be avoided if at all possible. Some errors can also be handled implicitly in the data model, as occurs with statistical sampling, where results are presented as an average with a confidence interval rather than as a single estimate.
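As one possible sketch of automatic checking (the validation rules, thresholds and column names here are invented for illustration), incoming rows can be flagged with the reason they fail, so corrections are queued rather than made silently by hand:

    import pandas as pd

    def validate(df: pd.DataFrame) -> pd.DataFrame:
        """Flag rows that fail simple logical checks instead of silently accepting them."""
        problems = pd.Series("", index=df.index)

        # Hypothetical rules: allowed categories, plausible ranges, required fields.
        allowed = {"palm oil", "soy", "timber", "cattle"}
        problems[~df["commodity"].isin(allowed)] += "unknown commodity; "
        problems[(df["volume_t"] < 0) | (df["volume_t"] > 1e6)] += "implausible volume; "
        problems[df["company"].isna() | (df["company"].str.strip() == "")] += "missing company; "

        df = df.copy()
        df["issues"] = problems
        return df

    raw = pd.DataFrame({
        "company":   ["Alpha Ltd", "", "Gamma Inc"],
        "commodity": ["palm oil", "soyy", "timber"],
        "volume_t":  [1200.0, 850.0, -5.0],
    })

    checked = validate(raw)
    print(checked[checked["issues"] != ""])   # rows queued for correction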

5. Use an appropriate data management system

The accessibility of spreadsheet tools such as Microsoft Excel means they are often used as the first step in collecting data. Some common bad practices become increasingly problematic as the dataset grows in size and complexity. These include:

● Inconsistent and disorganised file and directory names.

● Multiple copies of the same data in different versions of uncertain provenance.

● Lack of checking of data as it is entered or pasted in.

● Mixing data with analysis, so that calculated results cannot be distinguished from original data.

● Pasting values over original formulas, hiding processing steps.

● No commenting or documentation of the processes performed on the data.

● Mistakes in the range references of formulae, which lead to wrong results and commonly creep in as ranges are sorted, moved, inserted or deleted.

Spreadsheet tools are therefore problematic for large datasets, and as a project scales up there is usually an evolution from spreadsheet-based systems to database packages such as PostgreSQL or Microsoft SQL Server. This is driven by the mounting cost of errors and manual data handling in spreadsheet systems. The move requires an investment in specialist skills, but this is offset by the resulting efficiencies.
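As a hedged sketch of what such a move can look like in practice (the connection details, file name and column names are hypothetical, and pandas, SQLAlchemy and a PostgreSQL driver are assumed to be installed), a spreadsheet is loaded once into a named database table, which then becomes the single source of truth that analyses query:

    import pandas as pd
    from sqlalchemy import create_engine

    # Hypothetical connection details and file name.
    engine = create_engine("postgresql://analyst:password@localhost:5432/supplychain")

    # Read the spreadsheet once, then treat the database as the single source of truth.
    assessments = pd.read_excel("company_assessments.xlsx", sheet_name="2023")
    assessments.to_sql("assessments_2023", engine, if_exists="replace", index=False)

    # Calculated results stay separate from the original data.
    summary = pd.read_sql(
        "SELECT commodity, COUNT(*) AS companies, AVG(score) AS mean_score "
        "FROM assessments_2023 GROUP BY commodity",
        engine,
    )
    print(summary)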

6. Reduce the data to simple, clear and relevant information

Massive data sets with vastly complex structures are of no use in themselves. The data must be reduced to summary tables and infographics, with key messages highlighted in clear language. There is a data pipeline from raw data capture, cleaning and storage through to analysis and final outputs. As outputs are periodically updated, this analysis phase will need to be re-run. It can be done with standard packages such as Tableau, or with bespoke macros or apps using common statistical, graphical and mapping tools.
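A minimal sketch of such a re-runnable analysis step, using invented data and assuming pandas and matplotlib are available: many rows are reduced to one clear figure per commodity, and the same script can be run again whenever the underlying data is updated.

    import pandas as pd
    import matplotlib.pyplot as plt

    # Hypothetical cleaned dataset; in practice this would be read from the database.
    clean = pd.DataFrame({
        "commodity": ["palm oil", "palm oil", "soy", "soy", "timber"],
        "has_policy": [True, False, True, True, False],
    })

    # Reduce many rows to one clear message per commodity.
    summary = (clean.groupby("commodity")["has_policy"]
                    .mean()
                    .mul(100)
                    .round(1)
                    .rename("% of companies with a policy"))
    print(summary)

    # Re-run this script whenever the underlying data is updated.
    summary.plot(kind="bar", ylabel="% with policy", title="Commodity policies")
    plt.tight_layout()
    plt.savefig("commodity_policies.png")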

7. Provide outputs in the form most useful and relevant to your audience

The target audiences for the information need to be understood and engaged, so that what is produced is relevant to their needs. There may be several levels of audience, requiring different outputs, from short, simple, general reports with clear messages to data-rich products such as thematic maps and large, downloadable tables of statistics.

8. Document, explain, annotate

Within each of the previous seven steps, documentation, written methodologies, and commenting of workbook cells and computer code must be prioritised, and time allowed for this. The worst practice is to rush through a data project without documenting the steps and methods. This is all too common, and it leads to systems that break down when key staff leave.
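As a small illustration of what this can look like in code (the function, its parameters and the methodology note it cites are all hypothetical), a processing step documented in place explains its method, its inputs and its assumptions, so it can still be understood after the original author has moved on:

    def expansion_factor(n_population: int, n_sampled: int) -> float:
        """Weight applied to each sampled observation to represent the whole population.

        Method: simple random sampling without replacement, as set out in the
        project's (hypothetical) methodology note, section 2.

        Args:
            n_population: number of units the estimate must cover.
            n_sampled: number of units actually measured.

        Returns:
            The factor by which sample totals are scaled up.
        """
        if n_sampled <= 0:
            raise ValueError("n_sampled must be positive")
        return n_population / n_sampled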

Bearing these eight points in mind can save a great deal of time, wasted effort and indeed, distress and regret. The old proverb “More haste, less speed” applies as much to the management of data as it does to most other fields of human endeavour.