On the material qualities of the data

Jørn Knutsen
Explorations of the Seed Vault
Jul 10, 2017

The central material for this project is the database published by the Svalbard Global Seed Vault. The database contains a comprehensive record of all the seed samples stored in the vault, currently about 950,000 samples. Each entry in the database contains information such as the full scientific name of the seed, the number of seeds, which shelf they are on, country of origin, which institute they belong to, their genus, etc.

In our research and teaching we have been advancing the notion of data as a design material. By that we mean that different types of data can be considered a specific type of substrate that affords certain types of uses and forms. Just as different types of wood have different qualities such as density, weight, natural colour, strength, grain and structure (as masterfully highlighted in this project by Siri Yran), and are thereby suitable for different kinds of uses, tools and processes, every set of data has its own grains, constraints and possibilities.

In his seminal blog post, Tom Armitage writes about what it feels like to do material explorations of data, and offers some very helpful perspectives on such explorations. In this post I will try to tease out and discuss some of the material qualities of the SGSV database that we have encountered in our initial explorations.

Static

First and foremost, this is a more or less static material. Every month or so there are new deposits into the vault, and these are reflected in the database. The complete database is, however, only accessible as a downloadable CSV file, meaning that there is no way of querying it remotely and no way of knowing about new deposits without re-downloading the entire database file. (Genesys does have an API through which you can query, but for political reasons it does not contain the complete record of SGSV, nor is it necessarily real time.)

But more importantly, a new deposit to the vault doesn’t really change the nature of the data, nor does it happen very often. For the purpose of this project that means, firstly, that we do not need to work on live data and face the programmatic perils that follow with it. Secondly, the stories that we may want to tell will not be related to the real-time nature of the database, but rather to a specific snapshot in time.
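Should we at some point want to know what a new deposit added, comparing two downloaded snapshots would be enough. The following is a minimal sketch of that idea, not something we have built into the project. The column name accession_number and the two tiny snapshots are made up for illustration, and the sketch assumes that this column uniquely identifies an entry, which may not hold in the real file.

```python
import csv
import io


def read_snapshot(csv_text, key="accession_number"):
    """Index the rows of one CSV snapshot by an identifying column.

    The column name is a hypothetical placeholder, not the real header.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return {row[key]: row for row in reader}


def new_deposits(old_text, new_text, key="accession_number"):
    """Return rows present in the newer snapshot but not in the older one."""
    old = read_snapshot(old_text, key)
    new = read_snapshot(new_text, key)
    return [row for identifier, row in new.items() if identifier not in old]


# Two made-up snapshots standing in for real downloads of the CSV file.
SNAPSHOT_OLD = "accession_number,genus\nA1,Oryza\nA2,Triticum\n"
SNAPSHOT_NEW = "accession_number,genus\nA1,Oryza\nA2,Triticum\nA3,Zea\n"

added = new_deposits(SNAPSHOT_OLD, SNAPSHOT_NEW)
print([row["accession_number"] for row in added])
```

With real files, the two strings would be replaced by the contents of two downloads made at different times.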

Error prone

Every dataset is riddled with its own unique host of flaws, idiosyncrasies and biases. All data goes through a process of capture, entry, organisation, storage and retrieval. Each step in the process may taint the data in different ways.

The Seed Vault’s data is no exception. It contains a number of obvious flaws (and probably some that we do not yet know about) that we need to consider in order to be able to figure out what they imply for what can be inferred, what stories can be told, what can be made and how we may circumvent the errors.

A screenshot of the Svalbard Global Seed Vault data submission template. Retrieved from https://www.nordgen.org/sgsv/index.php?page=depositor_guidelines

Many of the errors that we have encountered in the database so far are, we believe, related to how data is submitted and entered into the database. Depositing into the Seed Vault is a laborious and manual process. You can read more about it in their depositor guidelines. When depositing, the depositor needs to attach an Excel form with data about the submission. These entries are neither sufficiently standardised nor computationally validated, and thus leave room for a fair bit of individual interpretation, basic human failure or sheer laziness in the entry process. For example, some entries will have a very exact number of seeds while others have just an estimate, but there is no way of knowing which is which. Some will have an exact date for regeneration, others just the year, and some none at all.
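One way of getting a feel for this kind of unevenness is to sort the values of a field by how precise they are. The sketch below does this for dates. It assumes that full dates are written as YYYY-MM-DD and that a bare four-digit number is a year; both are assumptions for the sake of illustration, and the sample values are made up rather than taken from the database.

```python
import re
from collections import Counter


def date_precision(value):
    """Classify a raw date field by how much information it carries."""
    value = (value or "").strip()
    if not value:
        return "missing"
    if re.fullmatch(r"\d{4}", value):
        return "year"
    # Assumed format for full dates; the real file may use another one.
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", value):
        return "full date"
    return "other"


# Made-up sample values standing in for one column of the database.
samples = ["2009-03-14", "1987", "", "ca. 1990", None]
counts = Counter(date_precision(value) for value in samples)
print(counts)
```

The "other" bucket is the interesting one: it collects whatever does not fit the expected patterns and is a good place to start looking for entry habits that differ between depositors.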

Moreover, many entries are also lacking certain fields, such as their country of origin. In our initial explorations we have often encountered seeds from the country Unknown (in fact, almost 7% of the seeds do not have a proper country of origin). This may be due to entry or processing mistakes, but it may also be because the information is literally unknown. Many of the seeds in the vault are as old as 40–50 years. It is not unlikely that information has been lost to the passage of time or simply was never recorded in the first place.
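A figure like the one above can be estimated by counting the entries whose country field is empty or holds a placeholder. The sketch below shows the idea on a made-up sample, so its result says nothing about the real database. The column name country and the list of placeholder values are assumptions.

```python
import csv
import io

# Values treated as "no proper country"; this list is an assumption.
PLACEHOLDERS = {"", "unknown", "n/a"}


def share_without_country(csv_text, column="country"):
    """Return the fraction of rows whose country is empty or a placeholder."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    if not rows:
        return 0.0
    missing = sum(
        1
        for row in rows
        if (row.get(column) or "").strip().lower() in PLACEHOLDERS
    )
    return missing / len(rows)


# Made-up sample: two of the four entries lack a proper country.
SAMPLE = "accession_number,country\nA1,Norway\nA2,Unknown\nA3,\nA4,Peru\n"

share = share_without_country(SAMPLE)
print(f"{share:.0%}")
```

Note that this counts entries, not individual seeds. Weighting each entry by its number of seeds would give a different figure.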

How the quality and persistence of the data are shaped through history and socio-cultural practices is an interesting finding and story of the project in itself. Yet, what it entails for us in the design process is that we need to develop a deep material knowledge of the data in order to understand these qualities, and develop strategies to either filter out the errors or highlight them in the final outcomes.

Size

950,000 entries in “the era of big data” is not that big. We are able to download the whole database to our laptops, open it as a spreadsheet and have a look around. This gives us the opportunity to do simple initial investigations and low-fidelity prototyping without firing up a more complex database query language to interact with the data.

Even though the dataset is fairly small, it is still too big to get an overview and understanding of just by glancing at it. It is a size that affords digging around and iterative testing and prototyping.

Screenshot of the database as a spreadsheet. Note the lack of consistency in the date and country fields.

Affordances

In his blog post, Tom Armitage talks about the affordances a data material might have, meaning the obvious hooks it offers into related datasets, applications or processes.

One obvious hook for the Seed Vault’s data is Genesys, a free online global portal that allows the exploration of the world’s crop diversity through a single website. Genesys holds and makes accessible data from almost 500 different holding institutes around the world. Data from the Svalbard Global Seed Vault is not stored in Genesys, but accessions that are stored on Svalbard are often marked as such. However, to further complicate matters, SGSV also stores data from institutes that are not part of Genesys.

By looking up accession ids stored in the SGSV database we are able to find out whether they are stored in Genesys as well (a huge thanks to Matija Obreza from Genesys, who helped make this API endpoint available to us). Genesys is potentially a very rich database that contains detailed information about every accession, down to the minute physical characteristics of the seeds and the latitude and longitude of the accession spot. However, this information, which is quite interesting and usable, is not available for most accessions in the database.

In this post, we have pointed at some of the defining material qualities of the data in the Svalbard Global Seed Vault as we see them. They reflect the purpose and organisational nature of the data. We need to remind ourselves that the purpose of this database is primarily logistical; it exists to give the institute an overview of what is in the vault, who owns what and where it’s placed. Our task is to dig out the hidden stories and make them accessible to a wider audience in new and interesting ways.

Jørn Knutsen

Designer, researcher and educator at the Oslo School of Architecture and Design. Founder of the design studio Voy.