The Dark Side Of Data

Alan Pointing
TotalEnergies Digital Factory
8 min read · Nov 22, 2022

The advancement in technology and the drive to digitally transform private and public companies is powering an exponential explosion in the volume of data.

According to statista.com, this year the world is expected to generate 97 zettabytes (or 97 trillion gigabytes) of data, and by 2025 this could nearly double to 181 zettabytes.

How big is a zettabyte compared to other units?

Lurking behind these statistics is a worrying concept that will only get (exponentially) worse as the volume of data being collected and generated grows. This is the concept of “dark data”.

Dark data is defined as “the information assets organizations collect, process and store during regular business activities, but generally fail to use for other purposes (for example, analytics, business relationships and direct monetizing)” (Ref. 1). Examples include transaction data that is never analysed or has no further use; computer logs; historical spreadsheets and other working documents from past projects; data from Internet of Things sensors that is never used or will never be analysed again; and, on a lighter note, your own photos that you never want to see again!

Dark data also includes lost data

In my personal view it also includes data that has lost its context and is effectively buried in personal computers or business storage repositories. Without that context it cannot be found, and if it is stumbled upon by accident it cannot be relied upon, because its source and usefulness are unknown.

The costs and energy consumption of dark data

This is worrying because all of this data sits on IT infrastructure that requires energy to run. In 2020, digitalization was estimated to account for 4% of global greenhouse gas emissions, and by 2025 data centres alone could consume as much as a fifth of global electricity (Ref. 2). Some of this energy is spent storing dark data that is lost, will never be used again, or has no usefulness at all.

Another important overhead that people often forget about is the extra data stored to protect against hardware failures or for disaster recovery. For example, on cloud platforms data will be backed up or replicated in at least one other location, and separate disaster recovery sites may hold a full copy of the data. This all adds to the volume (and cost) of data being stored.

As well as being wasteful in costs for companies and individuals alike, maintaining this organisational or personal memory will be a hidden challenge in the drive to achieve net zero and reduce carbon footprints.

Why hasn’t the problem of dark data received the attention it deserves, or a place in companies’ IT climate-change policies and programmes? Maybe it is the exponential nature of the growth: it has been creeping up on us for years and is only now reaching a level that cannot be ignored. Maybe it is the way we view digitisation, usually in a positive light, as a process of improvement and better (cost) efficiency overall, rather than as a source of growing digital costs. Or maybe it is the increasing complexity of digital solutions, such as mixed cloud and on-premises IT infrastructures, which makes digital costs fragmented and difficult to calculate and reconcile across platforms. Finally, it could simply be that companies have no way of telling how big the problem is, or whether there is a problem at all.

A practical example

To illustrate the problem of dark data, I will go back to my involvement in a data migration project to transfer the historical data of a company that had been acquired by another. The acquired company had data going back 50 years, collected and generated by hundreds of employees and its operations, spanning different phases of the company’s growth and various geographical locations.

The migration involved 2.5 petabytes of data and over 320 million files. The project took over two years and was phased according to the importance of the data. The most important data was transferred first: the active operational and project data that had to be carried on by the acquiring company. This was relatively straightforward to find and transfer, because the people using it had been retained by the acquiring company and knew where it was and what to move. Much of it was also structured data, belonging to applications and databases that were easily incorporated into the new IT environment.

What remained was structured data that no one knew about, because the employees who did know had left, and unstructured data with little or no context. For the unstructured data, attempts were made to extract metadata from keywords, filenames and folder names to add some context, supplemented by people’s individual knowledge of previous campaigns, projects and geographical locations, in order to decide what to do with it.
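As a rough illustration of this kind of filename and folder-name mining (the patterns, project codes and paths below are hypothetical, not the rules used on the actual project), a minimal sketch might look like this:

```python
import re
from pathlib import Path

# Hypothetical patterns: a four-digit year anywhere in the path and a
# project code such as "PRJ-1234"; a real project would use site-specific rules.
YEAR_RE = re.compile(r"(19|20)\d{2}")
PROJECT_RE = re.compile(r"PRJ-\d{3,5}", re.IGNORECASE)

def extract_path_metadata(path: Path) -> dict:
    """Derive rough context for a file from its name and folder names."""
    text = str(path)
    year = YEAR_RE.search(text)
    project = PROJECT_RE.search(text)
    return {
        "file": path.name,
        "extension": path.suffix.lower(),
        "year": year.group(0) if year else None,
        "project": project.group(0).upper() if project else None,
        "top_folder": path.parts[0] if path.parts else None,
    }

if __name__ == "__main__":
    sample = Path("LegacyShare/PRJ-0042/seismic_survey_1998/report_final.pdf")
    print(extract_path_metadata(sample))
    # {'file': 'report_final.pdf', 'extension': '.pdf', 'year': '1998',
    #  'project': 'PRJ-0042', 'top_folder': 'LegacyShare'}
```

Even crude heuristics like these can recover enough context to decide whether a file is worth migrating, archiving or deleting.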

Dark data that couldn’t be identified in the company being acquired

By the end of the project, only 1.5 of the possible 2.5 petabytes of data had been migrated, and only 160 million of the potential 320 million files.

The volume of data that was not transferred can be seen as a representation of the acquired company’s dark data. If we estimate that this 1 petabyte of dark data built up over 10 years, it amounts to an estimated (lost) IT cost of around 1.5 million USD, not including backup and disaster recovery costs. In addition, storing this dark data is estimated to have generated 10,400 tons of CO2 over those 10 years (Ref. 3)!
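For readers who want to sanity-check the cost figure, here is a back-of-the-envelope calculation; the per-gigabyte rate is my own assumption (roughly the order of magnitude of general-purpose cloud object storage at the time), not a figure from the project:

```python
# Back-of-the-envelope storage cost for 1 PB of dark data kept for 10 years.
# The rate below is an assumption, not an actual negotiated price.
dark_data_gb = 1_000_000            # 1 petabyte, in (decimal) gigabytes
cost_per_gb_month = 0.0125          # assumed USD per GB per month
months = 10 * 12

total_cost = dark_data_gb * cost_per_gb_month * months
print(f"Estimated storage cost: {total_cost:,.0f} USD")  # ~1,500,000 USD
```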

At the heart of the matter

In order to identify and effectively manage dark data and its impact, all data should be assigned a context and a classification at the start of its lifecycle (the collection and generation stages), and that context should be updated throughout the later lifecycle stages (e.g. processing and storing).

By adding contextual information and metadata to data objects we can then make more informed decisions about the source of the data, its usefulness and how long it needs to be kept.

Moreover, context and metadata are key to searching for data and navigating to the right data within “the noise”, so that its value can be realised.
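As a minimal sketch of what assigning context at the start of the lifecycle can look like in practice (the field names, classification values and default retention period below are illustrative assumptions, not a standard), see the example that follows:

```python
from dataclasses import dataclass, field
from datetime import date, timedelta

# Illustrative classification levels and retention rules; real policies
# would come from the organisation's data governance standards.
@dataclass
class DataAssetContext:
    name: str
    owner: str                      # accountable person or team
    source: str                     # where the data was collected or generated
    classification: str             # e.g. "public", "internal", "confidential"
    created: date = field(default_factory=date.today)
    retention_days: int = 365 * 5   # assumed default retention period

    def expires(self) -> date:
        return self.created + timedelta(days=self.retention_days)

# Context is captured when the data is created, not reconstructed years later.
asset = DataAssetContext(
    name="well_sensor_readings_2022.parquet",
    owner="operations-data-team",
    source="IoT gateway, site A",
    classification="internal",
)
print(asset.name, "can be reviewed for deletion after", asset.expires())
```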

Tools for cataloguing and classification

In the past, tagging and adding metadata to data has been a laborious and thankless task, with relatively few tools and metamodels to assist in the process. The vast amounts of data being collected and generated today mean that these methods are no longer sustainable, and more automatic and smarter ways of classifying data must be found and implemented. Data management platforms, data cataloguing solutions and other solutions based on the concepts of DataOps are being challenged to keep up with this data explosion.

On cloud data platforms, native data cataloguing solutions exist to identify, track and manage data objects arriving on the platform. Metadata can be attached to objects, which can then be searched for by their tags and managed according to the stage of their data lifecycle, for instance deleting objects once a retention-period tag expires. Appropriate tags could also be used to archive data or move it to lower-cost storage tiers or media.
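To make this concrete, here is a hedged sketch of a tag-driven lifecycle rule on one cloud platform (AWS S3 via boto3); the bucket name, tag key and retention periods are assumptions chosen for illustration, not a recommended policy:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical rule: objects tagged retention=short-term move to cheaper
# storage after 90 days and are deleted after roughly 5 years.
s3.put_bucket_lifecycle_configuration(
    Bucket="example-data-platform-bucket",   # assumed bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "short-term-retention",
                "Status": "Enabled",
                "Filter": {"Tag": {"Key": "retention", "Value": "short-term"}},
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                ],
                "Expiration": {"Days": 1825},
            }
        ]
    },
)
```

Rules like this let the platform enforce retention automatically, rather than relying on individuals to remember to clean up.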

The automation of cataloguing

As well as native tools, many third-party data cataloguing solutions exist to assist in the identification and classification of data objects. Until now, most of these tools have given a static view of the data objects, built through one-off or regular scanning, or even manual population. The usefulness of this static view has been called into question, as it is difficult to sustain for large numbers of data objects and does not always add value to the day-to-day operations of the people using the data.

Now a more active view of metadata is being advocated, using the concepts of DataOps: the identification of data objects in their associated dataflows / data pipelines is captured dynamically and their status monitored in near real time (data observability), helping to identify dataflow and data quality problems.
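As a small illustration of the idea (not any particular product’s API), a pipeline step can emit its own metadata as it runs, so the catalogue reflects live dataflows rather than periodic scans; the step name and event fields below are made up for the example:

```python
import time
from datetime import datetime, timezone

# In a real DataOps setup these events would be sent to a catalogue or
# observability backend; here they are simply printed.
def observed(step_name):
    """Decorator that records when a pipeline step ran and how much data it produced."""
    def wrap(func):
        def inner(records):
            start = time.perf_counter()
            result = func(records)
            event = {
                "step": step_name,
                "ran_at": datetime.now(timezone.utc).isoformat(),
                "rows_in": len(records),
                "rows_out": len(result),
                "duration_s": round(time.perf_counter() - start, 4),
            }
            print("observability event:", event)
            return result
        return inner
    return wrap

@observed("drop_empty_readings")
def drop_empty_readings(records):
    return [r for r in records if r.get("value") is not None]

drop_empty_readings([{"value": 1.2}, {"value": None}, {"value": 3.4}])
```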

The human side

As well as the technology aspects of dealing with dark data, we cannot forget the human and cultural side of how people handle data. Our natural tendencies are to hoard data in case it could be useful in the future, to keep it and even make copies of it in case someone else loses or corrupts it, or to retain it because it gives us a sense of well-being, i.e. “knowledge is power”. These individual tendencies are hard to change and, when scaled up to a whole company, are a formidable challenge to overcome.

Our tendency to hoard data

Companies need to adopt cultural change programmes that teach employees how to handle data properly, alongside clear and practical policies and standards for people to follow. An example of best practice in this area is to organise regular data cleaning campaigns or days, which are great events for raising awareness and giving employees dedicated time to clean the data they own or know about.

Wrapping it up

It is often said that data is a responsibility for all of us, as we are all consumers and producers of data in our private and working lives.

We need to reduce our data, cost and CO2 footprints

So, by inference, we all have a personal responsibility to minimise our data footprint by adopting good data practices, and in so doing minimise the cost and carbon footprint of dark data, for ourselves and for our companies. Hopefully, the approaches and methods described above will be effective in reducing the amount of dark data, so that the hidden force of dark data is no longer a strong force, but a weak one.

Bibliography

  1. https://www.gartner.com/en/information-technology/glossary/dark-data
  2. ‘Dark data’ is killing the planet — we need digital decarbonisation (theconversation.com)
  3. Costs of digitalisation to society, industry and the environment — Digital Decarbonisation
