Green Coding 2/8: Principles of Sustainable Data Management

Thierno Diallo
Published in Just-Tech-IT
8 min read · May 13, 2024

As promised in the first article, which introduced green coding and its principles, here is the second article in this series of eight. Given the ever-growing presence of data and its use in our daily lives, addressing how we manage and process data is a necessary step toward greater sustainability.

In the following article, we will briefly discuss why a sustainable approach to data management is necessary, the areas this approach covers, and the practices that allow us to move toward it.

Once again, I want to clarify that even though my examples are written in Java and .NET, the principles and concepts are completely language-agnostic. They are applicable everywhere.

1. Why We Need to Manage Our Data Sustainably

It is no surprise that sustainability and the reduction of environmental impact are the primary motivations. Below is an article forecasting data volume growth between 2010 and 2025: Data growth worldwide 2010–2025 | Statista.

However, it is not limited to that: there is also an economic aspect, namely reducing storage and processing costs. I am sharing the following articles on the economic costs of hosting:

Green IT: How to make your IT infrastructure more sustainable — Leasétic (leasetic.fr)

Reduce IT infrastructure expenses: boost efficiency today! (finmodelslab.com)

The complete guide to cloud storage pricing | Veritas

Microsoft Word — gouverancedesdonnéesvf.rtf (core.ac.uk)

A portion of these savings could be reinvested in strengthening data security, privacy, and integrity through the implementation of high-quality, standards-compliant infrastructure.

We can also mention compliance with ethical and regulatory standards for data storage and processing.

All of these benefits combined contribute to creating a more sustainable ecosystem in the long term.

2- Fields Covered in This Article

In this article, we will mainly cover sustainable management practices in two key areas: storage and infrastructure, and data processing.

a- Storage and Infrastructure

i- Choice

The choice of location and type of data hosting is a critical decision that should not be taken lightly.

It has a significant influence not only on system performance, availability, and security, but also on the system’s environmental impact.

For a choice to be relevant and efficient, it is essential to carefully analyze and assess the need beforehand.

For instance:

  • The type and level of data confidentiality to be stored: this guides the decisions on where and how to store and secure the data (On-Premises, Public/Private Cloud, or Hybrid);
  • The workload associated with processing this data: addressing the issues of database sizing and power;
  • The type and frequency of access to this data:

Cold Access: infrequent and slightly slower access. This type of storage is more cost-efficient and resource-friendly, but it should only be used for data that is not needed in real time.

Hot Access: fast and frequent access. This is a more expensive choice economically, and it consumes more resources because of the components used. It should be chosen only if your system requires fast and regular access to the data, for example for real-time or near-real-time needs.

  • The data retention period;
  • The tolerance level for failures: this raises the subject of which resilience patterns to implement.

ii- Best Practices

To ensure security, reliability, performance, as well as economic and environmental efficiency, it is crucial to implement best practices for data storage.

Here are some recommendations:

  • Determine in advance the type of data to be stored, its volume, sensitivity, and update frequency.
  • Choose the storage solution according to the need:

RDBMS: for structured data

NoSQL: for semi-structured data and horizontal scaling

Object Storage: for binary files and objects

  • Regular backup practices (while not forgetting to delete old backups) and archiving;
  • Define the data retention duration and policy (application and logs);

This allows data to be purged, freeing up storage, reducing the amount of data to process, and consuming fewer resources (memory/network).

  • Optimize indexing;

This speeds up processing and consumes fewer resources (CPU/memory).

  • Scale the infrastructure: for highly costly mass processing, opt for increased power as needed and automatic scaling down at the end of processing;

This allows us to use only the resources we need.

  • Configure the appropriate log level;

On log collection and query systems like Application Insights, billing is based on the volume of logs written. Having an inappropriate log level (such as debug left on in production) is therefore not only useless, but also a disaster in economic and environmental terms.
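As an illustration, here is a minimal Logback configuration (a common logging setup in Java stacks) keeping a production system at the INFO level:

```xml
<!-- logback.xml: leaving DEBUG enabled in production inflates log ingestion costs -->
<configuration>
  <appender name="CONSOLE" class="ch.qos.logback.core.ConsoleAppender">
    <encoder>
      <pattern>%d{HH:mm:ss.SSS} %-5level %logger{36} - %msg%n</pattern>
    </encoder>
  </appender>
  <root level="INFO">
    <appender-ref ref="CONSOLE"/>
  </root>
</configuration>
```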

b- Data Processing

i- Data Compression

The practice of compressing data before backup is a highly effective approach.

This method reduces the data size, consequently decreasing the storage space used, as well as the network bandwidth required to transmit it. However, compression alone is not enough; it must be done efficiently.

Below is an example of two different levels of compression efficiency in Java.
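A minimal sketch using java.util.zip.Deflater to compare two compression levels (the payload is illustrative):

```java
import java.util.zip.Deflater;

public class CompressionLevels {

    // Compresses the input at the given level and returns the compressed size in bytes.
    static int compressedSize(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        byte[] buffer = new byte[4096];
        int size = 0;
        while (!deflater.finished()) {
            size += deflater.deflate(buffer);
        }
        deflater.end();
        return size;
    }

    public static void main(String[] args) {
        byte[] data = "some fairly repetitive log line\n".repeat(10_000).getBytes();

        // BEST_SPEED trades compression ratio for CPU time;
        // BEST_COMPRESSION trades CPU time for storage and bandwidth.
        System.out.println("BEST_SPEED:       " + compressedSize(data, Deflater.BEST_SPEED) + " bytes");
        System.out.println("BEST_COMPRESSION: " + compressedSize(data, Deflater.BEST_COMPRESSION) + " bytes");
    }
}
```

Which level is "efficient" depends on the context: for cold archives, a higher compression level usually pays off; for hot, frequently rewritten data, a faster level may consume less energy overall.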

ii- Deduplication

To reduce the need for storage, CPU processing time, disk I/O, and our energy consumption, we can implement deduplication mechanisms that avoid storing identical data in different locations. In other words, it is the process of removing duplicates or redundant data.

Deduplication can be performed in various ways, depending on the requirements.

Here are a few simple examples of deduplication in Java.

Example 1:
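A minimal sketch of a hand-rolled pass that keeps only the first occurrence of each value (the values are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

public class DeduplicationExample {
    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "a", "c", "b");

        // Keep each value only the first time it is seen.
        List<String> deduplicated = new ArrayList<>();
        for (String value : input) {
            if (!deduplicated.contains(value)) { // linear lookup: fine for small lists only
                deduplicated.add(value);
            }
        }

        System.out.println(deduplicated); // [a, b, c]
    }
}
```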

Example 2: by using appropriate data structures, in this case Sets, we can easily eliminate duplicates.
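A minimal sketch:

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class SetDeduplicationExample {
    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "a", "c", "b");

        // A LinkedHashSet discards duplicates while preserving insertion order.
        Set<String> deduplicated = new LinkedHashSet<>(input);

        System.out.println(deduplicated); // [a, b, c]
    }
}
```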

Data deduplication can be used for file processing as well as for database management, to name just two applications.

It’s worth noting that the use of the Stream API also allows for deduplication through its distinct() method.
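For instance:

```java
import java.util.List;

public class DistinctExample {
    public static void main(String[] args) {
        List<String> input = List.of("a", "b", "a", "c", "b");

        // distinct() drops duplicates lazily, based on equals()/hashCode().
        List<String> deduplicated = input.stream()
                .distinct()
                .toList();

        System.out.println(deduplicated); // [a, b, c]
    }
}
```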

For more complex needs, we can also mention third-party libraries such as ‘Apache Commons Collections’ or ‘Google Guava’, which can do a great job.

Limitations of deduplication

Among the risks, we can cite potential data loss: since the data is no longer duplicated, losing or corrupting the single stored copy affects everything that references it.

However, we should also consider the resources (CPU/RAM) used for this processing.

It should therefore be used only after analysis, in order to choose the right type of deduplication and the right frequency.

iii- Intelligent Caching

This mechanism involves temporarily storing data in memory, close to the consumer and in an organized manner, in order to speed up retrieval.

The significant advantage of intelligent caching lies in its ability to automatically manage data updates when there are modifications, or to periodically refresh to avoid using outdated data.

Caching also improves performance by avoiding the repetition of costly calculations and by reducing the load on data sources, and consequently network traffic as well.

Example: we will illustrate a simple case of intelligent caching in Java using the Caffeine library.
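A minimal sketch, where the loader and the tuning values are illustrative:

```java
import com.github.benmanes.caffeine.cache.Caffeine;
import com.github.benmanes.caffeine.cache.LoadingCache;

import java.time.Duration;

public class ProductCache {

    // Hypothetical loader: in a real system this would hit the database or an API.
    static String loadFromDatabase(String id) {
        return "product-" + id;
    }

    public static void main(String[] args) {
        LoadingCache<String, String> cache = Caffeine.newBuilder()
                .maximumSize(10_000)                      // bound memory usage
                .expireAfterWrite(Duration.ofMinutes(10)) // upper bound on staleness
                .refreshAfterWrite(Duration.ofMinutes(5)) // refresh hot entries in the background
                .build(ProductCache::loadFromDatabase);

        // The first call loads from the source; subsequent calls hit the cache.
        System.out.println(cache.get("42"));
        System.out.println(cache.get("42"));
    }
}
```

Here, refreshAfterWrite keeps frequently accessed entries up to date without blocking readers, while expireAfterWrite guarantees that stale data is eventually evicted.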

Limitations of Intelligent Caching

While caching has numerous advantages, it also has some disadvantages, or at least some challenges, which need to be acknowledged in order to better address them.

Among the potential challenges, we can mention the complexity of managing errors and expirations, managing cache saturation and data obsolescence, and network traffic peaks during retrievals.

iv- Stream Processing

When we need to process a large dataset, using the Stream API is more suitable than using a traditional loop.

The advantage comes from the fact that the Stream API exposes a number of operations such as ‘filter, map, reduce, sum, etc…’ that are highly optimized for mass processing.

It’s also worth mentioning that Streams in Java use a lazy processing approach, meaning that elements are processed on demand.

Streams also implement ‘AutoCloseable’, which ensures that resources are released when processing is completed (typically via try-with-resources when the stream wraps an I/O source).

Example 1: using the Stream API on objects
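A minimal sketch (the Order record and its values are illustrative):

```java
import java.util.List;

public class StreamOnObjects {

    record Order(String customer, double amount) {}

    public static void main(String[] args) {
        List<Order> orders = List.of(
                new Order("alice", 120.0),
                new Order("bob", 80.0),
                new Order("alice", 40.0));

        // filter, map and sum are fused into a single lazy pipeline:
        // each element is visited once, with no intermediate collections.
        double total = orders.stream()
                .filter(o -> o.customer().equals("alice"))
                .mapToDouble(Order::amount)
                .sum();

        System.out.println(total); // 160.0
    }
}
```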

Example 2: using the Stream API on a data file
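A minimal sketch (the file name and its content are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

public class StreamOnFile {
    public static void main(String[] args) throws IOException {
        // Files.lines reads lazily; try-with-resources (AutoCloseable)
        // guarantees the file handle is released when processing ends.
        try (Stream<String> lines = Files.lines(Path.of("data.csv"))) {
            long errorCount = lines
                    .filter(line -> line.contains("ERROR"))
                    .count();
            System.out.println(errorCount);
        }
    }
}
```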

Limitations of the Stream API

For very large datasets or for large distributed processing, it will be necessary to consider frameworks or libraries such as ‘Apache Hadoop or Apache Spark’ which, by their design, are capable of efficiently handling massive volumes of data.

v- Query Optimization and Adding Indexes in Databases

For all systems working with a database, performance is strongly impacted by the database’s performance and the quality of the network.

While an excellent network can improve system performance, adopting practices that enhance database processing performance is just as essential.

Among the performance improvement practices, we can mention adding indexes to the database, regular database maintenance, and query optimization.

Tools exist that can suggest database optimizations, such as indexes to create, based on the analysis of the queries they see passing through and their execution plans.

Indexing:

Below are two examples of adding indexes, first with plain SQL and then with an ORM. In both cases, we create an index on the ‘name’ column.
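A minimal sketch, assuming a hypothetical customer table and, for the ORM variant, JPA/Hibernate annotations:

```sql
-- Plain SQL: create an index on the "name" column
CREATE INDEX idx_customer_name ON customer (name);
```

```java
// JPA/Hibernate: declare the index on the entity; the schema generation
// tooling creates it in the database.
import jakarta.persistence.Entity;
import jakarta.persistence.Id;
import jakarta.persistence.Index;
import jakarta.persistence.Table;

@Entity
@Table(name = "customer",
       indexes = @Index(name = "idx_customer_name", columnList = "name"))
public class Customer {

    @Id
    private Long id;

    private String name;
}
```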

Query Optimization:

There are always multiple ways to write a query, which raises the question of the intrinsic performance of the query we write, just as for any algorithm. The queries we write therefore carry a great deal of responsibility for the performance of our system.

Example:
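A sketch of the same lookup written two ways, reusing the hypothetical customer table and its index on ‘name’ (and assuming consistently cased data):

```sql
-- Less efficient: SELECT * fetches columns we do not need, and the function
-- applied to "name" prevents the index from being used.
SELECT * FROM customer WHERE UPPER(name) = 'DUPONT';

-- More efficient: only the required columns, with a predicate
-- that lets the database use idx_customer_name.
SELECT id, name FROM customer WHERE name = 'Dupont';
```

The execution plan is the best judge here: comparing plans before and after such a rewrite shows whether the index is actually used.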

3- Conclusions

Sustainable data management throughout the entire lifecycle, from processing to hosting and dissemination, is a highly viable long-term approach: it reduces environmental impact, lowers infrastructure costs, ensures data reliability, security, and confidentiality, supports compliance with ethical standards and regulations, and makes effective use of the resources at our disposal.

In my view, sustainable data management is crucial for designing efficient, frugal, and environmentally responsible information systems.

While awaiting the next article, I would like to share with you some initiatives in the field of GreenIT. Feel free to read, comment, experiment, and provide feedback.

I invite you to join me in a month for episode 3, where we will explore various tools that allow us to measure and quantify the green aspect of our software, and identify areas for improvement.

Thierno Diallo
Just-Tech-IT

Staff Engineer/Technical Leader and Green Champion at AXA France