Prioritizing Data-Oriented Design Paradigms in Your Code
A more mindful approach to software design
When software developers consider what good software design looks like when working with high-level programming languages, many minds instinctively wander toward the guiding principles of object-oriented programming (OOP).
We lean so heavily on this paradigm not only because most of us were taught this way, but also because an object-oriented approach — structuring code based on objects in our perceived model of the real world — is an intuitive and generally good way to design software when working with object-oriented programming languages. Encapsulation, abstraction, inheritance, and polymorphism are four very important principles in software design that work nicely within the object-oriented paradigm while providing high levels of code reusability and maintainability. The merits of object-oriented programming are numerous and well-documented.
While there is nothing inherently wrong with object-oriented design, developers can program themselves into a box because they have constrained themselves within the dimensions of an object-oriented world. Code that is modeled well from an object-oriented perspective can actually disguise the fact that the code is not solving the problem at hand in an optimal way, ignoring the purpose behind writing the code in the first place. Taking an approach centered around how the code is really working to transform program data — a data-oriented approach — can help programmers avoid the pitfalls that object-oriented code can conceal.
An Introduction to Data-Oriented Design (DOD)
When taking a strictly object-oriented approach to writing software, we tend to lose sight of the goals underlying our code.
Code is a step-by-step guide through which humans can explain to a particular computer system how to solve a given problem in an efficient way. Some problems are obviously more complicated than others, but almost all problems solved by software boil down to the manipulation of data from one form into another. Code is merely the tool that programmers use to solve this data transformation problem. To solve this problem in the most optimal way, programmers must have an understanding of how their code affects the hardware platform that they are targeting. This part is often ignored by programmers in favor of prioritizing an object-oriented paradigm.
A common example of this is how programmers write their code without consideration for how their program is utilizing the CPU cache. Caches allow a computer to take advantage of data locality to greatly increase memory-access efficiency; however, if the programmer does not write their program in consideration of data locality, they are missing out on this key performance boost.
Data-oriented design is the idea of writing code to create data structures optimized for efficient processing rather than writing code to structure data around real-world objects. This means that data and functionality are separated so that functions can act on data in a more general form. This makes it easier to optimize performance through cache utilization and parallelization.
For most applications, these performance advantages are relatively minor, but for problems where speed is of the essence, a data-oriented approach to problem solving is paramount. But what do these ideas look like in real code?
Data-Oriented Design: Example in C++
(The code I wrote for this example is available here on Github. Please note that it is not intended to be used for anything other than to showcase differences between software design paradigms)
This example looks at two different implementations of a simple stock portfolio management system in C++ . The first code snippet shows a class structure consistent with object-oriented design principles.
The key data member in this class is the stock portfolio which consists of a vector of stock position structures. Each stock position consists of a stock symbol, the current value of a share of that stock, the number of shares owned, and the average cost paid for each stock. Modeling a stock portfolio as a list of stock position models is a very intuitive way to structure this code. However, performing computations on this data structure requires looping through the stock portfolio and accessing the 56-byte memory chunk for each stock position, even when the computation only needs a small piece of information about each stock position.
Alternatively, a data-oriented design of a stock portfolio system might look something like this:
The key difference in this example is that rather than using an array of structures (AoS) like we had in the first example, this class uses a structure of arrays (SoA) to store all the data associated with the stock portfolio. This allows us to perform computations that iterate over arrays of smaller data members which can be more efficiently cached. This can significantly boost the speed of the application in which this class is used.
How does this affect the performance of computational methods?
An implementation of computational methods in the object-oriented version might look like this:
Meanwhile, an implementation of computational methods in the data-oriented version might look like this:
When the total return is computed on a stock portfolio consisting of 1500 stock positions, the data-oriented implementation outperforms the object-oriented implementation by a factor of 2 on average (GCC 7.5.0; Ubuntu 18.04 64-bit; g++ -O2). Even for such a small piece of code, the change in the structure of the code resulted in a significantly more efficient system because the hardware had more direct access to the data. Based on this example, it is easy to see how object-oriented code written without regard for the underlying hardware can result in an inferior system, even when the code structure seems sensible.
Yes, DOD and OOP can (and should) coexist
This is because data-oriented design and object-oriented programming are not mutually exclusive. Data-oriented design can be approached as a programming philosophy rather than a specific design framework for code. It is characterized by programming with a primary focus on solving the data transformation problem before even considering code design. Once the programmer has decided on how the hardware should interact with the data, the code can then be designed around this data.
From here, the programmer can begin to analyze the trade-offs between different object-oriented models. Often, it is reasonable (and preferable) to sacrifice imperceptible performance differences for a more maintainable and/or reusable codebase, but the programmer can only understand the value of this trade-off if they understand how it affects the relationship between the hardware and the data. This can only be done if the programmer takes a data-oriented approach to writing their code.
Conclusion
When writing code with high-level programming languages, it is easy to get lost in the convenient levels of abstraction afforded to us, and this leads us to forget that our code is responsible for telling hardware how to transform data in a very specific way at a lower level. A data-oriented philosophy leads to code that is more fundamentally sound, and it forces developers to truly understand what their programming language is doing under the hood. Being conscientious of data-oriented design may not radically change the way you write code, but having a better understanding of why it is important will almost certainly make you a better developer.
Thanks for reading! I also want to give a shout-out to everyone in my Advanced Topics in C++ class for reviewing the first draft of this article. Your feedback was super helpful in the creation of the final version.
If you want to learn more about data-oriented design, I highly recommend checking out some of the resources below.
Additional Resources
- Noel Llopis — “Data-Oriented Design (Or Why You Might Be Shooting Yourself in The Foot With OOP)”
- Richard Fabian — Data-Oriented Design Book (free, online, reduced version)
- CppCon 2014: Mike Acton “Data-Oriented Design and C++”
- CppCon 2018: Stoyan Nikolov “OOP Is Dead, Long Live Data-oriented Design”
- CppCon 2019: Matt Godbolt “Path Tracing Three Ways: A Study of C++ Style”
- Feel free to suggest any resources that I should add!