Data Processing Programming (4): Principles in Action

For programmers, data engineers, data scientists and anyone who programs on data

Nazar Merza
Sep 18, 2020

Now that we have some tools (principles and concepts) at hand, it is time to apply them to concrete examples. The purpose of these examples is only to illustrate the programming point, not their particular business content or meaning.

Note: The examples in this section are written in SQL because SQL illustrates the idea of structural code complexity well. The goal, however, is not to show SQL coding techniques but to present general programming ideas that are not tied to any one language.

In each example we start with a problematic piece of code, explain what the problem is, and then reason about how it can be changed or improved. The problems are resolved through various methods, mostly based on the principle of Separation of Concerns (SoC). Along the way, further engineering and programming concepts are introduced.

Example 1

Look at the following code and, before reading the explanation, try to understand what it does and whether it has a problem; and if so, what the problem is.

Code from DEV

Well, the above code is not easy to decipher, at least not without serious effort. Yet once the query is understood, it turns out to carry no complex logic, and for that reason it should be much easier to read. It extracts some employee-related data from a number of tables and then transforms and formats the extracted data. So what makes such simple logic difficult to understand? The problem is that the code does too many things at once, which makes it hard to read. Readability was given earlier as a criterion for code quality.

Doing too many things at once (in the same function, step, program or unit) is one of the most frequent coding problems in data programming, and in programming generally. It results in complex code structures, that is, bad code.

Code that is hard to understand is hard to test, change, and maintain. It is prone to errors, and when an error does happen, as it almost always does in such cases, it is hard to find.

Going back to the example code: specifically, it mixes two concerns, data extraction and data transformation. Recognizing this and separating the two concerns greatly improves the code. Hence, the first step for extracting data looks like this:

The new code is much better (though still not optimal):

  • Its intent is clear and definite: extract employee data.
  • The relation between fields and source tables is clearer.
  • It is easier to see what kind of information is extracted and how it is grouped: employee personal info, job info, job history info, location info etc.
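
An extraction-only step of the kind described above can be sketched as follows. This is a minimal illustration, not the original listing: the tables `employee` and `job`, their columns, and the sample rows are all hypothetical, and the SQL is run through Python's built-in `sqlite3` module only so that the sketch is runnable as is.

```python
import sqlite3

# Hypothetical schema and rows, invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE job (job_id TEXT PRIMARY KEY, job_title TEXT);
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY,
        first_name TEXT, last_name TEXT,
        hire_date TEXT, job_id TEXT REFERENCES job(job_id)
    );
    INSERT INTO job VALUES ('DEV', 'Developer'), ('ANL', 'Analyst');
    INSERT INTO employee VALUES
        (1, 'ada', 'lovelace', '2019-03-01', 'DEV'),
        (2, 'alan', 'turing', '2020-07-15', 'ANL');
""")

# Extraction only: pick raw columns from their source tables.
# No formatting, casing, or derived fields appear in this step.
EXTRACT_EMPLOYEE = """
    CREATE TABLE employee_extract AS
    SELECT e.emp_id, e.first_name, e.last_name, e.hire_date, j.job_title
    FROM employee AS e
    JOIN job AS j ON j.job_id = e.job_id
"""
conn.execute(EXTRACT_EMPLOYEE)

rows = conn.execute(
    "SELECT * FROM employee_extract ORDER BY emp_id"
).fetchall()
print(rows)
```

Because the step only selects and joins, its one-sentence description ("extract employee data") covers everything it does.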

How to identify separate concerns in general?

But before revising this code further, let’s pause and ask: how, in general, does one identify separate concerns in a piece of code? In object-oriented programming (OOP), code is partitioned into units called classes. One software design rule for classes is the Single Responsibility Principle, according to which each class should do one thing and only one thing. This design idea can be generalized to programming beyond OOP. One way to apply it is to write a descriptive statement about the unit under consideration that answers the question “what does this unit do?”. If the statement points to only one thing or action, the unit has one concern. If it points to more than one, the unit potentially has more than one concern. Applying this rule to the original example code: it extracts data and transforms it, so it does two things.

Further revision

The revised code for extracting data, though clearer and cleaner than the original, is still not good enough: it has a giant JOIN clause with too many tables. Some mistakenly view this kind of structure as advanced coding, but it is in fact a programming problem, which I would like to call the Illusion of advanced coding. This illusion quite often results in poor code quality, or sometimes in complete project failure. (Complex queries are one of the major problems in data-warehousing projects and a cause of their failure.)

The above join can be made simpler by breaking it into multiple steps. In some cases it is clear how to divide a join, e.g. when the tables belong to clearly separate logical groupings. For example, if three tables in a query contain customer-level info and two others contain account-level info, the query can easily be broken along this line into two units. In this particular case, the code can be divided into current employee info, historical job info, job location info etc.

Another thing that makes this code complex is that some tables appear in the JOIN multiple times under different semantic roles. For example, the table EMPLOYEE is used once to extract employee info and a second time to extract the employee’s manager info (a manager is also an employee, after all). This is perhaps the bigger source of complexity, and separating the different occurrences of the same table into separate queries makes the code much easier to follow. Perhaps something like the following:

But, as already emphasized, these ideas about code design are not limited to SQL or any one programming language. It is not only large SQL joins that are problematic. Big code constructs in general are a code smell and should be avoided. Code smell is an important concept in programming.
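
A minimal sketch of separating the two roles of one table, assuming a hypothetical `employee` table with a `manager_id` column (the table layout, view names and rows are invented for illustration, and the SQL is executed through Python's `sqlite3` module so the sketch can be run):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table: a manager is just another employee row.
    CREATE TABLE employee (
        emp_id INTEGER PRIMARY KEY, name TEXT, manager_id INTEGER
    );
    INSERT INTO employee VALUES
        (1, 'Grace', NULL), (2, 'Linus', 1), (3, 'Ken', 1);

    -- Role 1: the table read as "employees".
    CREATE VIEW employee_info AS
    SELECT emp_id, name AS employee_name, manager_id
    FROM employee;

    -- Role 2: the same table read as "managers".
    CREATE VIEW manager_info AS
    SELECT emp_id AS manager_id, name AS manager_name
    FROM employee
    WHERE emp_id IN (SELECT manager_id FROM employee);
""")

# The final step joins two named units instead of aliasing
# the same table twice inside one large JOIN.
rows = conn.execute("""
    SELECT e.employee_name, m.manager_name
    FROM employee_info AS e
    LEFT JOIN manager_info AS m ON m.manager_id = e.manager_id
    ORDER BY e.emp_id
""").fetchall()
print(rows)
```

Each view now has a name that states its role, so the final join reads as "employees with their managers" rather than as a self-join whose aliases the reader has to decode.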

A code smell is a surface indication that usually corresponds to a deeper problem in the system.

With the data extracted, the original transformation is encapsulated in its own unit as follows:
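
As a rough sketch of a transformation-only unit, assume the extraction step has already produced a table named `employee_extract` (a hypothetical name, as are its columns and rows). The transformation then reads only from that table and does nothing except formatting and deriving fields:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Stand-in for the output of an earlier extraction step (hypothetical).
    CREATE TABLE employee_extract (
        emp_id INTEGER, first_name TEXT, last_name TEXT, hire_date TEXT
    );
    INSERT INTO employee_extract VALUES
        (1, 'ada', 'lovelace', '2019-03-01'),
        (2, 'alan', 'turing', '2020-07-15');
""")

# Transformation only: no source tables and no joins, just
# formatting and derived fields over the extracted data.
TRANSFORM_EMPLOYEE = """
    SELECT
        emp_id,
        UPPER(last_name) || ', ' || first_name AS display_name,
        CAST(strftime('%Y', hire_date) AS INTEGER) AS hire_year
    FROM employee_extract
    ORDER BY emp_id
"""
report = conn.execute(TRANSFORM_EMPLOYEE).fetchall()
print(report)
```

A change to the display format now touches only this unit, and a change to the source tables touches only the extraction.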

Refactoring

In the above example, the method used to simplify and improve the code is called Refactoring, and it is a very important concept in programming and software design. According to Martin Fowler, who popularized the concept:

“Refactoring is a disciplined technique for restructuring an existing body of code, altering its internal structure without changing its external behavior.

Its heart is a series of small behavior-preserving transformations. Each transformation (called a “refactoring”) does little, but a sequence of these transformations can produce a significant restructuring. Since each refactoring is small, it’s less likely to go wrong. The system is kept fully working after each refactoring, reducing the chances that a system can get seriously broken during the restructuring.”

According to another definition,

“In computer programming and software design, code refactoring is the process of restructuring existing computer code, changing the factoring, without changing its external behavior. Refactoring is intended to improve the design, structure, and/or implementation of the software (its non-functional attributes), while preserving its functionality. Potential advantages of refactoring may include improved code readability and reduced complexity; these can improve the source code’s maintainability and create a simpler, cleaner, or more expressive internal architecture or object model to improve extensibility.”
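
The condition “without changing its external behavior” in both definitions can be checked mechanically: run the code before and after the refactoring on the same input and compare the results. The sketch below does this for a small, hypothetical query (the tables, columns and function names are invented for illustration), using Python's `sqlite3` module to run the SQL.

```python
import sqlite3

def make_db():
    """Build a small in-memory database with hypothetical sample data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE job (job_id TEXT, job_title TEXT);
        CREATE TABLE employee (emp_id INTEGER, last_name TEXT, job_id TEXT);
        INSERT INTO job VALUES ('DEV', 'Developer'), ('ANL', 'Analyst');
        INSERT INTO employee VALUES
            (1, 'lovelace', 'DEV'), (2, 'turing', 'ANL');
    """)
    return conn

def before_refactoring(conn):
    # One statement that extracts and formats at the same time.
    return conn.execute("""
        SELECT e.emp_id, UPPER(e.last_name) || ' (' || j.job_title || ')'
        FROM employee AS e JOIN job AS j ON j.job_id = e.job_id
        ORDER BY e.emp_id
    """).fetchall()

def after_refactoring(conn):
    # Step 1: extraction only.
    conn.execute("""
        CREATE TEMP TABLE emp_extract AS
        SELECT e.emp_id, e.last_name, j.job_title
        FROM employee AS e JOIN job AS j ON j.job_id = e.job_id
    """)
    # Step 2: transformation only.
    return conn.execute("""
        SELECT emp_id, UPPER(last_name) || ' (' || job_title || ')'
        FROM emp_extract
        ORDER BY emp_id
    """).fetchall()

old = before_refactoring(make_db())
new = after_refactoring(make_db())
assert old == new  # same external behavior, different internal structure
print(new)
```

A check like this only covers the inputs it is run on, so in practice the sample data should include the awkward cases (missing values, ties, empty tables) that the real data can contain.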

There is one important point, though, that is not included in these definitions but in my view should be added: refactoring has two meanings, a literal one and a logical one. In the literal sense, it is applied to existing, already-written code to improve it. In the logical sense it does not have to be that way. That is, code should not be written badly first and then improved through refactoring. Rather, it should be written with refactoring already in mind, so that it is properly designed from the start and requires the least amount of refactoring, ideally (in principle) none at all.

To recap and further clarify the ideas discussed so far:

  • Code Complexity was the problem, as expressed in the original code example.
  • Refactoring is a method to resolve this problem or improve the code.
  • Separation of Concerns is one of the main principles by which refactoring, or any other solution for complexity, works.

Example 2

The following example further illustrates and reinforces the idea of structural code complexity and why it is a problem. To begin with, like the first example, try to read and understand the code as is, without any extra information or context.

Code example from DataQuest

You may ask: how can one understand code without explanatory information, without someone already familiar with it explaining what it does? True, this is the expectation that exists in practice in most places, but it is simply wrong!

When someone inherits a code base or gets to work with it, the original author is, in general, not there to explain the code. Such an explanation should not be assumed as part of any programming work. The right way is to assume the exact opposite and write the code to be as self-explanatory as possible. Code written according to programming principles has to be readable and expressive.

Back to the example code: like the first example, it is not easy to understand what it does. One could possibly figure it out with some toiling, but why should that be the case? What happens if one gets code like this that goes over many pages, perhaps many tens of pages?

This code uses the credit_card_complaints table, which, as its name implies, keeps data about credit card complaints. Each record contains the company, state, zip code, complaint ID and some other info. The query is supposed to answer: for each company, find the state/zip code(s) with the highest number of complaints.

Based on this logic, this is a straightforward MIN/MAX problem, which is very common in data analysis and should not be difficult or complex at all. Hence, the complexity of the given code comes not from the logic it implements but from the way the code is written.

Breaking the problem into simpler steps, one can construct the following algorithm:

  1. Count the complaints for each company/state/zip.
  2. For each company, find the highest count from step 1.
  3. For each company, get the state/zip (one or more) associated with the highest count.

Note: It is very important in programming to design before coding. If this principle is followed properly, it fundamentally changes the resulting program or software. Design does not mean extensive drawings, diagrams etc. It can be as simple as the above 3-step algorithm, written down before coding. Once one begins coding, it becomes much harder to keep the overall picture in view, which is crucial for design.

Here is the revised code, in three steps:

Unlike the original complex query with many sub-queries, this new algorithm breaks the problem into three sequential, linear steps, each of which is simple and can be understood on its own.
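
The three-step algorithm can be sketched as three small, named queries. This is a minimal illustration under stated assumptions: the table name `credit_card_complaints` comes from the example, but the column names (`complaint_id`, `company`, `state`, `zip_code`), the view names and the sample rows are hypothetical, and Python's `sqlite3` module is used only to make the SQL runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical columns and rows, for illustration only.
    CREATE TABLE credit_card_complaints (
        complaint_id INTEGER, company TEXT, state TEXT, zip_code TEXT
    );
    INSERT INTO credit_card_complaints VALUES
        (1, 'Acme Bank', 'NY', '10001'),
        (2, 'Acme Bank', 'NY', '10001'),
        (3, 'Acme Bank', 'CA', '94105'),
        (4, 'Zen Card',  'TX', '73301'),
        (5, 'Zen Card',  'WA', '98101');

    -- Step 1: count the complaints per company/state/zip.
    CREATE VIEW complaint_counts AS
    SELECT company, state, zip_code, COUNT(complaint_id) AS complaint_count
    FROM credit_card_complaints
    GROUP BY company, state, zip_code;

    -- Step 2: for each company, find the highest count from step 1.
    CREATE VIEW max_counts AS
    SELECT company, MAX(complaint_count) AS max_count
    FROM complaint_counts
    GROUP BY company;

    -- Step 3: for each company, keep the state/zip rows that reach the maximum.
    CREATE VIEW top_locations AS
    SELECT c.company, c.state, c.zip_code, c.complaint_count
    FROM complaint_counts AS c
    JOIN max_counts AS m
      ON m.company = c.company AND m.max_count = c.complaint_count;
""")

top = conn.execute(
    "SELECT * FROM top_locations ORDER BY company, state"
).fetchall()
print(top)
```

In the sample data 'Zen Card' has two locations tied at one complaint each, and both are returned, which matches the "one or more" wording of step 3.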

Complexity and Code Size (lines of code)

In general, the quality of code should not be measured by how few lines it has. As was shown in the above examples:

N linear 1-dimensional constructs are better than one N-dimensional construct.

In other words, the degree of complexity of a code structure depends on the number of concerns implemented in that structure. The more concerns in the same step, the more complex it is.

To justify this statement, consider the two versions of the example and compare them in terms of testability. In the revised code, each step does only one thing, which is easily testable.
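
To make the testability point concrete, here is a sketch of a test for a single step: "for each company, find the highest count". The input table `complaint_counts`, its columns and the function name are hypothetical. The input is built by hand, so the test does not depend on any other step.

```python
import sqlite3

# One step as a standalone unit (all names are hypothetical).
MAX_PER_COMPANY = """
    SELECT company, MAX(complaint_count) AS max_count
    FROM complaint_counts
    GROUP BY company
    ORDER BY company
"""

def test_max_per_company():
    conn = sqlite3.connect(":memory:")
    # The step's input is created directly, so the counting step
    # that would normally produce it is not involved at all.
    conn.executescript("""
        CREATE TABLE complaint_counts (
            company TEXT, state TEXT, zip_code TEXT, complaint_count INTEGER
        );
        INSERT INTO complaint_counts VALUES
            ('Acme Bank', 'NY', '10001', 7),
            ('Acme Bank', 'CA', '94105', 3),
            ('Zen Card',  'TX', '73301', 2);
    """)
    result = conn.execute(MAX_PER_COMPANY).fetchall()
    assert result == [('Acme Bank', 7), ('Zen Card', 2)]
    return result

result = test_max_per_company()
print(result)
```

With a single nested query there is no comparable place to feed in a hand-made intermediate result, so only the final output can be checked.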

Final Notes

In both examples the actual logic is simple, yet the code is hard to understand. If complex computation is added (as is the case in the real world), the code becomes even more complex. Avoiding structural complexity is therefore very important in programming.

<< Previous: Data Processing Programming, a Software Engineering Approach (3): Separation of Concerns

to be continued …

Originally published at https://objectacademy.com.
