Data Definitions: Beyond the Classroom

jakeisnt · Published in sandboxnu · 7 min read · Jan 29, 2020

Motivation

As projects grow and scale, adding new developers, more code, and more time to the process of developing software, their logistical complexity grows as well. New developers introduce new ways of thinking about and writing code, and every function written becomes another function for the team to understand, use, and support for the duration of the project. Initial effort invested in designing data makes code not only easier to write, but easier to communicate as well. With well-defined data, we provide and enforce guidelines that allow new code and new developers to integrate smoothly with the software already written, contributing to an infrastructure that can be relied upon as the project ages and grows.

Though developing a consistent and reliable set of data definitions may seem inconsequential to the organization of software, the practice is integral to building software of greater scope.

Photo by Kelly Sikkema on Unsplash

Fundies Lessons

The primary motivation for the data definition in CS2500 is to have well-formed data guide our code. Deriving templates from data definitions allows us to compose template-driven functions, which can be much easier to reason about and fill in than a blank slate.

As you’ve seen in your larger Fundies project, however, relying upon templates can begin to feel restrictive rather than helpful: all of a sudden, you’re playing tug-of-war with a complex data definition, contorting it to fit your ideal solution. As our data becomes more and more complex, crafting data-driven functions grows increasingly frustrating and difficult; the template never seems to match the idea you have for your code.

Though it can be frustrating, designing our data can and should still drive our code. Building software with the design of our data in mind enhances communication, structures code, and allows us to tackle software development problems on a greater scale than we could begin to devise on our own.

Why Structure?

According to Thomas Hobbes, the social contract is an implicit agreement between the people and the government; in exchange for some of the people’s freedoms, the government provides infrastructure. Despite its many controversies, there is no question that this contract, one of the foundational principles of our nation, has succeeded in producing a stable system of governance.

Programs, too, are a social contract, one drawn between the developers of the code and the end user. A strict hierarchy may constrain the creative potential of development, ruling out certain design patterns or lines of thinking, but it is only with this structure that we can develop the abstractions vital to considering software at a large scale.

The definition of our data is the first step in the construction of our contract. If we accept the contract we’ve defined, ceding the freedom to use certain design patterns or lines of thinking in exchange for a strict structure for our data, we in turn receive a well-defined pattern of thinking we can follow for every function, as well as the framework for a robust, reusable system of abstractions.

Data as Communication

There’s no question that communication is vital to any software development project. It’s rare that we work by ourselves, even if the project is individual; human or not, every library you use and API you interface with defines a different contract with which you use its data. Just as a data definition is a contract between the developer of a library and the end user, it is also a contract between developers crafting a tightly integrated project.

Consider GraduateNU. When laying down the foundations of the project with Alex Takayama, the two of us were thousands of miles apart — he in Boston enrolled in a summer term, and I at home on the west coast. We’d been working on two different aspects of the project: while I was parsing Northeastern’s degree audit, Alex was designing a scheduling algorithm to generate a complete plan of study given satisfied and unsatisfied degree requirements.

Our ability to smoothly integrate the two large pieces of code is rooted in the definition of our data. We defined a contract — a series of regimented data definitions (in this case, TypeScript interfaces) — to hold us accountable for every aspect of the schedule, from NUPath requirements to the semesters during which courses were taken.
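A sketch of what such a contract can look like as a TypeScript interface; only the interface name, the class IDs, credit hours, and NUPath requirements come from our actual contract, and the exact field names here are illustrative rather than GraduateNU’s real definition:

```typescript
// Illustrative sketch of the parser's output contract. The article names
// ICompleteCourse; these particular field names are assumptions.
interface ICompleteCourse {
  classId: number;     // e.g. 2500 for Fundies 1
  subject: string;     // e.g. "CS"
  name: string;        // human-readable course title
  creditHours: number; // credit hours, as stated in the degree audit
  nupaths: string[];   // NUPath requirements the course satisfies
  season: string;      // term in which the course was taken, e.g. "FL"
  year: number;        // e.g. 2019
}

// A sample value conforming to the contract:
const fundies1: ICompleteCourse = {
  classId: 2500,
  subject: "CS",
  name: "Fundamentals of Computer Science 1",
  creditHours: 4,
  nupaths: ["FQ"],
  season: "FL",
  year: 2019,
};
```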

Look familiar? Even if you’re new to TypeScript, the structure is there: an interface plays the same role as a CS2500 data definition, naming each piece of data a course must carry and the kind of value it holds.

Though this data definition is somewhat outdated now, it proved vital to the integration of our code. Because Alex knew his code would receive an array of completed courses with these specific fields, including accurate class ID numbers and credit hours (further specified in the definition), he could rely on every field of the course representation produced by the HTML parser to craft a proper schedule for a student. The contract we defined and enforced between our two pieces of code held us accountable for this consistency.

Function Fits Form

However, we realized this contract was a bit malformed: the structure of the data prioritized the convenience of the parser rather than informing the end goal, a schedule organized by semester.

After rewriting our application’s contract to inform the structure of the application, the GraduateNU team crafted a data definition that resembles the following:
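A sketch of that hierarchical shape; aside from the name ScheduleCourse, the interfaces and fields below are illustrative assumptions rather than GraduateNU’s actual definitions:

```typescript
// Illustrative sketch of the hierarchical schedule contract.
interface ScheduleCourse {
  classId: number;
  subject: string;
  creditHours: number;
}

// A semester is a list of courses plus its own metadata,
// rather than a modifiable label on each course.
interface ScheduleTerm {
  season: "FL" | "SP" | "S1" | "S2";
  year: number;
  courses: ScheduleCourse[];
}

interface Schedule {
  terms: ScheduleTerm[];
}

// Moving a course between semesters now means moving it between lists:
function moveCourse(
  from: ScheduleTerm,
  to: ScheduleTerm,
  course: ScheduleCourse
): void {
  from.courses = from.courses.filter((c) => c.classId !== course.classId);
  to.courses.push(course);
}
```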

Though the structure of the data is much more complex, it better serves the end goal of our application. Semesters are better managed as lists of courses with metadata than as modifiable labels on each course, and this more precise hierarchical data gives us clear templates from which to write our functions. Though changing a label on an ICompleteCourse was much easier than shifting a ScheduleCourse between semesters, this hierarchical data definition is far more regimented; it allows the SandboxNU developers to rely on the contract they’ve signed with their data to define the structure of the application, rather than relying on disorganized lists of courses.

Test-Driven Development — An Aside

A beneficial side effect of our data definitions is the ability to cheat when writing our tests! With well-formed and well-defined data, we can determine exactly what we’d like our function to accept and output, using this knowledge to compose effective, comprehensive test cases before filling the functions in. To craft the HTML parser, I first wrote out complex JSON representations of several sample degree audits, then crafted functions to adhere to this final representational output of the parser.

Defining these tests before the code is another way of enforcing this data definition contract, especially when working with dynamically typed languages: by defining the test cases beforehand, you preemptively promise to follow the structure of the data in the test cases, ensuring that the data produced by the function under test is not malformed.
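As an illustrative sketch (not the real parser, whose inputs and fixtures are far larger), writing the expected representation before the function might look like this; the function and field names are assumptions:

```typescript
// Hypothetical test-first sketch: the expected output is written before the
// parser exists, and the (assumed) parseAuditCourse function must conform to it.
interface ParsedCourse {
  classId: number;
  subject: string;
  creditHours: number;
}

// Expected representation, written first from a sample audit line:
const expected: ParsedCourse = { classId: 2500, subject: "CS", creditHours: 4 };

// A stand-in parser written afterward to satisfy the contract:
function parseAuditCourse(line: string): ParsedCourse {
  const [subject, id, credits] = line.split(" ");
  return { classId: Number(id), subject, creditHours: Number(credits) };
}

const actual = parseAuditCourse("CS 2500 4");
console.assert(actual.classId === expected.classId);
console.assert(actual.subject === expected.subject);
console.assert(actual.creditHours === expected.creditHours);
```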

Tips and Tricks

How, then, do I ensure I craft data definitions best suited to my problem?

As every problem is different, it’s difficult to provide hard-and-fast recommendations for structures to follow. That said, there are some principles worth following when designing data:

Rely on your Type System

The best way to write robust code is to have your compiler enforce its correctness at compile time. We can do this by making extensive use of language features for defining types, validating function arguments, and writing tests that catch potential errors in our code. This is why we don’t typically use [else …] branches in cond expressions in CS2500: though they’re a convenient catch-all, they lose us guarantees about the properties of whatever value falls into the else case, and so can’t as effectively catch our mistakes.
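The same principle applies in TypeScript: avoid the catch-all branch and let the compiler check exhaustiveness. A generic sketch, not GraduateNU code:

```typescript
// Sketch: letting the compiler enforce that every case is handled,
// analogous to avoiding [else ...] in a CS2500 cond.
type Season = { kind: "fall" } | { kind: "spring" } | { kind: "summer" };

function label(season: Season): string {
  switch (season.kind) {
    case "fall":
      return "FL";
    case "spring":
      return "SP";
    case "summer":
      return "SM";
    default: {
      // If a new Season variant is added and not handled above,
      // this assignment fails to type-check at compile time.
      const unreachable: never = season;
      return unreachable;
    }
  }
}
```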

Think of the End User

As exemplified above, though our data definitions live at the core of our application, their impact extends all the way to the end user by imposing a strict organization on the application’s features. After all, it’s easy to visualize a schedule by semester if the data is stored with semesters in mind, but quite a bit more difficult if all of the courses are thrown into an array without regard for the organization the application requires. It’s difficult to search for posts on a blog such as this one, too, if posts don’t carry tags or other information to inform search engines.

Optimize your data definitions for the actions of the end user first and foremost: their ability to interact with the application is directly shaped by your data, so it’s vital that we, as developers, hold ourselves to the same contract we will extend to them.

Keep it Simple

It’s much better to draft many concise, nested data definitions, each focusing only on the information necessary to a structure or object, than to flatten information in order to minimize the number of data definitions. Though the latter approach seems to make data easier to manipulate, in reality it leads to bloated functions that try to handle too many different components of the program’s logic. If we store everything to do with a game in a single ‘game’ structure, for example, it will all be easily accessible from anywhere in our program; but as the game requires more and more information to operate, functions concerning the game will have to deal with ever more fields, and the developer will have to be very careful about what they use under what circumstances. If we instead separate the game into a pacman structure with certain properties and a list of ghosts, each with their own properties, the game will be more intuitive to work with and more easily extensible.
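A quick sketch of the contrast in TypeScript; the pacman-and-ghosts example is the one above, and the exact field names are illustrative:

```typescript
// Flat approach: everything lives in one "game" structure, so every
// function sees every field, whether it needs it or not.
interface FlatGame {
  pacmanX: number;
  pacmanY: number;
  pacmanLives: number;
  ghostXs: number[];
  ghostYs: number[];
  ghostColors: string[];
  score: number;
}

// Compartmentalized approach: each piece of data is its own unit.
interface Position {
  x: number;
  y: number;
}

interface Pacman {
  position: Position;
  lives: number;
}

interface Ghost {
  position: Position;
  color: string;
}

interface Game {
  pacman: Pacman;
  ghosts: Ghost[];
  score: number;
}

// Functions can now take only the data they need:
function ghostAt(ghosts: Ghost[], pos: Position): Ghost | undefined {
  return ghosts.find(
    (g) => g.position.x === pos.x && g.position.y === pos.y
  );
}
```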

Compartmentalized data is easier to interact with and weave together than lengthy, bloated data definitions; it can be threaded and tied together without knots or tangles. Think of this just like function abstractions — in the same way we decide to write a helper function to handle a subset of a problem, we define another data abstraction to handle a piece of data that should be compartmentalized into its own unit. This structure will both inform our data for the better and structure our code for the future.
