NoSQL Database Doesn’t Mean No Schema
Artisanal donuts > artisanal data
If I offered you an artisanal donut, it would probably earn an emphatic, “Yes, please!” But what would your response be to my offer of artisanal data? Irregular, unique data isn’t very appealing so I’m guessing you’d be less enthused. What is artisanal data? Imagine the data from each business transaction being written up as color commentary from a sporting event, “Today an older couple bought a box of our finest…” and then being asked to categorize, sort, and interpret this. This is what artisanal data looks like - and it’s not terribly useful. So what can we do to assure we have usable collections of data that can used for meaningful decision-making?
This of course raises questions about data quality, data stewardship, and even the basic meaning of the data. How can we trust one-of-a-kind data? If it doesn’t fit with other data, how was it created? Is there any real value in preserving it? While NoSQL databases offer an appealing level of flexibility in handling this, a formal schema definition can be very helpful. But first, a little background, then we’ll look at why and how we can avoid creating artisanal data.
When working with a SQL database, we’re often confronted with complex schemas that define the structure of the data. When we want to make changes to the database, we may have to wrestle with schema changes as well. The implications of making a schema change include being sure that existing data fits the new schema. Or, more commonly, that the existing application programming won’t break when we modify the database schema.
In order to avoid the cost of schema updates, clever engineers have invented a variety of databases that break the SQL mold. I’ll use the Mongo database as an example here but we could do this with CouchDB or the Elasticsearch database. This also applies to all databases that seek to reduce or eliminate wrestling with the formal schema definition.
The catch with this approach is software never really escapes from the constraints imposed by a schema. The worst scenario is an informal, chaotic schema where every fact is a unique, glittering snowflake. To do bulk analysis, some order must be imposed on the data. And while it’s good to have application code that’s nimble and responsive to change, at some point being too nimble can begin to look like stumbling. So how can we achieve a high level of flexibility and still impose meaningful constraints on the data?
And yes, this is a “Python to the Rescue” story. Much like my last blog post.
One SQL database tactic entails using three closely-related schemas to describe our data. These schemas provide a tidy, formal definition that can assure the data and processing will agree. These three schemas are:
● As data analysts, we care about meaningful data that answers our questions. In a relational database, this schema is the view definitions and tables of the logical model.
● As application developers, we care about a logical model of the data. This schema will impose a structure on application code. The schema should also meet a number of technical constraints to reduce data duplication, and prevent “update anomalies”. (Imagine the confusion when you update one transaction and the total balance doesn’t change. Database engineers call this a “normal form” violation.)
● As database administrators, we care about a physical model of the data. This schema defines where the data lives in the files and disk drives of some computer(s).
While the core idea is elegant, the implementation can lead to increased costs and complexities. Note that these schemata don’t capture meaning; they’re focused on structure. As data analysts, for example, we might know of different policies and constraints that are required to understand the data. The difference between values like “incomplete” and “not applicable” is not part of the database schema: in the database they’re both merely strings of letters. To someone who works with the data, however, these strings may reflect profound differences in meaning.
Also, to make things a little more complex, each of these schemata enjoys several representations. The schemas have a formal representation as SQL language statements. They also exist in the database where they can be used by the database engine. Further, they are also reflected in application programs. Developers and database administrators will write and execute SQL statements to define and modify the database structure. Devs will also write code (in Python, for example) that reflects the database structure.
Keeping the database, the application processing, and the SQL code in synch is where costs and complexities can arise. While the three-schema idea isolates the different communities from each other, it also means that a change to one can have ripples into another. If the developers change the logical view, the definition of the user’s view may also need to change so the view remains consistent. When the business model changes, the devs may have to adjust the logical schema — and possibly the physical schema –to make sure new features can be processed.
As we go forward, we’ll see different schema representations. The application code is — in effect — a copy of the logical schema, but written in a programming language like Python. This means that developers must coordinate the code version of the schema with the SQL version of the schema.
One NoSQL database approach is to eschew the SQL language. This is part of reducing the burdens of a formal schema definition. It’s possible, of course, for this kind of freedom to devolve into anarchy. NoSQL Anarchy is that feeling when it seems to take too much code to make the data usable.
The following features can help to avoid anarchy:
● A formal schema
● That is defined only once
● That is preferably in the code
We want a formal schema to provide a tidy, central definition of the data organization. We want it defined in one place so that we don’t have conflicts. Putting the schema into the application code means that application developers can create both the schema definition and define the processing for the data. The almost-too-obvious-to-repeat benefit is to eliminate writing SQL statements. The other benefit is to allow application developers to do some of the database administration work. This can reduce the cost and complexity of schema changes.
How can we get the benefits of a schema without the overheads? I’ll cover several techniques available in Python below.
The approach that can most quickly devolve to anarchy and artisanal data is an entirely ad-hoc schema buried in the code. Here’s an example that conceals a possible epic failure.
Where is this failure hiding? In this code, the doc[‘status’], doc[‘name’], and doc[‘address’] are all based on some implicit schema definition. Do other programs also use these fields? Do all of the programs that interact with this database agree on the schema? What happens if they don’t?
As you can see, this approach can lead to artisanal data. Different application programs can disagree on the definitions of the structure of documents and the names of collections in the database. This means that documents in the database don’t have a common schema, and our big data analysis turns into big data cleanup. Worse, we might have data that’s invisible to application programs, or application programs that produce the wrong results because the program which produced the data didn’t have the same schema as the program analyzing the data.
An Object-Document Mapping Layer
Each of these allow us to define the schema via class definitions. We can use something like this:
This has the advantage of elegant clarity. We can add document-specific processing as methods in the class definition. Adding methods can have some advantages, however, it’s not universally valuable.
However, in some applications, particularly with big data analytics, detail documents won’t have any methods to speak of. The processing overhead to create these Python objects might not be beneficial. It also might be more efficient to directly process the lower-level Mongo-friendly BSON objects.
If we’re only working in Python, these class definitions have the advantage of being a tidy, complete definition for the documents in our database. We can use this to avoid artisanal data.
Additionally, if we’re working in a polyglot environment, we may have to share definitions among multiple programming languages. In this case, the Python class definition isn’t ideal, since it can’t be shared across languages.
Schema as Separate Metadata
A good technique for avoiding anarchy is to have a schema definition in some language-agnostic form. This can be shared among programs and tools. Different programs can then agree on the schema, and we can avoid artisanal data. Many NoSQL databases represent documents in JSON notation. We can also represent a schema definition in JSON notation.
See http://json-schema.org/ for an elegant formal schema specification using JSON documents.
A JSON-based schema definition is the backbone of the OpenAPI (also known as Swagger) specification. It provides a reasonably complete set of rules for defining a schema for the kinds of documents we can keep in a NoSQL database. We can represent the schema using a JSON schema definition. We can then use a variety of tools to validate documents before loading database documents. For this post we’ll focus on validating the schema and the data with the jsonschema project.
See https://python-jsonschema.readthedocs.org/en/latest/ for more.
When we work with database documents as Python lists and dictionaries, we can use jsonschema to determine if those data structures match a given schema definition. A Mongo SON object is an extension to the Python dictionary, making it ideal for validation using jsonschema.
We’ll look at three use cases for a JSON schema definition: design and modeling, data validation, and schema migration. After that, we’ll look at three places where we can store the schema definition so that all our Python (and non-Python) programs can use it. Finally, we’ll look at Python metaclass that makes this relatively simple.
Design and Modeling
Database design often starts on a whiteboard. (Who am I trying to kid? It seems to frequently start on a coffee-shop napkin filled with donut crumbs. And then moves to a whiteboard.)
At some point, a tangible example is more valuable than another presentation or sketch on a whiteboard. When using Python, that point can arrive very early in the life of a project. The sooner we get to tangible examples, the sooner we can identify counter-examples and performance bottlenecks.
My personal take is that creating technical spikes in Python may be the ideal way to explore design alternatives. Rather than thrash through competing presentations, it’s often easier to simply build the Python code and show details of the data and performance implications.
Let’s say we’ve got several million customer profile records. Let’s further pretend that we’re analyzing source data that unifies information from several lines of business, each of which has a unique view of the customer relationship.
Here’s how we can start this project. First, we can throw together classes that approximate the document definitions we’ll be working with. A class with an embedded schema definition can look like this:
The class extends the PyMongo bson.SON class definition. This class is an ordered dictionary, with a few Mongo-friendly features added to it.
Within the Customer1 class we’ve provided the JSON schema in the form of a SON object, assigned to the class-level variable, SCHEMA. This schema can be used for a number of things.
Since the class is essentially a dictionary, we can experiment with different kinds of documents freely. We’re able to populate documents and draft experimental processing flows at very low cost. This is the benefit of a technology spike built with Python.
I like to include the pass statement to affirm that there’s really nothing else in the class beyond the docstring. The statement doesn’t do anything, and isn’t required, but I find it to be a helpful reminder that the class definition is complete.
Now we can throw together documents that illustrate the schema with concrete examples. Creating two instances of the schema looks like this:
We can use these objects to design algorithms and HTML presentations. We can also refine the schema as we come to better understand the problem and each potential solution.
NOTE: there is a downside for SON as the base class. Attribute references use a lot of syntax, leading to expressions like this:
Generally we’d prefer to use c2.address to refer to the value of an attribute. For that, we’d have to use a sophisticated metaclass. The point here is to work through ideas quickly, not build the final application.
For design and modeling, it helps to validate every single document every time it’s used. This adds a little processing overhead. The benefit of knowing that the documents fit the model far outweigh the cost.
So how do we know the schema itself is valid? A check for the schema’s syntax looks like this:
We can make a quick check to be sure the schema itself fits the JSON-Schema standards. With some small exceptions, a schema that fits the JSON schema standard will also fit the Open API (Swagger) standards. [If you’re curious: Swagger changes a few of the more advanced features of JSON schema. Swagger doesn’t use the allOf keyword; it also limits what can appears in an items, properties, and additionalproperties keywords.]
We can use the following bit of code to emit a JSON document suitable as an external reference in a Swagger specification:
This will translate the SCHEMA document into JSON notation.
Looking carefully at the example documents — c1 and c2, shown above — we can see that document c2 is not compliant with the schema: the status field has an invalid value. When creating technical spikes, this is a handy feature. A relaxed approach to validation allows us to explore the problem space without simultaneously struggling against technical hurdles.
As we move toward a robust, scalable solution, the data must be valid. Data validation is a complex subject and there are several degrees of valid:
● Atomic Field Values. Fields which have values that are single instances of primitive types are covered nicely in the jsonschema validation.
● Document Structure. Fields with complex array or document values are covered nicely by jsonschema validation, also.
● Document Relationships. We might have a document which has an array of balances and a single field with a total for that array. The JSON schema doesn’t provide a formal way for specifying relationships among items. We can provide description text for this.
● Collection of Documents. We might have a further level of constraint where one collection in a Mongo database has summary information that describes documents in a another collection.
A simple case of field validation looks like this:
This will apply the validate() method to a document which will raise an exception for any invalid value. We might also want to use a try: statement to print something meaningful for the raised exception. We’ll return to that later.
The exception summary looks like this:
When we’re doing design work, this can help us determine that the design covers the examples. We can leverage this simple technique to build examples and refine schema quickly.
Generally, we need to watch how often we do data validations. For large collections of data, and complex documents, the processing times can add up. It’s essential to validate documents when they’re created. The other time that it’s essential is when doing a schema migration.
For operations in application code, however, we don’t want to do this kind of manual validation. For that, we’ll need something a little bit smarter. Below, we’ll show some code for a metaclass that handles the validation processing when objects are created.
Schema migration happens when we’re going to make wholesale changes to the structure of documents in a collection such as adding a field to all existing documents or renaming fields. Adding an optional field is a less drastic change to the application: the code can continue to use old documents without the field as well as new documents with the field.
When making changes to existing documents in a database, it’s helpful to make sure the documents actually conform to the old schema before attempting any changes. We should also check all documents to make sure they conform to the new schema before trying to use the database after the change.
We can use the jsonschema validation like this:
We’ve used the Customer1.SCHEMA to validate each object in the customer_1 collection. We’ve used a try: block so that we can print each exception that gets raised and display the non-compliant documents.
This will take time for large databases or complex documents within a database. The time required has to be weighed against the cost of non-compliant documents. Since the code is so simple, it’s easy to decompose the processing into multiple concurrent processes, each of which validates a portion of a large database.
A migration may involve doing additional processing. In that case, we can create an instance of the Customer1 class like this:
This will transform the underlying Mongo BSON document into an instance of Customer1. If we have customized methods defined, they’re usable on document.
Places to Keep the Schema
In the examples above, we’ve included the schema within a class definition using a class-level variable, SCHEMA. We have three other common choices for where to keep the schema: in a file, in a server, and in a database.
In A File
To include a schema from a file, we might use something like this:
We’ve defined a Path to the directory which contains the schema definitions. We can then open a specific schema file, and use this object to populate the definition associated with an application class.
In A Server
We might also use a URL to load a schema from an api-docs directory. We can use urllib.request to read a schema definition.
Now we’ve opened a URL and read the JSON specification. These are generally available as bytes and we’ll need to decode the bytes to recover proper text. We can then parse that text to create the schema document that our application can use.
In A Database
If we want to keep schema information in a database, we might use something like this to read the schema details.
This assumes that we’ve got a PyMongo client named db. It also requires that the database has a schema collection that includes the various schema documents. We’ve used the schema title and an extension (x-version) to locate the proper version of a schema definition for use in our application.
With MongoDB, there is a schema validation rule set that can be attached to a collection. This provides a way to implement some aspects of the JSON Schema validation during database insert and update operations.
In all three cases, the amount of code required to load a schema from an external source is small. This is a delightful feature of Python! Because Python is a dynamic language, we can write very clever classes which adapt to schema details.
Generally, we want our schema validation to be essentially invisible. The best way to create seamless validation is to use a metaclass.
A Python Metaclass
Our application code is generally focused on instances of some data model. Above, we created classes which extend the SON collection. This required a little bit of manual overhead to do validation. We can make the validation step seamless using a metaclass.
Adding a metaclass will lead us to an abstract Model class. Having this abstract class that defines the base class and the metaclass will simplify our applications. Here’s the superclass we can use for validated NoSQL data models:
This Model class is a subclass of PyMongo’s SON class. We can interact with Mongo SON objects in a direct and low-overhead way.
Any subclass needs to provide a value for SCHEMA and the given schema definition will be used for data validation. This will involve a little bit of metaclass magic, as shown below.
The __init__ method, for example, makes use of an attribute named _validator that’s created by the metaclass. It’s helpful to set a few other class-level values based on the details of the schema definition.
We might use the Model class as follows:
This is essentially identical to the example given above. We’ve made one material change: we’ve replaced SON with Model. The rest of the class definition is identical, but we now have the validation feature available.
How do we create the self._validator? Here’s the metaclass that builds the _validator attribute from the schema definition:
We’ve done a few things to prepare a subclass of Model for general use. First, we checked the schema to be sure that it was valid according to the JSON schema specification. This step isn’t required because it will also be performed when we attempt to build the _validator. However, it’s handy to do this separately because it can help debug problems with the schema definition.
If the schema is valid, then we’ll extract four values from the schema.
● We’ll use the schema’s description property as the docstring for the class as a whole.
● A schema extension, x-collection, is used to set the COLLECTION attribute. This can be handy for identifying the proper MongoDB collection name.
● A schema extension, x-version, sets the VERSION attribute. This can be handy for checking application compatibility.
● Finally, we create the _validator attribute of the class. This allows us to use self._validator.validate() to check validity of an instance of the class as needed. By default, we only do this when creating a new instance. We can add methods that will also use the _validator. Also, other classes can use Customer1._validator as part of data preparation or cleansing operations.
A NoSQL database removes some types of complexity and overhead from the development of database solutions. While it can allow us to discard a formal schema definition, this may be too much freedom. A NoSQL database without some controls over the data can devolve into artisanal data that dilutes the value of the data. The general idea of a shared schema is essential for assuring that all the applications that use a database agree on the database structures. A box of a dozen donuts shouldn’t include tennis balls, oysters, or kittens. (It shouldn’t include jelly donuts, either.)
We can embed the schema within Python class definitions, we can save it in external files in JSON notation, we can provide it from a server, and can even save it in the Mongo database. What’s important is that we adhere to three core principles:
● A formal schema
● That is defined only once
● And is preferably in the code.
We can apply the schema to data validation at several points in a project’s lifecycle. We might want to have a lot of flexibility during initial design. When we build the application, we want assurance that object instances conform to a schema. When making structural changes to a database, we would like to confirm that all the documents match a schema before attempting any changes. We’d also like to confirm that all documents match a schema after making the changes.
To gain the flexibility of a NoSQL database, and avoid artisanal data, we need a flexible definition of the schema. Using a module like jsonschema allows us to represent a schema in JSON. The schema can be embedded in a class definition, and it can be used to validate all instances of the class.
The examples above include a small (10 lines of code) metaclass definition, SchemaMeta, and another small (18 lines of code) superclass definition, Model. With this we can leverage a schema with database-oriented class definitions. This gives us the ability to assure the quality of the data, and avoid creating data that can’t be analyzed, or applying the wrong analysis to the data.
The power of Python means that a few lines of code can provide an essential capability for managing data. This can avoid the potential for chaos that can come with a NoSQL database. I’d much rather enjoy artisanal donuts than struggle with the complexities of artisanal data. And I’m betting you feel the same.