The first thing we need to do is understand what graph processing is and how it relates to SQL Server. In the second part of this series, we will talk about Entity Framework Core and how we can use it with SQL Server graphing. If you don’t know or care about EF Core though, don’t worry, that’s the last time we’ll mention it in here!
Here however, we will focus entirely on the database.
You may be thinking about bar and pie charts? Or now maybe just pies… Either way, you’d be wrong or maybe just really hungry. In SQL Server 2017 (which of course means also in Azure SQL by default) Microsoft introduced graphing to the database engine, but what is it? From the Microsoft documentation:
A graph database is a collection of nodes (or vertices) and edges (or relationships).
I like the term “relationships” in this context as it describes quite succinctly what is going on. For example, let’s say you have a database table containing People, the relationship here could be “is a friend of” that connects 2 people in the table.
There’s nothing this feature brings that SQL Server couldn’t do before, but it does bring some important benefits:
- Traversing the graph makes queries much easier and readable. Previously you would need to laboriously JOIN tables together.
- Relationships don’t require primary keys and foreign keys — all this is handle for you.
- Graph queries can perform better because the database engine can apply certain optimisations.
There’s a few minor differences with regards to the tables. Other than these, the tables basically behave exactly as you would expect and you can run all your normal CRUD operations on them.
First, we need to tell SQL Server that the tables are nodes or edges, and that is as simple as tagging “AS NODE” or “AS EDGE” on the end. So instead of this:
A relationship (or edge) table is similar:
And in fact, an edge table doesn’t even need to have user defined columns:
Relationships and Keys
The second difference is that we don’t have explicit relationships between nodes and edges; there are no foreign keys! You can link any two nodes together with any edge you like. So if you have node tables for people and locations and edge tables for likes and owns, it is now trivial to say “people like locations”, “people like people”, “people own locations” and even “people own people” (though I highly recommend you stay well away from that last one!)
So how does SQL Server know how to relate rows together? Well for that we have some psuedocolumns…
Despite my editor wanting to auto-correct psuedocolumn to pseudocelli (which apparently is a pore or an eye on some sort of primitive insect… definitely not one for the trypophobics out there) it is a real thing. These are additional columns (a bit like calculated columns) that are created for you and that you cannot edit. For example, taking our FooNode table from earlier, let’s run a SELECT * across it:
Notice we have a new $node_id column, the dollar sign prefix is what identifies it as a psuedocolumn. You can effectively forget about the seemingly random set of characters that follows the column name. You can select that column manually with:
As for the edge tables, there’s a couple more psuedocolumns. In addition to $node_id we also have $from_id and $to_id:
These should start to give you an idea how SQL Server is constructing the graph and the relations between nodes.
This is simply a matter of inserting the $node_id of two nodes into an edge table. For example:
In a real situation you wouldn’t have the $node_id values specified this way, you would almost certainly do something like this:
And that’s it! Pick any nodes you like and relate them together.
Querying the Graph
The real power of SQL Server graphing is getting data out of the system. Imagine you were doing this in a classic SQL relational way with foreign keys and joins. I’m sure you’ve all felt the same pain as me when trying to remember the names of the columns (“Is it ‘ID’ or ‘ProductID’?”)
Well now we have the MATCH clause. Given our tables above we can query it like this:
That’s it. It might look a bit daunting at first, but when you look closer, all you’re doing is specifying 3 things. A node, a relationship and another node.
You can go one step further too and traverse the graph over a second hop:
Now imagine how that would look as a series of JOINs, oh… the horror!
OK, so the above is all a bit dry, how about we do something a little more substantial and tangible. Let’s say we want to create a social media platform that lets people make friends and rate locations around your city. First things first… creating the nodes and for this all we need are people and locations.
Now the edges, our relationships are simply friends and ratings.
And since empty tables are useless without data, let us seed some in there.
Now since we’re being friendly, we should make sure our fake people are also friendly. To do this we are going to add in some relationships by inserting rows into our edge tables. In our case, Alice is friends with everyone we will link here with everyone else using the $node_id values from the node tables, these are effectively our keys.
Since everyone loves the Tyne Bar, everyone has visited and given it a rating of 5 out of 5
Most people are sensible, so they don’t particularly like Gotham Town though. Well, apart from Bob who, let’s face it, is a bit of an animal.
Oh, and Charlene is a big drinker who also love Tilleys.
Right, now what to do with all this lovely data? All the above is just boring data that we could have created with primary keys and foreign keys, the real power of SQL Server graphing is in the querying. Let’s ask some interesting questions of our database.
First, we will start with something simple, get the average rating for all locations and order it best to worse
Which gives us a nice list of places we might like to visit.
OK, something a little more complex, how about we get a list of all locations that were rated 5 stars by Alice’s friends?
Note the double hop over the graph here from people to people to locations. Another point is that we need to alias the People table as we are using it twice. This gives us two locations
It’s important to note that relationships are directional. So in this context, think of the friendship relationship more like the way Twitter works rather than Facebook, in that you can follow someone but they don’t necessarily follow you back. So if we naively run this query to get all the people who rated Tilleys, it will return an empty result set.
So, how would we change the friendship to be more Facebook style? There’s a couple of ways.
- Add the reciprocal relationship, so Alice is friends with Bob and Bob is friends with Alice.
- When querying, make sure you always check both directions.
For the second option, a query to find out all of Alice’s friends that are also friends with her might look something like this…
Or another variant:
Other Things to Think About
It’s not all fluffy bunnies though, there are some downsides.
Keys and Indexes
We can (and should) still be creating primary keys and indexes on both NODE and EDGE tables and you can also include the psuedocolumns. Remember that they’re still just tables and there’s no reason people can’t treat them that way.
With our example above, it’s possible to create a relationship that says a location can be friends with a person. That really doesn’t make sense unless we’re talking about sentient buildings, and without super advanced AI or haunted houses, we don’t really want that to be allowed. So how would you stop it? As far as I can tell, there isn’t a good way to do this, which leaves us two options:
- Create INSERT triggers on the edge tables to validate your from and to tables. I’m not a big fan of these generally, there’s usually a better way.
- Just allow the relationships to be created. If you’re not going to write queries to retrieve these values, then there’s an argument to have that they don’t really matter.
Which one you choose depends on your business case, and maybe how much you trust your developers to not create silly relationships.
As you might imagine Microsoft are not sitting on their hands and are working on improvements to this feature. There’s a few key changes in SQL Server 2019 but the main one is almost certainly the ability to add edge constraints. This feature will fix the issue above and prevent queries being able to create those weird relationships.
I hope this has served as a handy introduction into SQL Server graph processing and encourages you to use it in your application. If you do then please get in touch and let me know, I’m keen on learning how people are using this in a production capacity.
My reasons for writing this were two-fold. Firstly, I wanted to learn this myself, solidify my own knowledge and pass that on to others. I don’t think there can be too many tutorials online.
Secondly, this originally came from a question posted on Stack Overflow titled Querying SQL Server 2017 graph tables from Entity Framework. My initial reaction to that was “Hell no, stay away from combining those two things” but after thinking about it a little deeper, I wondered if I could make a pull request to the Entity Framework Core source code to support this feature. Since then I have submitted a pull request but it will be a while before it gets reviewed by the EF team as they are currently planning for v2.2 and after that comes v3. I don’t know if my code will get merged but I wanted to tell the story of how I went through the open source journey and hopefully encourage others to try it out too. After all, isn’t that one of the best things about open source?