Microsoft SQL Server’s Graph — An attempt that fell short (for now)

TRAN Ngoc Thach
Jun 29, 2020 · 9 min read

Disclaimer: The opinion in this article is mine alone, based on my limited exposure to SQL Server’s Graph. I am in no way an expert on SQL Server. I might be biased toward Neo4j.

Introduction

With the increasing use of Graph Database and the dominance of Neo4j in this niche market, as well as its significant performance gain in querying highly connected data, it is understandable Microsoft does not want to be left out of this mega-trend.

Starting from SQL Server 2017, SQL Server offered Graph functionalities, with the introduction of theMATCH keyword. In 2019 version, among others, two new notable features were added: Edge Constraints and the use of derived tables in MATCH queries. Diving in this technology long enough, I have an impression that, as of now:

  • The Graph support in SQL Server are still far from a full-fledged Graph Database, e.g. Neo4j.

Let’s get into details!

Pros & Cons

TheMATCH keyword is, in my opinion, merely a Syntatic Sugar. Rather than:

SELECT *
FROM NodeTable1 nt1 JOIN EdgeTable et ON nt1.$node_id = et.$from_id JOIN NodeTable2 nt2 ON nt2.$node_id = et.$to_id

… users can pretty much shorten the query as:

SELECT *
FROM NodeTable1 nt1, EdgeTable et, NodeTable2 nt2
WHERE MATCH(nt1-(et)->nt2)

The shortened query is undoubtedly pleasant to the eyes. However, in .NET world, Entity Framework may be preferred for data-oriented software applications; as a result of which there is no need for SQL plain-text commands. Developers just let the framework generate optimal queries for them while they are focused on high-level abstractions. By this logic, the pleasing effect of MATCH is suddenly irrelevant. Worse, at the moment, there is even no Roadmap in supporting these Graph capabilities in Entity Framework (Core).

Comparing to Neo4j, looking again at the above queries, it is unavoidable for users to bend their Graph mindset into Relational Database world. We still have to declare in advance which tables, in FROM clause, to carry out the matching (Should it rather be the Graph considered as a ‘traversable whole thing’ and one doesn’t need to care about detailed data organization?). When knowing the arrow direction in MATCH is a must, users come to realize the 1–1 mapping between MATCH clause and the original JOIN clause, in the sense that one has to distinguish the Source Node and the Target Node. In contrast, Neo4j’s Cypher syntax allows to disregard this distinction. All those are not necessarily a game-changer; but simply a matter of convenience and joy of development. (Later, in the 1st Study Case, we’ll see we have to do a UNION ALL because of this.)

One key to high performance of native Graph Database in traversing connected data is index-free adjacency. Unfortunately, there is no such thing with Graph feature in SQL Server:

We are not maintaining an adjacency list on every node; instead we are storing edge data in tables. Because it is a relational database, storing data in the form of tables was a more natural choice for us

Another annoying issue is the lack of visualization tools tailored toward Graph, for both schema and data. As a workaround, users can still create a Schema Diagram in SSMS, but those Nodes and Edges Tables end up as individual, disconnected tables even in case they have Edge Constraints. Among other things, Graph technology should also bring visual appeals, facilitating users to construct correct queries. Lacking such tools is frustrating and detrimental to developers’ productivity.

Diagram for Graph Tables

So what do we do to get the Graph schema? I guess we open up each Edge Table to manually see the constraints, and then reason on them, with the help of a piece of paper, to figure out the whole Graph schema.

Graph’s Edge Constraints

Regarding data visualization, I tried Microsoft Power BI with Force-Directed Graph addon. However, this tool is not free, and it does not support SQL Server’s Graph capabilities out-of-the-box, meaning it sees the Nodes and Edges tables as normal database tables. Examining closely the column lists, there are weird-looking column names, e.g. graph_id_...,from_obj_id.... Those are internal columns, automatically generated when creating Node/Edge tables, inaccessible from outside. An error is raised, as below, by Power BI’s Get Data function if accessing those columns.

Power BI — “Get Data” from SQL Server’s Graph tables

To workaround this issue, one must create a Database View containing only relevant/reachable columns, e.g. from_id and to_id in Edge Table, node_id in Node Table. Then in Power BI, extract the data through that View. Hopefully I am not mistaken, Power BI’s Force Directed Graph seems to require one single table which consists of columns Source, Target, Weight, Source Type, Target Type, Link Type. Our sample schema is simplistic; thus creating this single table is trivial. What if the schema contained +20 Nodes Tables and +10 Edges Table, could we end up with a single table to visualize the whole Graph with Power BI?

#Updated on 04.07.2020: On second thought, this is surely possible even though it is still cumbersome. One should start at Edges Tables, in each of which the $from_id and $to_id are joined against relevant Nodes Tables in order to translate into shared properties of all Nodes, e.g. Name or Id. Then do UNION ALL for all to combine into the final single table required by Power BI.

Querying for nodes and edges is one of possibilities of Graph technology. What can make a Graph database stand out is the built-in yet extensible support for Graph Algorithms, such as PageRank and Louvain Community Detection (aka. Louvain modularity). Sadly again, no such analytics functionalities available in SQL Server’s Graph, as said:

Some graph databases provide dedicated graph analytical functions like “shortest path” or “page rank.” SQL Graph does not provide any such functions in this release. Again, T-SQL loops and temp tables may be used to write a workaround for these scenarios.

In Neo4j, those algorithms are readily available at production grade, thanks to Graph Data Science Library. Plus, it is open-sourced. Generally, documentation is helpful, but sometime not sufficient. With open-source, developers can download and dive into how a specific algorithm is implemented; or even recompile the library, with additional logging info, to further understand. Implementing PageRank in SQL is possible, but more complex algorithm such as Louvain modularity can be challenging. Nevertheless, many software engineers prefer to be more bothered with business logic, instead of being bogged down with low-level technical details.

Last but not least, in my view, Common Table Expression is a kind of derived tables. Since SQL Server 2019, this technique is officially supposed to be workable with Graph:

Common Table Expression (CTE) with Graph tables and MATCH clause

But if following VIEW creation, it is do-able:

View creation with Graph tables and MATCH clause

I think this is a sign of inconsistency. The 2nd approach requires a permanently created View as it is not possible to create Temporary View in SQL Server. Normally, users overcome this with CTE, but again CTE doesn’t work with Graph, as showed above.

Study Cases

An sample scenario: Students and Teachers.

Neo4j’s Cypher:

CREATE (S1:NStudents {name: "S1"}), (S2:NStudents {name: "S2"}), (T1:NTeachers {name: "T1"}), (S3:NStudents {name: "S3"}), (T2:NTeachers {name: "T2"}), (S4:NStudents {name: "S4"}), (S1)-[:TALKS]->(S2), (S2)-[:TALKS]->(T1), (T1)-[:TALKS]->(T2), (T2)-[:TALKS]->(S4), (S3)-[:TALKS]->(T1)

SQL Server 2019’s Graph:

CREATE TABLE NStudents ([Name] NVARCHAR(MAX) NOT NULL, [Score] INT NOT NULL) AS NODE;CREATE TABLE NTeachers ([Name] NVARCHAR(MAX) NOT NULL, [Salary] FLOAT NOT NULL) AS NODE;CREATE TABLE Talks (CONSTRAINT EC_Talk CONNECTION (NStudents TO NStudents, NStudents TO NTeachers, NTeachers TO NStudents, NTeachers TO NTeachers) ON DELETE CASCADE) AS EDGE;INSERT INTO NStudents VALUES ('S1',1),('S2',2),('S3',3),('S4',4);
INSERT INTO NTeachers VALUES ('T1',123), ('T2',456);
INSERT INTO Talks VALUES (
(SELECT $node_id FROM NStudents WHERE [Name] = 'S1'),
(SELECT $node_id FROM NStudents WHERE [Name] = 'S2'));
INSERT INTO Talks VALUES (
(SELECT $node_id FROM NStudents WHERE [Name] = 'S2'),
(SELECT $node_id FROM NTeachers WHERE [Name] = 'T1'));
INSERT INTO Talks VALUES (
(SELECT $node_id FROM NTeachers WHERE [Name] = 'T1'),
(SELECT $node_id FROM NTeachers WHERE [Name] = 'T2'));
INSERT INTO Talks VALUES (
(SELECT $node_id FROM NTeachers WHERE [Name] = 'T2'),
(SELECT $node_id FROM NStudents WHERE [Name] = 'S4'));
INSERT INTO Talks VALUES (
(SELECT $node_id FROM NStudents WHERE [Name] = 'S3'),
(SELECT $node_id FROM NTeachers WHERE [Name] = 'T1'));

Case 1: All incoming and outgoing connections of nodes

Motivation: Counting these connections for each node is one way to find out which ones are most important.

Neo4j’sCypher:

MATCH (n)-[r]-()
RETURN n.name, COUNT(r) AS allCons
ORDER BY allCons DESC

SQL Server 2019’s Graph:

--Create a View first, for convenience purpose.
CREATE VIEW view_AllPeople
AS
SELECT $node_id AS [node_id], [Name]
FROM NStudents
UNION ALL
SELECT $node_id AS [node_id], [Name]
FROM NTeachers;
--Query using the View.
WITH CTE([Name])
AS
(
SELECT ap1.[Name]
FROM view_AllPeople ap1, Talks t, view_AllPeople ap2
WHERE MATCH(ap1-(t)->ap2)
UNION ALL
SELECT ap1.[Name]
FROM view_AllPeople ap1, Talks t, view_AllPeople ap2
WHERE MATCH(ap1<-(t)-ap2)
)
SELECT [Name], COUNT(*) AS allConns
FROM CTE
GROUP BY [Name]
ORDER BY allConns DESC

Remarks: The SQL version is not only longer, but also more inconvenient as the direction of arrows in MATCH must be always taken into account.

Case 2: Top longest paths

Motivation: Long dependency chains are prone to be fragile. For instance, library dependency.

Neo4j’sCypher:

// The WHERE is to filter out duplicate paths, e.g. A->B = B->A.
MATCH p=(n)-[*1..]-(m)
WHERE ID(n) < ID(m)
RETURN n.name, m.name, length(p) AS len, [x IN nodes(p) | x.name] AS node_list
ORDER BY len DESC

SQL Server 2019’s Graph:

WITH CTE(from_id, to_id, [Level], [Path])
AS
(
SELECT $from_id, $to_id, 1 AS [Level], CONVERT(NVARCHAR(MAX), $to_id) AS [Path]
FROM Talks
UNION ALL
SELECT t.$from_id, t.$to_id, [Level]+1, CONVERT(NVARCHAR(MAX), CTE.[Path] + ',' + CONVERT(NVARCHAR(MAX), $to_id))
FROM Talks t JOIN CTE ON t.$to_id = CTE.[from_id]
)
SELECT vap.[Name], [Level], [Path]
FROM CTE JOIN (SELECT MAX(c.[Level]) AS maxLevel FROM CTE c GROUP BY c.[from_id]) myMax ON CTE.[Level] = myMax.maxLevel JOIN
view_AllPeople vap ON CTE.[from_id] = vap.[node_id]
ORDER BY [Level] DESC

Remarks: In SQL Server version, one has to recourse to Recursive Common Table Expression technique. It pretty much embraces Relational Database mindset to solve Graph problems.

Conclusion

In Graph domain, a SQL Server’s SQL command, with Graph feature utilized or not, used to tackle Graph problems, is typically much lengthier as well as more complex, in comparison to Neo4j’s Cypher. This leads to an implication that the code will be more time-consuming to develop, and difficult to later maintain and extend. Another engineer or even the original one, looking at the same code snippet one month later, will find it frustrating to grasp all of its aspects. A well-known term to describe this situation is technical debt.

Combining all of the aforementioned points, from code aspect, to feature-support, to tool/library ecosystem, SQL Server’s Graph functionality at present, although being encouraging, falls short of expectations.

SQL Server is extremely mature with respect to Relational Database, but clearly a newbie in Graph Database. This Graph support is probably deemed to be contained in Relational Database mentality.

The Startup

Get smarter at building your thing. Join The Startup’s +793K followers.

Sign up for Top 10 Stories

By The Startup

Get smarter at building your thing. Subscribe to receive The Startup's top 10 most read stories — delivered straight into your inbox, once a week. Take a look.

By signing up, you will create a Medium account if you don’t already have one. Review our Privacy Policy for more information about our privacy practices.

Check your inbox
Medium sent you an email at to complete your subscription.

TRAN Ngoc Thach

Written by

I’m a Data Engineer, working at KPMG, Frankfurt am Main, Germany. (https://www.linkedin.com/in/thachngoctran/)

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +793K followers.

TRAN Ngoc Thach

Written by

I’m a Data Engineer, working at KPMG, Frankfurt am Main, Germany. (https://www.linkedin.com/in/thachngoctran/)

The Startup

Get smarter at building your thing. Follow to join The Startup’s +8 million monthly readers & +793K followers.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store