Speech To Cypher

An experiment on how to integrate speech to text with the Cypher query language

Valerio Piccioni
LARUS
10 min read · Oct 21, 2021


The goal of this article is to build a pipeline that lets users express Cypher queries (or at least a subset of them) by speaking a structured natural language. By structured natural language I mean a language that uses English words (the natural part) and combines them following a grammar derived from Cypher (the structured part).

What is Cypher

Cypher is a declarative graph query language that allows expressive and efficient querying of a Neo4j graph database. Cypher is a relatively simple but still very powerful language that focuses primarily on expressing what to retrieve from a graph rather than how to retrieve it. It borrows expression approaches from languages such as SPARQL, SQL, Haskell and Python.

Queries are built up using various clauses. Clauses are chained together, and they feed intermediate result sets to one another. For example, the matching variables from one MATCH clause will be the context that the next clause exists in. You can learn more about Cypher at this link.

Speech to text

Speech to Text (STT) is a subfield of computer science that develops methodologies and technologies enabling computers to recognize spoken language and convert it into text. A lot of common technologies nowadays use STT models, from general-purpose voice assistants like Siri to memo-writing applications. Several websites and companies expose APIs (some as a paid service) to transform speech into text. For our experiments we will use a Python library called SpeechRecognition, which makes it easy to call some of those exposed APIs to get the text (Bing Speech, Google Web Speech, IBM Speech to Text, etc.).

Structured Natural Language

Some Patterns

As said in the introduction, we need to define a new grammar in order to express Cypher queries in English words. In Neo4j, graphs are defined as property graphs.

A property graph may be defined in graph theoretical terms as a directed, vertex-labeled, edge-labeled multigraph with self-edges, where edges have their own identity. In the property graph, we use the term node to denote a vertex, and relationship to denote an edge.

A node is described using a pair of parentheses, and is typically given a name; it can also have one or multiple labels.

() | (a) | (a:User) | (a:User:Admin)

In order to express this in our Structured Natural Language (SNL) we will use the keyword NODE (keywords will be written in uppercase italics to highlight them):

NODE | NODE a | NODE a LABEL user | NODE a LABEL user admin
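To make the mapping concrete, here is a minimal sketch (a hypothetical helper, not part of the article's code) of how NODE/LABEL tokens could be rendered back into the Cypher node syntax:

```python
def node_to_cypher(name=None, labels=None):
    """Render an SNL node description as a Cypher node pattern.

    Hypothetical helper: name and labels mirror the
    'NODE a LABEL user admin' form above.
    """
    inner = name or ""
    for label in labels or []:
        # Cypher labels are conventionally capitalized
        inner += ":" + label.capitalize()
    return "(" + inner + ")"

print(node_to_cypher())                        # ()
print(node_to_cypher("a"))                     # (a)
print(node_to_cypher("a", ["user"]))           # (a:User)
print(node_to_cypher("a", ["user", "admin"]))  # (a:User:Admin)
```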

Cypher uses properties on nodes and relationships; these can be expressed in patterns using a map-construct: curly brackets surrounding a number of key-expression pairs, separated by commas.

(a{name:'Andres',sport:'Brazilian Jiu Jitsu'})

(a)-[{blocked:false}]->(b)

In our SNL a map is constructed with the MAP, KEY and VALUE keywords:

NODE a MAP KEY name VALUE Andres KEY sport VALUE Brazilian Jiu Jitsu

NODE a RELATION RIGHT MAP KEY blocked VALUE FALSE NODE b
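A hedged sketch of how the collected KEY/VALUE pairs might be rendered back into a Cypher map-construct (the helper and its conventions, such as quoting strings and lowercasing booleans, are illustrative assumptions):

```python
def map_to_cypher(pairs):
    """Render a list of (key, value) tuples as a Cypher map-construct.

    Hypothetical helper: pairs are what a parser might collect after
    the MAP keyword; strings get quoted, booleans become true/false.
    """
    parts = []
    for key, value in pairs:
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        elif isinstance(value, str):
            rendered = "'" + value + "'"
        else:
            rendered = str(value)
        parts.append(key + ":" + rendered)
    return "{" + ",".join(parts) + "}"

print(map_to_cypher([("name", "Andres"), ("sport", "Brazilian Jiu Jitsu")]))
# {name:'Andres',sport:'Brazilian Jiu Jitsu'}
print(map_to_cypher([("blocked", False)]))
# {blocked:false}
```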

The simplest way to describe a relationship is with an arrow between two nodes, as in the previous examples. Using this technique, you can describe that the relationship should exist, along with its directionality. Direction, if present, always goes right after the RELATION keyword in our SNL and is defined by the LEFT and RIGHT keywords. The boolean values TRUE and FALSE are also keywords.

Much like nodes, relationships can have types and names. Types are described with the keyword TYPE. If a relationship could have any one of a set of types, they can all be listed in the pattern, separated by the pipe symbol. We can do the same in our SNL:

(a)-[r:KNOWS]->(b) | (a)-[r:KNOWS|LOVES]->(b)

becomes

NODE a RELATION RIGHT r TYPE knows NODE b

NODE a RELATION RIGHT r TYPE knows loves NODE b

A sequence of many nodes and relationships can be described by specifying a length in the relationship description of a pattern. We can do the same in our SNL with the LENGTH, START and END keywords:

(a)-[*]->(b) | (a)-[*3..]->(b) | (a)-[*..5]->(b) | (a)-[*3..5]->(b)

becomes

NODE a RELATION RIGHT LENGTH NODE b
NODE a RELATION RIGHT LENGTH START three NODE b
NODE a RELATION RIGHT LENGTH END five NODE b
NODE a RELATION RIGHT LENGTH START three END five NODE b
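Assuming the spoken number words have already been converted to integers, a hypothetical helper could render the four length markers above like this:

```python
def length_to_cypher(start=None, end=None):
    """Render SNL LENGTH/START/END keywords as a Cypher
    variable-length marker.

    Hypothetical helper mirroring the four patterns above:
    no bounds, lower bound only, upper bound only, both bounds.
    """
    if start is None and end is None:
        return "*"
    left = "" if start is None else str(start)
    right = "" if end is None else str(end)
    return "*" + left + ".." + right

print(length_to_cypher())         # *
print(length_to_cypher(start=3))  # *3..
print(length_to_cypher(end=5))    # *..5
print(length_to_cypher(3, 5))     # *3..5
```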

Some Clauses

As stated earlier, Cypher queries are built up using various clauses. We will define some of them here.

MATCH (& WHERE)

MATCH specifies the patterns to search for in the database. This is the primary way of getting data, and it is often coupled with a WHERE clause. In our SNL it is used as a keyword. The RETURN clause keyword instead specifies the result set we want. In Cypher, a dot notation is used to access properties of nodes or relationships; we express it with DOT in our SNL:

MATCH (n) RETURN n | MATCH (movie:Movie) RETURN movie.title

becomes

MATCH NODE n RETURN n

MATCH NODE movie LABEL movie RETURN movie DOT title
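A small sketch (hypothetical, not the article's parser) of how DOT keywords could be collapsed into Cypher property access during translation:

```python
def dots_to_cypher(tokens):
    """Collapse SNL DOT keywords into Cypher dot notation.

    Hypothetical helper: ['movie', 'DOT', 'title'] -> 'movie.title'.
    Assumes DOT is always followed by a property name.
    """
    out = []
    pending_dot = False
    for tok in tokens:
        if tok == "DOT":
            pending_dot = True
        elif pending_dot:
            out[-1] += "." + tok   # attach property to preceding name
            pending_dot = False
        else:
            out.append(tok)
    return " ".join(out)

print(dots_to_cypher(["RETURN", "movie", "DOT", "title"]))
# RETURN movie.title
```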

The WHERE clause adds constraints to the pattern, filtering it; very complex expressions can be built with it. For our purposes we will limit the expressions to basic arithmetic operators [+, -, *, /], some basic boolean operators (NOT, OR, AND, XOR), the conditions > (keyword GREATER) and < (keyword LESS), and some operators on strings (EQUALS, CONTAINS, ENDS WITH and STARTS WITH).

MATCH (n) WHERE n.age < 30 RETURN n

MATCH (n) WHERE n.name STARTS WITH 'pet' RETURN n

becomes

MATCH NODE n WHERE n DOT age LESS thirty RETURN n

MATCH NODE n WHERE n DOT name STARTS WITH pet RETURN n
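Note that numbers are spoken as words ("thirty" rather than 30), so the translation step needs a word-to-number conversion. A minimal sketch, covering only the small values used in these examples (a real implementation would handle hundreds, ordinals, etc.):

```python
# Minimal word-to-number tables; an illustrative assumption,
# not the article's actual conversion code.
UNITS = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
         "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50,
        "sixty": 60, "seventy": 70, "eighty": 80, "ninety": 90}

def words_to_number(phrase):
    """Convert a short spoken number phrase like 'thirty one' to an int."""
    total = 0
    for word in phrase.split():
        if word in TENS:
            total += TENS[word]
        elif word in UNITS:
            total += UNITS[word]
        else:
            raise ValueError("unknown number word: " + word)
    return total

print(words_to_number("thirty"))      # 30
print(words_to_number("thirty one"))  # 31
```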

CREATE

The CREATE clause can be used to create graph elements such as nodes or relationships.

CREATE (n:Person {name:'Andres',title:'developer'})

MATCH (a:Person),(b:Person) WHERE a.name='Andres' AND b.name='Jack' CREATE (a)-[r:KNOWS]->(b) RETURN r

becomes

CREATE NODE n LABEL person MAP KEY name VALUE Andres KEY title VALUE developer

MATCH NODE a LABEL person NODE b LABEL person WHERE a DOT name EQUALS Andres AND b DOT name EQUALS Jack CREATE NODE a RELATION RIGHT r TYPE knows NODE b RETURN r

Similar to CREATE there is the MERGE clause, which ensures that a pattern exists; if it does not, it will be created. Other clauses used are:

DELETE/DETACH DELETE

It is used to delete graph elements. Since you cannot delete a node without also deleting the relationships that start or end on that node, you must either explicitly delete the relationships or use the DETACH DELETE form.

MATCH (n{name:'Jack'}) DETACH DELETE n

MATCH NODE n MAP KEY name VALUE Jack DETACH DELETE n

SET

It is used to update labels and/or properties on nodes.

MATCH (student:Student {name:'Stefan'}) SET student:German RETURN student

MATCH NODE student LABEL student MAP KEY name VALUE Stefan SET student LABEL german RETURN student

REMOVE

It is used to remove properties and labels from graph elements.

MATCH (person:Person {name:'Andres'}) REMOVE person.age RETURN person

MATCH NODE person LABEL person MAP KEY name VALUE Andres REMOVE person DOT age RETURN person

WAV2VEC2

As said in the speech to text section, we wanted to exploit some common speech-to-text APIs in order to transform voice into one of our SNL queries. Suppose we have the following query (in my not-so-good English):

output.wav

That translates (or should, in my mind) to:

MATCH NODE movie LABEL movie RELATION LEFT r TYPE directed produced NODE person RETURN person r movie

And this is what two of the most common APIs (Wit and Google) return:

WIT:

men should not movie labeled movie relation left are type direct produced not person return person or movie

Google:

mention of movie full movie nation type directed produced person are movie

At first I thought the problem was only my English pronunciation, so I tried a simpler query with a natural English voice:

example.wav

and I got

WIT:

create node

Google:

create note

The problem here is that these APIs use models trained on real natural-language audio, so it is "normal" for them to miss, for example, a single-letter word (like "n" in this case). It is also a lot more common in English to create notes than nodes. This means that off-the-shelf natural-language APIs are not useful for us; we need a custom model that will correctly transcribe our SNL.

To do that, we will start from a deep learning model pre-trained on natural language and fine-tune it to understand our SNL; this way we only need to build a much smaller dataset of audio queries. For the dataset I recorded 26 different queries from 4 different speakers: 2 synthetic natural English speakers, me, and a friend of mine. The queries of 3 out of the 4 speakers were used for training (so 69 queries) and the remaining ones (26) for validation. The model used is a wav2vec2 model. Very briefly, the model is composed of three different parts:

Feature Extractor

A multi-layer convolutional feature encoder that takes raw audio as input and outputs latent speech representations for T time-steps. During our fine-tuning we will freeze this part of the network.

Contextualized representations

A Transformer that builds representations (from the output of the Feature Extractor) which capture information from the entire sequence.

Quantization module

The output of the Feature Extractor is discretized with a quantization module that represents the targets in the self-supervised objective. This is done through the Gumbel-Max trick (Gumbel-Softmax). The paper's results show that jointly learning discrete speech units with contextualized representations achieves substantially better results than using fixed units learned in a prior step.

The pretrained model was trained by masking some of the input to the Transformer, in a similar way as is done with BERT. The input to the quantization module, instead, is not masked. The loss used during pre-training was a composite loss: the sum of a contrastive loss, used to learn representations of speech audio by identifying the true quantized latent speech representation for a masked time step within a set of distractors, and a diversity loss, used to encourage the model to use entries equally often.

image credit: https://arxiv.org/pdf/2006.11477.pdf

For fine-tuning, the network is trained differently: a linear projection is added on top of the context network into C classes representing the vocabulary of the task. The model is then optimized by minimizing a CTC loss.

The model was fine-tuned with the PyTorch framework and the PyTorch Lightning Flash library. It was downloaded from the pretrained models on the Hugging Face site, and it requires wav files with a 16 kHz sample rate.
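Since the model expects 16 kHz wav input, it is worth validating recordings before feeding them in. A small stdlib-only sketch (the helper name and the check are illustrative assumptions, not the article's code):

```python
import wave

# The pretrained wav2vec2 checkpoints used here expect 16 kHz audio.
REQUIRED_RATE = 16000

def check_sample_rate(path):
    """Return True if a wav file matches the sample rate the model expects."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate() == REQUIRED_RATE

# Demo: write one second of 16 kHz, 16-bit mono silence and verify it.
with wave.open("demo.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)              # 16-bit samples
    out.setframerate(REQUIRED_RATE)
    out.writeframes(b"\x00\x00" * REQUIRED_RATE)

print(check_sample_rate("demo.wav"))  # True
```

In practice, audio recorded at another rate would need resampling (e.g. with librosa or torchaudio) rather than just a rejection.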

PLY

Now that we have a model to translate spoken SNL queries into text, we need to transform that text into a proper Cypher query. We are going to do this by building a parser. Everyone who has passed a compilers exam must have heard of Lex and Yacc at least once. Lex is a lexical analyzer that performs the first tokenization stage (word analysis) of the SNL text. It is followed by the parsing stage, done with Yacc, for full syntactic analysis (phrase analysis). Yacc and similar programs (largely reimplementations) have been very popular and have been rewritten for other languages, including Java, Python, Ruby and Go.

For our experiment we will use PLY, the Python implementation of Lex and Yacc, since we are already using Python for the wav2vec2 fine-tuning. The concept is practically the same as in the C version of Lex and Yacc. You need to define tokens for your grammar, which can be done with regexes or a dictionary for specific keywords (keywords in SNL are also keyword tokens in our grammar). After that you need specific rules that will construct the parse tree of the grammar. In C Yacc you define a rule with lines like these:

E : E '+' E {$$ = $1 + $3;}

  | E '-' E {$$ = $1 - $3;}

  | E '*' E {$$ = $1 * $3;}

  | E '/' E {$$ = $1 / $3;}

The left part is the rule itself, and the right part is the "action" to be performed when the rule is matched. In PLY you define a function whose first part (its docstring) is a string defining the rule, and the body of the function performs the "action".

example of PLY yacc rule function
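For readers without the screenshot, a hedged sketch of what such a rule function looks like (the rule name and grammar production are illustrative, not the article's actual grammar; PLY normally passes a YaccProduction object, which indexes like the plain list used in the demo):

```python
# A PLY-style grammar rule: the docstring holds the BNF production,
# the body performs the "action" on the parse values.
def p_node(p):
    """node : NODE NAME"""
    # p[1] is the NODE keyword, p[2] the node name;
    # p[0] receives the constructed Cypher fragment.
    p[0] = "(" + p[2] + ")"

# PLY would call this with a YaccProduction object; a plain list with
# the same indexing shows what the action computes.
p = [None, "NODE", "a"]
p_node(p)
print(p[0])  # (a)
```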

Conclusions and Considerations

Putting everything together, we can now record a spoken SNL query from a hypothetical user (we can do that in Python with the PyAudio library), which generates a 16 kHz output.wav file. We then pass that file to our fine-tuned wav2vec2 model, which outputs the corresponding text of the query. After that, our PLY Yacc parser parses it and produces the equivalent Cypher query. We can then run the query against a Neo4j database with the help of the Neo4j Python driver. Here are some screenshot examples from a subset of the Movie database used as a playground graph in the Neo4j tutorials. The query result graph is converted into a NetworkX graph and plotted with the Matplotlib library.

A video example of Speech to Cypher

Not all of the Cypher grammar was taken into consideration; many things were omitted, for example clauses like AS, WITH and CALL, or variable types like lists and paths. This was done mainly for time reasons and because the point was just to present a basic speech-to-cypher demo, not a complete grammar conversion.

Nonetheless, there are clear limitations to what can be done with voice regarding Cypher queries. Suppose, for example, that we want to match a node whose property matches a specific string. We could not express in SNL whether the string should be Title Case, uppercase, lowercase, or a mix of everything with special characters, like in a nickname. To do that we would need to overcomplicate the grammar, making speech queries longer and losing the benefits of voice. It is also clear that overly complex WHERE conditions are not well suited to SNL. Another consideration is that number words are currently always converted to numbers (e.g. "thirty one" becomes 31), but we might sometimes want to keep them as strings (like "31" or "thirty one"). To support that we would need to add primitive type keywords for variable values, which I think is still acceptable considering what has been said.

To close the circle, I think SNL is well suited only for short, repetitive queries, the kind that are run when first analyzing a graph or checking simple node/relationship properties over a well-known graph schema, since voice expressions should usually feel "natural" without too much thinking.

All the code for the experiment can be found here on GitHub.
