I think that most of us software developers, as kids, always wanted to know how things were made on the inside. Since I was a child, I always wanted to understand how my toys worked. And what did I use to do? Open them up, of course. And, of course, later I wasn’t able to put the pieces back together properly, but that’s not the subject of this post ;). Well, understanding how things work behind the scenes can teach us several things, and in software this is no different: we can study how a specific piece of code was created and mixed together with other code.

In this series of posts I’ll share what I’ve found inside the Neo4J implementation, specifically in the code of Cypher, its query language.

In this first part, I’ll briefly introduce Neo4J and Cypher and then start to explain the internals of its parser and how it works. Since it is a long (very, very long, in fact) subject, part 2 and subsequent parts are coming very soon.

And BTW, this is a very long post, so, if you prefer, go grab a nice cup of coffee or tea, a jar of juice, something to eat, and then come back to read.

First of all, what is Neo4J?

We are all very used to the relational paradigm; however, every day, non-relational databases receive more and more attention from the community. That happens for several reasons, including better data modeling, improved scalability, skyrocketing performance enhancements and so on. Non-relational databases became popular under the name NOSQL, and several implementations exist, such as Riak, Redis, CouchDB, MongoDB and lots of others.

Each of these implementations has its particularities, and each is better suited to a different kind of problem. Among them, we can highlight Neo4J, which is a fully ACID-compliant graph database. This means your data is stored as a graph (with nodes, relationships and properties on them), giving you high speed through graph traversals (and we all know that storing graph-like data inside a relational database is not a good idea).

For a more interesting introduction to Neo4J, you can check Emil Eifrem’s presentation and this nice InfoQ post.

How to find data on the graph?

So, let’s consider the following problem: “I want to know who are all the employees that pair programmed with a given programmer”.

In order to retrieve information from the graph, Neo4J provides an API which allows us to traverse the graph. With this traversal API, we can write Java code to describe how we want to traverse it:

TraversalDescription description = Traversal.description()
    .relationships(Relationships.PAIRED, Direction.OUTGOING);

description.traverse( startNode ); // Retrieves the traverser

Although the traversal described above works pretty well, people may be confused by all this Java code describing the traversal, and those who want to use Neo4J from JRuby, for instance, might have to adapt the code in order to achieve the same result. Another issue may be a sysadmin wanting to perform ad-hoc queries on the database. How can we make this easy, without having to write Java code? Well, why not have a single query language in which we can define traversals? Enter Cypher, the Neo4J query language.

With Cypher, the traversal above would be described as:

start programmer=(3) match (programmer)-[:PAIRED]->(pair) return pair

In this example, 3 is the id of our query’s starting node. We also match the relationships called PAIRED outgoing from programmer and label the node at the other end as pair. In the end, we’re saying we want to return all the retrieved pairs. That’s it.

We can do even more complex things in our traversal; for instance, we can apply filters on values, ordering, aggregation, pagination and so on. Thus, we can retrieve only the pairs whose age is greater than 30 (assuming we have an age property).

start programmer=(3) match (programmer)-[:PAIRED]->(pair) where pair.age > 30 return pair, count(*) order by age skip 5 limit 10

Imagine doing all these things by writing Java code. Sure, it is possible, but not as straightforward as in Cypher. And believe me, there are many more combinations and complex queries you can write.

Defining Cypher

Cypher is an external DSL, just like HQL (Hibernate Query Language) and the AspectJ syntax, for instance. Therefore, somewhere it must be parsed and checked for correctness, and then executed, in Cypher’s case retrieving the results we asked for.

But in order to write an external DSL, we should define its grammar, which gives the guidelines of how the language is supposed to be structured and what is and isn’t valid. To express this definition, we can use some variation of EBNF syntax, which provides a clear way to expose the language definition.

First of all, let’s choose a small subset of Cypher to describe, which will be the focus of this first post: the start clause, where we define the id we want to retrieve and name our starting point programmer. Taking a look at the EBNF definition, we would have something like the following:

start ::= "start" {"," nodeByIds | nodeByIndex | nodeByIndexQuery | relsByIds | relsByIndex }

nodeByIds ::= identity "=" "(" {"," number} ")"

nodeByIndex ::= identity "=" "(" identity "," identity "," string ")"

nodeByIndexQuery ::= identity "=" "(" identity "," string ")"

relsByIds ::= identity "=" "<" { "," number } ">"

relsByIndex ::= identity "=" "<" identity "," identity "," string ">"

Above, identity is an identifier, for instance the “programmer” identifier; “{}” means that whatever is inside can appear repeatedly, with a separator; and | denotes an alternative production.

With that information, we now know that in order to write a proper start clause, we can’t use a semicolon as a separator, we must use parentheses to group the ids, and we can have as many ids as we want. Easy, right?

Now that we have a grammar defined for the start clause, we need a way to put it into action, which means we must parse a given instruction, check its validity and then take some action based on what was informed.

Parsing, Cypher and Parser Combinators

One possibility to parse the query is to use a parser generator tool, like Antlr or YACC. However, some functional programming languages, such as Scala and Haskell, provide a built-in way to write parsers in the language itself. That tool is known as a parser combinator. In this case, Cypher uses Scala’s parser combinator library. If you don’t have any familiarity with parsers and parser combinators, you can check this excellent article by Odersky and Moors.
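To get a feel for the idea before diving into Cypher’s code, here is a tiny, self-contained sketch. This is not Cypher code: ToyParser and its grammar are made up for illustration, and it assumes a setup where Scala’s parser combinator library is available (it shipped with the standard library at the time this code was written; newer Scala versions provide it as the separate scala-parser-combinators module).

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// A toy grammar, NOT from Cypher: a parenthesized, comma-separated
// list of numbers, such as "(1, 2, 3)".
object ToyParser extends JavaTokenParsers {
  // ~ sequences parsers; rep1sep means "one or more, with a separator";
  // wholeNumber is a ready-made token parser inherited from JavaTokenParsers
  def ids = "(" ~ rep1sep(wholeNumber, ",") ~ ")"
}

// parseAll succeeds on valid input and fails on invalid input:
// ToyParser.parseAll(ToyParser.ids, "(1, 2, 3)").successful  // true
// ToyParser.parseAll(ToyParser.ids, "(1; 2)").successful     // false
```

Notice how the grammar rule reads almost exactly like its EBNF counterpart; that closeness is the main selling point of parser combinators.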

Remember the start clause previously described? We can describe it in a very similar way in Scala. In fact, Neo4J’s Cypher codebase has a trait (which is very similar to a Java interface) called StartClause, with the definition we wrote above, with just a few changes in syntax.

def start: Parser[Start] = ignoreCase("start") ~> rep1sep(nodeByIds | nodeByIndex | nodeByIndexQuery | relsByIds | relsByIndex, ",")

def nodeByIds = identity ~ "=" ~ "(" ~ rep1sep(wholeNumber, ",") ~ ")"

def nodeByIndex = identity ~ "=" ~ "(" ~ identity ~ "," ~ identity ~ "," ~ string ~ ")"

def nodeByIndexQuery = identity ~ "=" ~ "(" ~ identity ~ "," ~ string ~ ")"

def relsByIds = identity ~ "=" ~ "<" ~ rep1sep(wholeNumber, ",") ~ ">"

def relsByIndex = identity ~ "=" ~ "<" ~ identity ~ "," ~ identity ~ "," ~ string ~ ">"

The code above looks very similar to the EBNF-like definition we wrote before; we just translated the syntax to the Scala language, with a few nuances. For instance, here we use the ~ method, which represents a sequence; before, we represented a sequence simply by placing elements from left to right. We also used the rep1sep method, which means a repetition with a separator (the same effect as “{}” in the EBNF example).

However, where are these ~ and rep1sep methods defined? In fact, looking at the StartClause trait definition, we have:

trait StartClause extends JavaTokenParsers with Tokens {
  // code here
}

The JavaTokenParsers trait itself inherits from the scala.util.parsing.combinator.Parsers trait (keep that in mind, it will be very important in the future), which defines ~ and rep1sep, amongst some other methods we’ll talk about later.

So far, so good, but I’ve cheated a little bit; the story (and the code) is not yet complete.

While parsing the given input (a query sample) against the defined grammar, we can get back the results and process them later. For instance, we can grab the identifier used in the start clause, the node ids and so on, and then, with this information, it is possible to perform the desired traversal.

Scala’s parser combinators allow us to capture all this information in the form of a semantic model, which can describe it much better. Thus, instead of grabbing bare Strings and a List of node ids, it is possible to transform them into an object representing this information: in the case of a query by ids, we could get an instance of a NodeById class, and in the case of querying by index, we could get a NodeByIndex instance. And hell yeah, we can do this, by using ^^, which allows us to apply a function to the parser result, letting us build the semantic model we want as the AST (Abstract Syntax Tree). In this case, the portion of the tree that represents querying by a node id is an object of the NodeById class.
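As a tiny illustration of ^^ in isolation (again a sketch: ById and ModelParser are hypothetical names, not the real Cypher classes), consider mapping a parsed list of numbers straight into a model object:

```scala
import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical semantic-model class, for illustration only
case class ById(ids: List[Long])

object ModelParser extends JavaTokenParsers {
  // Without ^^ we would get back raw strings; with ^^ we map the
  // parse result into our semantic model object
  def byId: Parser[ById] =
    "(" ~ rep1sep(wholeNumber, ",") ~ ")" ^^ {
      case "(" ~ nums ~ ")" => ById(nums.map(_.toLong))
    }
}

// ModelParser.parseAll(ModelParser.byId, "(1, 2)").get == ById(List(1L, 2L))
```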

We can use the ^^ operator in a pattern matching style, matching the parts of the query we want in order to form any other result, which in the case of the nodeByIds method in Cypher’s StartClause is the following:

def nodeByIds = identity ~ "=" ~ "(" ~ rep1sep(wholeNumber, ",") ~ ")" ^^ {
  case varName ~ "=" ~ "(" ~ id ~ ")" => NodeById(varName, id.map(_.toLong): _*)
}

What happens in the code above is that it matches against what was captured during parsing: varName will be the variable name used in the query, while id is a List of ids, which gets transformed into a splat (varargs) of Longs by the somewhat strange-looking _* (if you’re not used to Scala) and passed to the NodeById constructor. The NodeById instance is then returned from the nodeByIds method.
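For those not used to Scala, here is the _* trick in isolation (Node is a hypothetical class, mirroring NodeById’s shape just for illustration):

```scala
// A hypothetical varargs case class, shaped like NodeById
case class Node(varName: String, id: Long*)

val ids = List(1L, 2L, 3L)

// ids: _* "splats" the List into individual varargs arguments,
// as if we had written Node("programmer", 1L, 2L, 3L)
val node = Node("programmer", ids: _*)
```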

However, since the StartClause trait is composed with the scala.util.parsing.combinator.Parsers trait, the returned NodeById object is promoted to a Parser[NodeById], so parsing can keep going through the rest of the input (the match clause, where clause and so on, recursively; these other parts will be covered in the next post).

Taking a look at the NodeById class, we can see its definition, inside the StartItem.scala file:

case class NodeById(varName:String, id: Long*) extends NodeStartItem(varName)

abstract class NodeStartItem(varName:String) extends StartItem(varName)

abstract sealed class StartItem(val variable:String)

Notice two important things here. First, these classes have no behavior; they just hold the data for the semantic model. Looking at the same file, it is possible to find all the other classes used to form the semantic model of the start clause, such as NodeByIndex, RelationshipById and so on. Second, there is a hierarchy between them: the NodeById class extends NodeStartItem, which comes from StartItem. Looking further in the StartItem.scala file, we see that all the classes that form this semantic model follow this hierarchy. The complete code of the StartItem.scala file looks like:

abstract sealed class StartItem(val variable:String)

abstract class RelationshipStartItem(varName:String) extends StartItem(varName)

abstract class NodeStartItem(varName:String) extends StartItem(varName)

case class RelationshipById(varName:String, id: Long*) extends RelationshipStartItem(varName)

case class NodeByIndex(varName:String, idxName: String, key:String, value: Any) extends NodeStartItem(varName)

case class NodeByIndexQuery(varName:String, idxName: String, query: Any) extends NodeStartItem(varName)

case class RelationshipByIndex(varName:String, idxName: String, key:String, value: Any) extends RelationshipStartItem(varName)

case class NodeById(varName:String, id: Long*) extends NodeStartItem(varName)

This complete listing makes it easier to see the hierarchy between the classes involved in the AST definition.
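One practical consequence of the hierarchy being sealed is that later stages can pattern match over start items and the compiler can check the match for exhaustiveness. Here is a sketch with a simplified, hypothetical subset of the model (Item, NodeItem and RelItem are made-up names):

```scala
// Simplified, hypothetical subset of the StartItem-style hierarchy
abstract sealed class Item(val variable: String)
case class NodeItem(varName: String, id: Long*) extends Item(varName)
case class RelItem(varName: String, id: Long*) extends Item(varName)

// Because Item is sealed, the compiler warns if a case is missing here
def describe(item: Item): String = item match {
  case NodeItem(v, ids @ _*) => "nodes " + ids.mkString(",") + " as " + v
  case RelItem(v, ids @ _*)  => "rels " + ids.mkString(",") + " as " + v
}
```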

Back to the StartClause trait

Taking a further look at the other methods inside the StartClause trait, such as nodeByIndex and nodeByIndexQuery, the rules are the same (except for the start method, which I’ll cover next).

The start method follows the same rules as the other ones; however, some so far unexplained things appear in it: the ~> and | combinators.

def start: Parser[Start] = ignoreCase("start") ~> rep1sep(nodeByIds | nodeByIndex | nodeByIndexQuery | relsByIds | relsByIndex, ",") ^^ (Start(_: _*))

The ~> is the selective sequence combinator, which in this case means “ignore what is immediately to the left”: here, the “start” keyword, which doesn’t mean anything in our semantic model. After this, we have a repetition of nodeByIds or nodeByIndex or nodeByIndexQuery and so on, all using a comma separator. Notice that we are combining the five parsers right now (parser combinators, anyone?) in order to create the whole start clause.

The last thing the method does is transform the list of StartItem instances (remember the hierarchy defined in the StartItem.scala file?) into a splat (varargs) and pass them to the Start constructor.

A small clarification for those not used to Scala: the Start(_: _*). What this code does is call the Start class constructor, passing it a splat (did you see the _*?) of _, which in this case represents the list of StartItems.
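In isolation (with a hypothetical stand-in for Cypher’s Start class), the shorthand expands like this:

```scala
// Hypothetical stand-in for Cypher's Start class
case class Start(items: String*)

// The fully spelled-out function: take a list, splat it into varargs
val explicit: List[String] => Start = xs => Start(xs: _*)

// Start(_: _*) is the placeholder shorthand for the same function
val shorthand: List[String] => Start = Start(_: _*)

// Both build the same value: shorthand(List("a", "b")) == Start("a", "b")
```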

And that’s it: the start clause of Cypher, demystified. We now understand how it is composed and how it works, and we are halfway to understanding how the other parts of Cypher are put together. Phew!

1st part’s conclusion

In this first part, I hope it was possible to understand how a small subset of Cypher is defined and how it uses Scala’s parser combinators. But there are a lot of other things to see, such as how the other clauses of Cypher are tied together, how error reporting is done, how the query is executed after parsing, and so on.

Special thanks to @porcelli, for some kind suggestions on this post.