Neo4j’s Cypher internals – Part 2: All clauses, more Scala Parser Combinators and the query entry point

Recalling the first post

In the previous post, I explained what Neo4j is and how graph traversal can be done on Neo4j using the Java API. Next, I introduced Cypher and how it helps write queries to retrieve data from the graph. After introducing Cypher’s syntax, we dissected the Start clause, which is the starting point (duh) for any query written in Cypher. If you haven’t read it, go there, and then come back to read this one.

In this second part, I’ll show the other clauses that exist in Cypher: Match, Where, Return, OrderBy, and Skip and Limit. Some are simple, some are not, and I’ll go into more detail on the clauses that aren’t so trivial. After that, we will take a look at the Cypher query entry point and how the query parsing is unleashed.

Nuff said, let’s get down to business.

The other clauses

At the moment of this writing, Cypher is composed of 6 clauses: Start (seen in the first part of this series), Match, Where, Return, OrderBy and Skip/Limit. Recalling the first post, we used a sample Cypher query to show its syntax, and it also showed all 6 clauses in action. Let’s see it again:

start programmer=(3) match (programmer)-[:PAIRED]->(pair) where pair.age > 30 return pair, count(*) order by age skip 5 limit 10

Since the start clause was already explained, let’s go clause by clause, from left to right, starting with the match clause.

The Match Clause

Probably the most complicated clause. Its implementation may be somewhat confusing at first sight; however, looking at it calmly, it is not that scary. So, let’s go step by step.

First of all, the Match clause is where we define the patterns of paths we want to query information for. These paths can be directed, we can identify them, we can define the relationship types along them, and so on. Taking our sample query, we can highlight only the match part of it:

match (programmer)-[:PAIRED]->(pair)

This is a simple case, where we define a path from the programmer, with the PAIRED relationship pointing to a pair. So far so good; however, this can easily get more complex. For instance, if we don’t care about one side of the relationship, we can omit its identifier:

match (programmer)-[:PAIRED]->()

And if we do care about the relationship, for instance to return it (more about returns later), we can identify it:

match (programmer)-[relationship:PAIRED]->()

It is possible to define several patterns and multiple relationships, such as:

match (programmer)-[:PAIRED]->(pair)<-[:FRIEND]-(rose)

Which in this case matches the programmer that paired with someone who is Rose’s friend.

To see how this grammar is defined in a simple way, let’s check the method definitions in the MatchClause trait:

trait MatchClause extends JavaTokenParsers with Tokens {

  def matching: Parser[(Match, NamedPaths)] = ignoreCase("match") ~> rep1sep(path, ",")

  def path: Parser[Any] = pathSegment | parenPath | noParenPath

  def parenPath: Parser[NamedPath] = identity ~ "=" ~ "(" ~ pathSegment ~ ")"

  def noParenPath: Parser[NamedPath] = identity ~ "=" ~ pathSegment

  def pathSegment: Parser[List[Pattern]] = node ~ rep1(relatedTail)

  def node: Parser[Option[String]] = parensNode | relatedNode

  def parensNode: Parser[Option[String]] = "(" ~> opt(identity) <~ ")"
  def relatedNode: Parser[Option[String]] = opt(identity)

  def relatedTail = opt(" relationshipInfo
  def relationshipInfo = opt(identity) ~ opt("?") ~ opt(":" ~> identity) ~ opt("^" ~ wholeNumber ~ ".." ~ wholeNumber)
}

We can see the entry point of the match clause, defined by the matching method, which expects a match keyword followed by a repetition of path. The path, in turn, is defined by either a pathSegment, a parenPath or a noParenPath.

For those who don’t know Scala very well, an Option means that the value is optional: it may be absent (None), in which case we still have the possibility of providing an alternative value.
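A quick plain-Scala illustration of that (not Cypher code), just to make the Option idea concrete:

val named: Option[String] = Some("programmer")   // a value that is present
val unnamed: Option[String] = None               // a value that is absent

named.getOrElse("anonymous")    // "programmer"
unnamed.getOrElse("anonymous")  // "anonymous" -- the alternative value
unnamed.isDefined               // false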

The parenPath and noParenPath are easy: they are paths defined with and without parenthesis, respectively. The pathSegment is defined by a node followed by a repetition of relatedTail. The node itself is defined by optional parenthesis and an optional identifier.

The relatedTail parser is where the relationship itself is defined: an optional incoming (&lt;) direction, a mandatory - (dash), an optional relationshipInfo, which must appear inside brackets, followed by another dash, an optional outgoing (&gt;) direction and, finally, a node definition.

Finishing the grammar definition, relationshipInfo is defined by an optional relationship identifier (in case you want to refer to it later, for instance, to return it), followed by an also optional question mark, which marks the relationship as optional. Next, you can define the relationship type you want to match, followed by an optional ^ (circumflex) sign with two numbers separated by “..”, which defines the number of hops that can be matched, for instance ^1..4 (from 1 to 4 hops).

Phew, a huge syntax means… huge transformations into an AST. Let’s get to it, starting from the bottom and working up to the top.

First, let’s take a look at the relationshipInfo method’s definition:

  def relationshipInfo: Parser[(Option[String], Option[String], Option[(Int, Int)], Boolean)] = opt(identity) ~ opt("?") ~ opt(":" ~> identity) ~ opt("^" ~ wholeNumber ~ ".." ~ wholeNumber) ^^ {

    case relName ~ optional ~ Some(relType) ~ None => (relName, Some(relType), None, optional.isDefined)

    case relName ~ optional ~ Some(relType) ~ Some("^" ~ minHops ~ ".." ~ maxHops) => (relName, Some(relType), Some((minHops.toInt, maxHops.toInt)), optional.isDefined)

    case relName ~ optional ~ None ~ Some("^" ~ minHops ~ ".." ~ maxHops)  => (relName, None, Some((minHops.toInt, maxHops.toInt)), optional.isDefined)

    case relName ~ optional ~ None ~ None => (relName, None, None, optional.isDefined)
  }

The method returns a parser of a tuple containing 4 pieces of information, which are retrieved through the transformation applied on the input. The first case checks that the relationship name was passed, that the question mark and the relationship type were informed, and that the hops part of the match was omitted. If so, it builds a tuple containing all this information. The other 3 cases are variations of the first one, checking for the presence or absence of each piece of information and transforming them into the same Scala tuple.
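To make these tuples concrete, here is a small self-contained sketch that mimics relationshipInfo using only the standard library. It is not the real Neo4j code: identity comes from Cypher’s Tokens trait, so JavaTokenParsers’ ident stands in for it here, and the four cases are collapsed into one for brevity.

import scala.util.parsing.combinator.JavaTokenParsers

// Hypothetical stand-in for Cypher's relationshipInfo parser, not the Neo4j source.
object RelationshipInfoSketch extends JavaTokenParsers {
  def relationshipInfo: Parser[(Option[String], Option[String], Option[(Int, Int)], Boolean)] =
    opt(ident) ~ opt("?") ~ opt(":" ~> ident) ~ opt("^" ~ wholeNumber ~ ".." ~ wholeNumber) ^^ {
      case relName ~ optional ~ relType ~ hops =>
        // turn the optional "^min..max" chunk into an Option[(Int, Int)]
        val varLength = hops.map { case "^" ~ min ~ ".." ~ max => (min.toInt, max.toInt) }
        (relName, relType, varLength, optional.isDefined)
    }

  def main(args: Array[String]): Unit = {
    println(parseAll(relationshipInfo, "r?:PAIRED"))     // parses to (Some(r),Some(PAIRED),None,true)
    println(parseAll(relationshipInfo, ":PAIRED^1..4"))  // parses to (None,Some(PAIRED),Some((1,4)),false)
  }
}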

The next one is the relatedTail method, which defines the relationship direction and the right-side node. This parser returns a tuple with information about the direction plus everything produced by the relationshipInfo parser above.

  def relatedTail = opt(" relationshipInfo     case back ~ "-" ~ relInfo ~ "-" ~ forward ~ end => relInfo match {
      case Some((relName, relType, varLength, optional)) => (back, relName, relType, forward, end, varLength, optional)
      case None => (back, None, None, forward, end, None, false)
    }
  }

The node itself is defined by the following parsers:

  def node: Parser[Option[String]] = parensNode | relatedNode

  def parensNode: Parser[Option[String]] = "(" ~> opt(identity) <~ ")"

  def relatedNode: Parser[Option[String]] = opt(identity) ^^ {
    case None => throw new SyntaxException("Matching nodes without identifiers have to have parenthesis: ()")
    case x => x
  }

In the code above, we can see that a node can be defined with or without parenthesis. However, when not caring about the node, and thus not identifying it, you must at least use the parenthesis; otherwise, a SyntaxException will be thrown.

Now we can take a look at how the paths are transformed. In the path definition, we saw that it can be composed of either a path with parenthesis, a path without parenthesis, or a plain pathSegment.

  def path: Parser[Any] = pathSegment | parenPath | noParenPath

  def parenPath: Parser[NamedPath] = identity ~ "=" ~ "(" ~ pathSegment ~ ")" ^^ {
    case p ~ "=" ~ "(" ~ pathSegment ~ ")" => NamedPath(p, pathSegment: _*)
  }

  def noParenPath: Parser[NamedPath] = identity ~ "=" ~ pathSegment ^^ {
    case p ~ "=" ~ pathSegment => NamedPath(p, pathSegment: _*)
  }

Looking at the code, both the parenPath and noParenPath parsers do simple transformations to create NamedPath objects with the identity (the p variable) and the pathSegment, which contains all the information of the path.

The pathSegment parser is a little more complicated: it parses a node followed by a repetition of relatedTail (already defined above).

  def pathSegment: Parser[List[Pattern]] = node ~ rep1(relatedTail) ^^ {
    case head ~ tails => {
      var fromNode = namer.name(head)

      val list = tails.map(_ match {
        case (back, rel, relType, forward, end, varLength, optional) => {
          val toNode = namer.name(end)
          val dir = getDirection(back, forward)

          val result: Pattern = varLength match {
            case None => RelatedTo(fromNode, toNode, namer.name(rel), relType, dir, optional)
            case Some((minHops, maxHops)) => VarLengthRelatedTo(namer.name(None), fromNode, toNode, minHops, maxHops, relType, dir, optional)
          }

          fromNode = toNode

          result
        }
      })

      list
    }
  }

The first thing is to match the node part and the relatedTail, interestingly identified as head and tails, respectively. Then it resolves the name of the node. However, remember that a node can be unnamed, in which case a little trick is done inside the NodeNamer’s name method:

  class NodeNamer {
    var lastNodeNumber = 0

    def name(s: Option[String]): String = s match {
      case None => {
        lastNodeNumber += 1
        "  UNNAMED" + lastNodeNumber
      }
      case Some(x) => x
    }
  }

The NodeNamer class keeps a counter for the unnamed nodes, so it can name them uniquely, and if the node is already named, it simply returns the given name. A small curiosity here: one might explicitly identify a node as UNNAMED1 and also leave another node unidentified in the same query, which would result in two nodes being called UNNAMED1 ;).
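A quick usage sketch of the NodeNamer shown above (the leading spaces in the generated names come from the code itself):

val namer = new NodeNamer
namer.name(Some("programmer"))  // "programmer"  -- already named, kept as-is
namer.name(None)                // "  UNNAMED1"  -- generated name for the first unnamed node
namer.name(None)                // "  UNNAMED2"  -- the counter keeps generated names unique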

Continuing with the pathSegment method, inside the map over tails each related tail is transformed into a Pattern (the result val). These results are collected by the map method into the list val, which is returned at the end.

All of these operations are unleashed by the matching method, which transforms all the previously parsed pieces into a tuple of Match and NamedPaths:


  def matching: Parser[(Match, NamedPaths)] = ignoreCase("match") ~> rep1sep(path, ",") ^^ {
    case matching => {
      val unamedPaths: List[Pattern] = matching.filter(_.isInstanceOf[List[Pattern]]).map(_.asInstanceOf[List[Pattern]]).flatten
      val namedPaths: List[NamedPath] = matching.filter(_.isInstanceOf[NamedPath]).map(_.asInstanceOf[NamedPath])

      (Match(unamedPaths: _*), NamedPaths(namedPaths: _*))
    }
  }
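The filtering trick above deserves a closer look: rep1sep(path, ",") yields a mixed list, since path can produce either a List[Pattern] or a NamedPath, and isInstanceOf/asInstanceOf split it back apart. A tiny self-contained illustration of the same idea (NamedPath here is a stand-in case class, not Cypher’s):

// Stand-ins, just to show the splitting technique used by the matching method above.
case class NamedPath(name: String)

val matching: List[Any] = List(List("patternA", "patternB"), NamedPath("p"))

val unnamedPaths = matching.filter(_.isInstanceOf[List[_]]).map(_.asInstanceOf[List[String]]).flatten
val namedPaths   = matching.filter(_.isInstanceOf[NamedPath]).map(_.asInstanceOf[NamedPath])
// unnamedPaths == List("patternA", "patternB")
// namedPaths   == List(NamedPath("p"))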

Phew… a long one! Tired? C’mon, there are more clauses to tackle, but thankfully, the most extensive one is over.

The Where Clause

Ahh, the where clause. With its long 3 lines of code, it seems very simple, and that can fool us. Take a look:


trait WhereClause extends JavaTokenParsers with Tokens with Clauses {
  def where: Parser[Clause] = ignoreCase("where") ~> clause
}

That’s it. But digging deeper, what the heck is clause? In fact, it is a definition that comes from the Clauses trait, which is almost self explanatory, so let’s check just the main clause:


  def clause: Parser[Clause] = (orderedComparison | not | notEquals | equals | regexp | hasProperty | parens | sequenceClause) * (
    ignoreCase("and") ^^^ { (a: Clause, b: Clause) => And(a, b)  } |
    ignoreCase("or") ^^^  { (a: Clause, b: Clause) => Or(a, b) }
    )

It accepts the possibilities for the where clause, such as >, >=, not equals (which can be written as either <> or !=) and so on. The * combinator means that the left side will be interleaved with the right side (the and/or), and the right side determines how the transformation is made; in this case, into And or Or objects holding both the left and right side clauses. By the way, the ^^^ transformation operator means that the successful result is replaced by whatever follows it, in this case, the And or Or objects.
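If the * combinator feels like magic, the following self-contained sketch (an arithmetic toy, not Cypher’s code) shows the same pattern: a parser repeated with a separator whose ^^^ result is a function that folds the left and right results together, just like And/Or above:

import scala.util.parsing.combinator.JavaTokenParsers

object ChainSketch extends JavaTokenParsers {
  sealed trait Expr
  case class Num(n: Int) extends Expr
  case class Plus(a: Expr, b: Expr) extends Expr

  def num: Parser[Expr] = wholeNumber ^^ { s => Num(s.toInt) }

  // `num` interleaved with "+"; the ^^^ replaces the parsed "+" with a combining function,
  // so the results fold left-to-right into nested Plus objects.
  def expr: Parser[Expr] = num * ("+" ^^^ { (a: Expr, b: Expr) => Plus(a, b) })

  def main(args: Array[String]): Unit =
    println(parseAll(expr, "1 + 2 + 3"))  // parses to Plus(Plus(Num(1),Num(2)),Num(3))
}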

You can take a look at the Clauses trait in the repository for more details on the syntax. There is nothing new in that trait, so, by now, it might be a good exercise to read it.

The Return Clause

How does Cypher parse what we want our query to return? The answer to this question is in the ReturnClause trait. It is a simple parser, without tricks, so let’s get into the trait:

trait ReturnClause extends JavaTokenParsers with Tokens with ReturnItems {

  def returns: Parser[(Return, Option[Aggregation])] = ignoreCase("return") ~> opt("distinct") ~ rep1sep((aggregate | returnItem), ",")

The returns method is defined by a return keyword, followed by an optional distinct and a repetition of either an aggregate or a returnItem. Both are defined in the ReturnItems trait, which declares the aggregate functions, such as min(), max(), avg() and so on, and the possible returnValues, such as an entity, a value or a nullableProperty.

trait ReturnItems extends JavaTokenParsers with Tokens with Values {
  def returnItem: Parser[ReturnItem] = returnValues ^^ {
    case value => ValueReturnItem(value)
  }

  def returnValues: Parser[Value] = nullableProperty | value | entityValue

  def countFunc: Parser[AggregationItem] = ignoreCase("count") ~> parens(returnValues) ^^ { case inner => ValueAggregationItem(Count(inner)) }
  def sumFunc: Parser[AggregationItem] = ignoreCase("sum") ~> parens(returnValues) ^^ { case inner => ValueAggregationItem(Sum(inner)) }
  def minFunc: Parser[AggregationItem] = ignoreCase("min") ~> parens(returnValues) ^^ { case inner => ValueAggregationItem(Min(inner)) }
  def maxFunc: Parser[AggregationItem] = ignoreCase("max") ~> parens(returnValues) ^^ { case inner => ValueAggregationItem(Max(inner)) }
  def avgFunc: Parser[AggregationItem] = ignoreCase("avg") ~> parens(returnValues) ^^ { case inner => ValueAggregationItem(Avg(inner)) }
  def countStar: Parser[AggregationItem] = ignoreCase("count") ~> parens("*") ^^ { case "*" => CountStar()  }

  def aggregate:Parser[AggregationItem] = (countStar | countFunc | sumFunc | minFunc | maxFunc | avgFunc)
}

Getting back to the ReturnClause trait, let’s take a look at the complete code and examine the transformation:

  def returns: Parser[(Return, Option[Aggregation])] = ignoreCase("return") ~> opt("distinct") ~ rep1sep((aggregate | returnItem), ",") ^^
    { case distinct ~ items => {
      val list = items.filter(_.isInstanceOf[AggregationItem]).map(_.asInstanceOf[AggregationItem])

      val none: Option[Aggregation] = distinct match {
        case Some(x) => Some(Aggregation())
        case None => None
      }

      (
        Return(items.filter(!_.isInstanceOf[AggregationItem]): _*),
        list match {
          case List() => none
          case _ => Some(Aggregation(list : _*))
        }
      )
    }}

The transformation creates a list of AggregationItems and assigns it to the list val. Then it matches against the distinct part: if it was informed, it assigns Some(Aggregation()) to the none val; otherwise, it assigns None, which is the object that represents no value in Scala.

Then it prepares the result produced by the parser, which must be a tuple of a Return object and an Option[Aggregation] indicating whether distinct was set.

Regarding the Return, it takes all items defined to be returned and keeps only those that aren’t aggregation items. All the aggregation items are stored in the list val, which is then matched: in case of an empty list, it just returns the result of the distinct check; otherwise, it takes all the AggregationItems and stores them inside an Aggregation object to be returned within the tuple. The result is a (Return, Option[Aggregation]) tuple.

The OrderBy Clause

We can always define how the results will be ordered and by which values. Parsing the order by part of the query is the job of the OrderByClause trait, which defines the order method.

def order: Parser[Sort] = ignoreCase("order by")  ~> rep1sep(sortItem, ",") ^^
    {
      case items => Sort(items:_*)
    }

Nothing new here, therefore, let’s check the sortItem definition.

  def sortItem :Parser[SortItem] = (aggregate | returnItem) ~ ascOrDesc ^^ {
    case returnItem ~ reverse => {
      returnItem match {
        case ValueReturnItem(EntityValue(_)) => throw new SyntaxException("Cannot ORDER BY on nodes or relationships")
        case _ => SortItem(returnItem, reverse)
      }
    }
  }

The aggregate is where we can use a sum() or max() function, while a returnItem can be a nullableProperty or an identifier; after it, ascOrDesc verifies whether the ordering must be asc or desc.

In case the returnItem is a node or a relationship, a SyntaxException is thrown, since it doesn’t make sense to order by anything that is not an aggregate or a property.

In the end, a Sort object with the items the sorting will be performed on is added to the AST.

The Skip and Limit Clause

A simple one to finish this combo. The SkipLimitClause is the simplest clause amongst all of them. It’s composed of two definitions, skip and limit:

trait SkipLimitClause extends JavaTokenParsers with Tokens {
  def skip: Parser[Int] = ignoreCase("skip") ~> positiveNumber ^^ { case startAt => startAt.toInt }

  def limit: Parser[Int] = ignoreCase("limit") ~> positiveNumber ^^ { case count => count.toInt }
}

Taking a look at the code, we can see that in both definitions the keywords (skip and limit) are matched and discarded, and a positive number is demanded. The transformation only converts the result, which is a String, to an Int.
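ignoreCase and positiveNumber come from Cypher’s Tokens trait, which we haven’t looked at here; a self-contained sketch of the same idea, with assumed stand-ins for both, would look like this:

import scala.util.parsing.combinator.JavaTokenParsers

// Stand-alone sketch of the SkipLimitClause idea; ignoreCase and positiveNumber are
// hypothetical stand-ins for the definitions in Cypher's Tokens trait.
object SkipLimitSketch extends JavaTokenParsers {
  def ignoreCase(keyword: String): Parser[String] = ("(?i)" + java.util.regex.Pattern.quote(keyword)).r
  def positiveNumber: Parser[String] = """\d+""".r

  // `~>` throws away the keyword on the left, keeping only the number, which is then
  // converted from String to Int by the transformation.
  def skip: Parser[Int] = ignoreCase("skip") ~> positiveNumber ^^ { startAt => startAt.toInt }
  def limit: Parser[Int] = ignoreCase("limit") ~> positiveNumber ^^ { count => count.toInt }

  def main(args: Array[String]): Unit = {
    println(parseAll(skip, "SKIP 5"))     // parses to 5
    println(parseAll(limit, "limit 10"))  // parses to 10
  }
}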

The entry point, where everything starts. Also, where everything ends

Now that we know how each clause is defined, it is possible to put them in sequence: in the query, first comes the Start clause, then the Match clause and so on. This is one of the responsibilities of the CypherParser class, which is defined as:

class CypherParser extends JavaTokenParsers
with StartClause
with MatchClause
with WhereClause
with ReturnClause
with SkipLimitClause
with OrderByClause
with StringExtras {
   // code will come here
}

Note that the CypherParser class mixes in all of the clause traits discussed before, making it possible to use the definitions made in each of them. Continuing with the CypherParser class, we have the query method, which uses the pieces defined before:

def query: Parser[Query] = start ~ opt(matching) ~ opt(where) ~ returns ~ opt(order) ~ opt(skip) ~ opt(limit)

Taking a look at the query method, we can see that the whole query is defined by a mandatory start clause, an optional match clause and where clause, followed by a mandatory return clause and optional order by, skip and limit clauses. All of these small methods were explained before in this post (again, the start was explained in the first post).

Now that the query is defined, we should transform it into an AST, and the way this is done in Cypher is the following:

  def query: Parser[Query] = start ~ opt(matching) ~ opt(where) ~ returns ~ opt(order) ~ opt(skip) ~ opt(limit) ^^ {

    case start ~ matching ~ where ~ returns ~ order ~ skip ~ limit => {
      val slice = (skip,limit) match {
        case (None,None) => None
        case (s,l) => Some(Slice(s,l))
      }

      val (pattern:Option[Match], namedPaths:Option[NamedPaths]) = matching match {
        case Some((p,NamedPaths())) => (Some(p),None)
        case Some((Match(),nP)) => (None,Some(nP))
        case Some((p,nP)) => (Some(p),Some(nP))
        case None => (None,None)
      }

      Query(returns._1, start, pattern, where, returns._2, order, slice, namedPaths)
    }
  }

One step at a time. The first thing done is the pattern matching against the result, checking that everything is in place. Then, some specific matches are done. The first one is against a (skip, limit) tuple: if neither was passed, a None object is assigned to the slice val; otherwise, a Some object containing a Slice with the skip and limit values is assigned to it. The Slice class definition is as simple as:

case class Slice(from: Option[Int], limit: Option[Int])

The second operation, on the matching value, deals with the match clause. It checks whether the match clause was written with patterns, named paths, or both. If both have been used, it creates Some objects containing the pattern and the named paths; if just one of them is present, the unused one becomes None. In the end, both are assigned to a tuple representing the patterns and the named paths used.

Then, a Query object is constructed with all the results of the previously applied parsers. We now have a complete AST, with the Query as its starting point. We can see the Query class definition:

case class Query(returns: Return, start: Start, matching: Option[Match], where: Option[Clause], aggregation: Option[Aggregation],
                 sort: Option[Sort], slice: Option[Slice], namedPaths: Option[NamedPaths])

The only missing piece of the CypherParser class is the parse(queryText: String) method, whose job is simply to delegate the queryText processing to Scala’s Parser Combinators library. In case of success, the result (the Query object) is returned. If it is not successful, an appropriate error message is used to indicate that the parsing process failed. Below is the complete code for the parse method:

  @throws(classOf[SyntaxException])
  def parse(queryText: String): Query = {
    val MissingQuoteError = """`\.' expected but `.' found""".r
    val MissingStartError = """string matching regex `\(\?i\)\\Qstart\\E' expected.*""".r
    val WholeNumberExpected = """string matching regex `\\d\+' expected.*""".r
    val StringExpected = """string matching regex `'\(\[\^'\\p\{Cntrl\}\\\\\]\|\\\\\[\\\\\/bfnrt\]\|\\\\u\[a-fA-F0-9\]\{4\}\)\*'' .*""".r

    parseAll(query, queryText) match {
      case Success(r, q) => r
      case NoSuccess(message, input) => message match {
        case MissingQuoteError() => fail(input, "Probably missing quotes around a string")
        case MissingStartError() => fail(input, "Missing START clause")
        case WholeNumberExpected() => fail(input, "Whole number expected")
        case StringExpected() => fail(input, "String literal expected")
        case "string matching regex `-?\\d+' expected but `)' found" => fail(input, "Last element of list must be a value")
        case "string matching regex `(?i)\\Qreturn\\E' expected but end of source found" => throw new SyntaxException("Missing RETURN clause")
        case _ => throw new SyntaxException(message)
      }
    }
  }

Now we have a complete parser, which parses the whole query and generates an AST, or an error message if something went wrong during the parsing process.
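As a final, hypothetical usage sketch (the package the CypherParser class lives in has moved between Neo4j versions, so the import is omitted on purpose):

// Assuming the CypherParser described above is on the classpath:
val parser = new CypherParser
val ast: Query = parser.parse("start programmer=(3) match (programmer)-[:PAIRED]->(pair) return pair")
// `ast` is the Query case class shown above; a malformed query would instead make
// parse throw a SyntaxException carrying one of the friendlier messages we just saw.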

Coming up next

After the parsing process has happened successfully, we have the whole AST built, which means it is now possible to take some action over it. However, we are not sure that the generated AST is what we were expecting. We are not sure that the parser works correctly (it works, but just pretend for now that we are not sure ;) ). In the next part, I’ll cover how the unit testing of the AST was done and how the Neo4j guys did an amazing job generating documentation through the tests.
