Monday, April 14, 2008

External DSLs made easy with Scala Parser Combinators

External DSLs are hard, since implementing them involves reinventing most of the mechanisms found in a general purpose language. Designing internal DSLs is equally hard, more so in a statically typed language. Dynamically typed languages like Ruby offer strong metaprogramming facilities, which help in implementing internal DSLs. But metaprogramming in Ruby is still considered elitist by many, and is not an art mastered by programmers at large.

Parser combinators offer unique value here. They allow programmers to write executable grammars, in the sense that designing and implementing a DSL is almost equivalent to writing its EBNF productions in the syntax of the host language. So what really are parser combinators, and what kind of language support do we need to implement parser combinator libraries? Here is how Gilad Bracha describes them ..
The basic idea is to view the operators of BNF (or regular expressions for that matter) as methods that operate on objects representing productions of a grammar. Each such object is a parser that accepts the language specified by a particular production. The results of the method invocations are also such parsers. The operations are called combinators for rather esoteric technical reasons (and to intimidate unwashed illiterates wherever they might lurk).

Combinators have their theoretical underpinnings in functional programming. A parser combinator is a higher order function that accepts one or more parsers and composes them into a more complex parser. Hence parser combinators are easily implemented in languages that have strong support for functional programming. In Scala, parsers are implemented as monads - defining combinators for parsers is then just a matter of writing monadic transformations that implement sequencing, alternation and other composition operations.
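
To make this concrete, here is a minimal hand-rolled sketch - my own toy version for illustration, not the actual implementation in the Scala library - of parsers as functions and combinators as higher order functions over them ..

// a toy model, NOT scala.util.parsing.combinator:
// a parser is a function from input to an optional (result, remaining input)
object MiniParser {
  type Input = List[Char]
  type Parser[A] = Input => Option[(A, Input)]

  // a primitive parser recognizing one expected character
  def char(c: Char): Parser[Char] = {
    case `c` :: rest => Some((c, rest))
    case _           => None
  }

  // sequencing: run p, then run q on whatever input p left over
  def seq[A, B](p: Parser[A], q: Parser[B]): Parser[(A, B)] =
    in => for {
      (a, rest1) <- p(in)
      (b, rest2) <- q(rest1)
    } yield ((a, b), rest2)

  // alternation: try p, fall back to q on failure
  def alt[A](p: Parser[A], q: Parser[A]): Parser[A] =
    in => p(in) orElse q(in)

  // seq(char('a'), char('b'))("abc".toList) == Some((('a','b'), List('c')))
}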

This post is not introductory material on parser combinators or their implementation in Scala. Here I would like to narrate my experience in designing an external DSL for a financial application using the parser combinator library of Scala. In one of my earlier posts, I had talked about monads as abstractions for containers and computations. The parser combinator implementation in Scala is a great example of the power of monads in evolving a DSL.

In developing a financial application involving securities trade and settlement processing, we've been using XML as the DSL for getting buy/sell orders from clients, trade/execution information from the exchange and settlement information from clearing agencies. Needless to say, XML processing is one of the key functions that pervade our codebase. In one of my very early posts, I had ranted about executable XMLs (aka Lisp) and had considered using Scheme as the DSL for securities trade processing operations. SISC offers a fully compliant Scheme implementation on top of the JVM, but all those ideas fell through the cracks of enterprise decision making deliberations. After a fairly long time, I have found another alternative - DSLs designed using Scala parser combinators ..

  • easy to implement

  • as concise as your EBNF productions

  • algebraic data types to generate ASTs and

  • powerful pattern matching techniques to inspect them.


Add to that the fact that I can have the entire stack running on the JVM, with Java objects still running the show at the backend. This implies that I do not have to reimplement my current Java application. I can just plug in the DSL and have the parser cook up my Java objects at the AST level. And that is exactly what I plan to do here.

Here is a sample DSL (simplified for brevity) for accepting client orders to buy/sell equities ..

(buy 100 IBM shares at max 45, sell 50 CISCO shares at min 25, buy 100 Google shares at max 800) for trading account "SSS1234"

The equivalent XML would be too verbose, too painful on the eyes, and would definitely need extra infrastructure beyond native language support for meaningful processing.

And here is the Scala parser that recognizes the DSL ..


import scala.util.parsing.combinator.syntactical._
object OrderDSL extends StandardTokenParsers {
  // the lexer needs to know the delimiters and the keywords of the DSL
  lexical.delimiters ++= List("(", ")", ",")
  lexical.reserved += ("buy", "sell", "shares", "at", "max", "min", "for", "trading", "account")

  // each production below corresponds to one EBNF rule

  def instr = trans ~ account_spec

  def trans = "(" ~> repsep(trans_spec, ",") <~ ")"

  def trans_spec = buy_sell ~ buy_sell_instr

  def account_spec = "for" ~> "trading" ~> "account" ~> stringLit

  def buy_sell = ("buy" | "sell")

  def buy_sell_instr = security_spec ~ price_spec

  def security_spec = numericLit ~ ident ~ "shares"

  def price_spec = "at" ~ ("min" | "max") ~ numericLit
}



This is really all that I need to parse my DSL. Really. And the most interesting part is that the methods above have an almost one-to-one correspondence to EBNF production rules, as I would naturally write them. All the heavy lifting of lexical analysis and parsing is taken care of by the Scala parser combinator library.
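
For comparison, here is (roughly) the EBNF that the above methods transcribe ..

instr          ::= trans account_spec
trans          ::= "(" trans_spec { "," trans_spec } ")"
trans_spec     ::= buy_sell buy_sell_instr
account_spec   ::= "for" "trading" "account" string
buy_sell       ::= "buy" | "sell"
buy_sell_instr ::= security_spec price_spec
security_spec  ::= number identifier "shares"
price_spec     ::= "at" ("min" | "max") number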

The combinators used in the above example look like operators, though they are actually Scala methods. Every combinator method works on a portion of the input, parses it and may optionally pass on the remaining part to the next combinator in the chain. e.g. the sequencing combinator ~ composes two parsers sequentially. In the above example, for the first production, trans ~ account_spec succeeds only if trans succeeds and then account_spec succeeds on the portion of the input left over by trans. The final result is another parser, on which the optional function application combinator (^^) can work, applying a function to the result of the sequencing combinator.
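
As a trivial sketch of ^^ at work (a hypothetical quantity production, written in the context of the OrderDSL object above) ..

// numericLit yields the matched token as a String; ^^ maps it to an Int
def quantity: Parser[Int] = numericLit ^^ (_.toInt)

// sequencing first, then transforming the combined result
def priced_security: Parser[(String, Int)] =
  ident ~ numericLit ^^ { case security ~ price => (security, price.toInt) }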

Once I have the basic parsing productions defined in Scala, I can work towards building my abstract syntax tree (AST), which will accumulate the necessary parsed information and give me the model of the abstraction that the DSL embodies. This model depends on how I would like to process the abstraction defined by the language, and may vary based on the requirements of the backend system which receives the AST. In the above example, I may like to abstract the client order details into a POJO and pass it on to the database layer for persistence. Alternatively, I may like to pass the AST on to a pretty printer function that generates HTML confirming the client order. Hence it is always better if we can decouple the two concerns - recognizing the language, and processing the information to generate the AST. Gilad Bracha talks about similar decoupling in Newspeak using a combination of closures and inheritance. But Newspeak is a dynamic language, and I am not sure if this decoupling can be achieved in a statically typed language like Scala.

Hence, in Scala it is not possible to ensure that multiple backend systems share the same grammar instance while working on different models of the AST. What Scala does offer is the function application combinator (^^), which plugs a processing function into a production rule and executes it on every successful parse of that rule, evolving the AST along the way.

Depending on what processing I would like to do with the AST, I can choose an appropriate data structure. If I choose to perform heavy recursive traversals, tree manipulations and node annotations, the AST can be modeled as Scala algebraic data types, which can then be inspected using pattern matching. In the target application in which I propose to use this, the backend contains a POJO based domain model, and I would like to generate domain objects from the AST to be used for transparent persistence in the data layer. Hence I choose to map the AST to my domain model for processing client orders.
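
Just for illustration, had the backend not mandated POJOs, the same AST could be sketched as Scala algebraic data types (hypothetical names, not the classes I use in the application) ..

object OrderAST {
  sealed abstract class BuySell
  case object Buy extends BuySell
  case object Sell extends BuySell

  case class LineItem(security: String, quantity: Int, bs: BuySell, price: Int)
  case class ClientOrder(accountNo: String, lineItems: List[LineItem])

  // pattern matching over the tree: total quantity bought in an order
  def totalBought(ord: ClientOrder): Int =
    ord.lineItems.foldLeft(0) {
      case (acc, LineItem(_, qty, Buy, _)) => acc + qty
      case (acc, _)                        => acc
    }
}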

Here are some Java classes (simplified for brevity) for abstracting a client order ..


// ClientOrder.java
public class ClientOrder {
  public enum BuySell {
    BUY,
    SELL
  }
  private String accountNo;

  private List<LineItem> lineItems = new ArrayList<LineItem>();

  // constructors, getters ..
}

// LineItem.java
public class LineItem {
  private final String security;
  private final int quantity;
  private final ClientOrder.BuySell bs;
  private final int price;

  public LineItem(String security, int quantity, ClientOrder.BuySell bs, int price) {
    this.security = security;
    this.quantity = quantity;
    this.bs = bs;
    this.price = price;
  }
  //..
  //..
}



I can plug in function application combinators with the production rules and have my AST model a ClientOrder. Remember, I am plugging the DSL into a system based on POJOs - hence I need to do some conversion between Java and Scala collections. But the final AST is a Java object, to be consumed by the existing backend that does client order processing.


import scala.util.parsing.combinator.syntactical._
import org.dg.biz.{ClientOrder, LineItem}

object OrderDSL extends StandardTokenParsers {

  // the backend expects java.util.List - convert the Scala List that
  // repsep produces
  def scala2JavaList(sl: List[LineItem]): java.util.List[LineItem] = {
    val jl = new java.util.ArrayList[LineItem]()
    sl.foreach(jl.add(_))
    jl
  }

  lexical.delimiters ++= List("(", ")", ",")
  lexical.reserved += ("buy", "sell", "shares", "at", "max", "min", "for", "trading", "account")

  def instr: Parser[ClientOrder] =
    trans ~ account_spec ^^ { case t ~ a => new ClientOrder(scala2JavaList(t), a) }

  // repsep already yields the List[LineItem] we need - no ^^ required here
  def trans: Parser[List[LineItem]] =
    "(" ~> repsep(trans_spec, ",") <~ ")"

  // bsi is ((quantity, security), price) - see buy_sell_instr below
  def trans_spec: Parser[LineItem] =
    buy_sell ~ buy_sell_instr ^^ { case bs ~ bsi => new LineItem(bsi._1._2, bsi._1._1, bs, bsi._2) }

  def account_spec: Parser[String] =
    "for" ~> "trading" ~> "account" ~> stringLit

  def buy_sell: Parser[ClientOrder.BuySell] =
    ("buy" | "sell") ^^ { case "buy" => ClientOrder.BuySell.BUY
                          case "sell" => ClientOrder.BuySell.SELL }

  def buy_sell_instr: Parser[((Int, String), Int)] =
    security_spec ~ price_spec ^^ { case s ~ p => (s, p) }

  def security_spec: Parser[(Int, String)] =
    numericLit ~ ident ~ "shares" ^^ { case n ~ a ~ "shares" => (n.toInt, a) }

  def price_spec: Parser[Int] =
    "at" ~ ("min" | "max") ~ numericLit ^^ { case "at" ~ s ~ n => n.toInt }
}



Here is a function within OrderDSL that uses the AST model ..


def doMatch() {
  val dsl =
    "(buy 100 IBM shares at max 45, sell 40 Sun shares at min 24,buy 25 CISCO shares at max 56) for trading account \"A1234\""

  instr(new lexical.Scanner(dsl)) match {
    case Success(ord, _) => processOrder(ord) // ord is a ClientOrder
    case Failure(msg, _) => println(msg)
    case Error(msg, _) => println(msg)
  }
}



The basic idea of polyglotism is to harness the power of multiple languages in their respective areas of strength. Languages like Scala, despite being statically typed, offer lots of flexibility and conciseness. Combining the strong features of both the OO and functional paradigms, Scala shines in providing a parser combinator library straight out of the box. The above example shows how easy it is to get a DSL working if we use the power of combinators. And the best part is that you can still use your existing Java objects to do the heavy backend lifting - truly, it is the single platform of the JVM that unifies the diversity of multiple programming languages.

16 comments:

SaskoM said...

Isn't the do in the last code snippet supposed to be a match, or have I missed something in the Scala language?
Anyway, excellent post, expecting more great stuff from this blog.

And, not really related with the post, I still like the Haskell parser combinators a bit more, they seem to me a just a bit more 'elegant' :-).

Unknown said...

oops! sure it is "match" .. Fixed it ..

Germán said...

For this:

numericLit ~ ident ~ "shares" ^^ { case n ~ a ~ "shares" => (n.toInt, a) }

I think it could be improved like this (though I haven't tested it):

numericLit ~ ident ~ "shares" ^^ { case n ~ a ~ _ => (n.toInt, a) }

...to avoid repeating a literal which can't be any other thing, or even:

(numericLit ~ ident) <~ "shares" ^^ { case n ~ a => (n.toInt, a) }

...to forget about it once you know you have the right construct.
I've been playing around with this stuff and it's really cool.

Unknown said...

@German:
Yes, it can be improved that way. Actually I had taken this example from a much larger DSL and shortened it for brevity. In the original DSL, I had to deal with shares (equities), fixed incomes (fi) and other types of securities. Hence I could not treat that part as a don't-care. But in this context, it makes perfect sense to do what you suggested. Thanks.

Luc Duponcheel said...

very interesting indeed

by the way: you mention that Scala parsers are implemented as monads, but the combinators you use do not need the full power of monads. Parsers can be implemented as applicative functors (or, equivalently, monoidal functors). They are more general than monads [ and, as a consequence, cover more practical cases (e.g. error correcting parsers) ]

by the way the combinators in the wiki you refer to (I, K and S) are also applicative

I enjoyed every bit of this post

Luc

Anonymous said...

Ah, I generally dislike Scala, since it seems to me that they are falling into the trap of being too powerful (like C++), with conflicting and incompatible worldviews in the same code base, but then I come across something sexy like the Parser Combinators library.

Still can't stand operator overloading.

James Iry said...

Anonymous - as an exercise take the parser defined here and replace all the "operators" with words. You might rethink your position on operator overloading when you see the results.

Brian Reilly said...

One thought that I've been having about DSLs is that users will make errors when writing in the language. I think that having domain specific error messages is an important part of a DSL.

I hear people talking about writing DSLs on top of Groovy or Ruby and wonder if that really works. If someone makes an error, the feedback will make sense to someone who knows Groovy/Ruby, but maybe not to someone who is only trained in the DSL. Still, I think people use Groovy/Ruby because it's easier than creating their own parser. What you end up with seems more like a fluent interface than a DSL to me, since the actual language is still Groovy or Ruby.

It looks like using Scala the way you describe makes it as easy to create a true DSL as it would be to create a fluent interface in Groovy/Ruby. Would this technique also allow you to report domain specific error messages, so that users don't have to worry that it's actually implemented with Scala? Or do you think that the work necessary to provide domain specific error messages is roughly the same no matter how the DSL is implemented?

Anonymous said...

@Brian:

There is no requirement anywhere that says that a DSL has to look like written English. There's also no requirement that says a DSL must be so simple that a non-techie can write it (although they should probably be able to read it). If there's some syntax in there, that's ok. The important thing is that it looks natural for the domain in question and the language it's coded in. Also, Groovy and Ruby are not restricted to just using fluent interfaces (although they don't allow you to be quite as free-form as Scala either).

Some people have gone as far as to say that DSLs are really just about writing readable code. I would say it's taking that idea and putting it on steroids.

Brian Reilly said...

@anonymous:

I agree completely. You make a good point. I was assuming that the people writing in the DSL are the domain experts. It's probably often useful enough as a tool for programmers to be able to write in the DSL and have the domain experts be able to read/understand/verify the content.

I'm still curious about the ability to provide domain specific error messages using this technique. No matter who is actually writing in the language, it would make sense to have any error messages speak the same high level language.

Antony Stubbs said...

Is it possible to refactor out all the repeated magic strings?

Unknown said...

This post has been picked up by the Artima forum .. enjoy the discussion ..

Robert Fischer said...

Is there any support for position tracking? Specifically, I'd like to know that this particular token came from line 13, column 23 of file "fooBar.suffix".

Andrew F said...

@Robert_Fischer: Yeah, there is support for position tracking. You need to ensure that your output AST nodes (the output of the parser; the T in Parser[T]) incorporate the trait scala.util.parsing.input.Positional.

That (only) gives your parser output the CAPABILITY to store a line/column (via a member called 'pos'). In order to populate 'pos', there is a method in Parsers called 'positioned'.

If you wrap your parse combinator in 'positioned', it will assign the 'pos' property for you.

e.g., expr = positioned(term~rep('+'~term))

The 'pos' variable can spit out a nice "your error is here" string, but doesn't include the filename, unfortunately. To include filename, you'd need to recreate a similar mechanism for yourself.
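
Putting it together, a minimal (untested) sketch, assuming the OrderDSL setup from the post and a hypothetical Item node:

import scala.util.parsing.input.Positional

// hypothetical AST node mixing in Positional to gain a 'pos' member
case class Item(qty: Int, security: String) extends Positional

// inside OrderDSL: wrapping the production in positioned(...) fills in pos
def item: Parser[Item] =
  positioned(numericLit ~ ident ^^ { case n ~ s => Item(n.toInt, s) })

// later, for diagnostics:
//   println("line " + it.pos.line + ", column " + it.pos.column)
//   println(it.pos.longString) // the offending line with a caret under it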

Berlin Brown Discussions said...

Do you have a download with the source? Sometimes it's hard to read through the blog and cherry-pick the code.