Monday, May 3, 2010

A Simple DSL: Tokenization to Evaluation

The purpose of this post is twofold: first, to present a bit of code; and second, and perhaps more important, to give a shout out to one of my favorite F# blogs, Ashley Feniello’s “Code Monkey Have Fun”.

In particular, his series on Scheme in F#. It’s full of good code and wisdom, and well worth reading even if you’re not interested in the Scheme language per se.

Not being fluent in compiler terminology, I’m not sure how to describe today’s post, but here goes:

A lot of times, I find myself writing domain specific languages (DSLs) and file formats that are too complex for simple lines of data, but too simple to require full-blown parsers, parse trees, and compilers. So I usually end up with something that can go directly from tokenization to evaluation. The evaluation usually happens once, creating some data structure that I’m actually going to use at runtime. Another feature I’ve found common to such situations is that the DSL is conveniently specified using a delimited prefix syntax (e.g. like LISP or Scheme), but is easily evaluated using a stack-based interpreter (e.g. like FORTH).
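To make the connection concrete, here is a rough sketch of one expression written in both styles:

  Scheme-style prefix:   (+ (* 1 2) (* 3 4))
  FORTH-style postfix:   1 2 * 3 4 * +

In the postfix form the operands arrive before their operators, so a stack machine can evaluate it simply by pushing numbers and applying each operator as it appears; reversing the prefix token list (which is what the tokenizer below does) gets you most of the way to that form.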

This post presents a small example of the above process. It implements a tiny interpreter which tokenizes and evaluates LISP-like prefix expressions involving integers, addition, and multiplication. For example, (+ (* 1 2) (* 3 4)) evaluates to a data element containing the number 14. To make clear what is happening, be sure to examine the value of tokenList before it is evaluated.
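For reference (roughly speaking, and using the Token type defined in the code below), tokenizing that expression should yield the reversed token list

  [Stop; Stop; N 4; N 3; Times; Stop; N 2; N 1; Times; Plus]

which the stack-based evaluator then reduces to [N 14].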

This simple example is not particularly useful; in actual use, both the expression content and the result would usually be something more complex. For example, the expression might represent some behavior of an NPC in a video game, and the evaluated content would be the data structures necessary to implement that behavior as the game is running.

In any event, I started with Ashley Feniello’s Scheme tokenizer, modifying it to produce a postfix (reversed) list of tokens representing the prefix expression, and then wrote a stack-based evaluator to do the calculation. I simplified and modified Ashley’s tokenizer extensively, and (given that the original is very good) most likely not for the better; I take full responsibility for any bugs or bad idioms that may have crept in.

I never cease to be amazed at how much one can accomplish with so little in F#.

This weekend my basement participated, in its own small way, in the flooding in the east-central United States. The basement is now dry (or at least merely damp), but I fear my concentration may have suffered from a near-sleepless night spent helping bail water and babysit the sump pump. So…

As always, presented "as-is" and without warranty or implied fitness of any kind; use at your own risk.

open System

// This demonstrates how to combine a simple tokenizer
// with a simple stack-based evaluator. That is,
// it goes directly from tokenization to evaluation
// without building a parse tree.
// This technique is suitable for some
// domain-specific languages.

// This little tokenizer is a gross simplification of
// Ashley Feniello's Scheme tokenizer posted 2010.01.15 at
// http://blogs.msdn.com/ashleyf/default.aspx
//
// My version recognizes only:
// whitespace, '(', ')', '+', '*', and integers.
//
// Unlike Ashley's tokenizer, mine returns the token list
// reversed, since that is what my evaluator expects.
//
// Please read Ashley Feniello's entire series on
// Scheme in F#! Trust me, you'll be glad you did.
// It demonstrates a more complete Scheme
// implementation in F# that could
// be used either as an embedded Scheme evaluator
// or modified to implement other
// domain-specific languages.

type Token =
| Plus // Add.
| Times // Multiply.
| N of int // A number.
| Stop // Stack segment delimiter.
// Note: delimiter is not needed if arity is fixed.
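// (For example, if + and * were strictly binary, the evaluator
// could simply pop exactly two values per operator and no Stop
// marker would be needed; the variable-arity evaluator below
// instead folds values until it reaches a Stop.)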


let tokenize source =
    // Accumulate the digits of a non-negative integer,
    // returning the number and the rest of the input.
    let rec number a l =
        match l with
        | (')'::_) as t -> a, t
        | w::t when Char.IsWhiteSpace(w) -> a, t
        | [] -> a, []
        | h::t when Char.IsDigit(h) ->
            number ((a*10)+(Int32.Parse(h.ToString()))) t
        | _ -> failwith "Unexpected character."
    // Walk the character list, consing tokens onto the
    // accumulator (which reverses their order).
    let rec tokenize' a = function
        | w::t when Char.IsWhiteSpace(w) -> tokenize' a t
        | '('::t -> tokenize' a t
        | ')'::t -> tokenize' (Stop::a) t
        | '+'::t -> tokenize' (Plus::a) t
        | '*'::t -> tokenize' (Times::a) t
        | h::t when Char.IsDigit(h) ->
            let n,t' = number (Int32.Parse(h.ToString())) t
            tokenize' (N(n)::a) t'
        | [] -> a
        | _ -> failwith "Unexpected character."
    tokenize' [] source
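
// A rough illustration of the reversed output, e.g.:
//
//   tokenize (List.ofSeq "(+ 1 2)")
//   // = [Stop; N 2; N 1; Plus]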


// A simple stack-based evaluator.
// In a real application this might evaluate
// a domain-specific language to compute some
// value or generate data.

let eval l =
    // Fold the operator f over values on the stack until a
    // Stop delimiter is reached, replacing them with the result.
    let rec func (f:int->int->int) a = function
        | [] -> failwith "Unexpected end."
        | h::t ->
            match h with
            | N(n) -> func f (f a n) t
            | Stop -> N(a)::t
            | _ -> failwith "Unexpected token."
    // Walk the reversed token list, pushing numbers and Stops
    // and applying each operator as it is encountered.
    let rec eval' a = function
        | [] -> a
        | h::t ->
            match h with
            | Plus -> eval' (func (+) 0 a) t
            | Times -> eval' (func (*) 1 a) t
            | _ -> eval' (h::a) t
    eval' [] l
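
// An illustrative trace, evaluating the reversed token list
// for "(+ 1 2)", i.e. [Stop; N 2; N 1; Plus]:
//
//   Stop  -> pushed; stack is [Stop]
//   N 2   -> pushed; stack is [N 2; Stop]
//   N 1   -> pushed; stack is [N 1; N 2; Stop]
//   Plus  -> fold (+) from 0 down to the Stop; stack is [N 3]
//
// So eval (tokenize (List.ofSeq "(+ 1 2)")) should yield [N 3].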

let tokenList = tokenize (List.ofSeq "(+ (* 1 2) (* 3 4))")
let evalList = eval tokenList // Should be [ N(14) ].

printfn "All done."

System.Console.ReadLine() |> ignore

2 comments:

AshleyF said...

Hey, thanks much for the "shout out" :-) TechNeilogy is an awesome blog!
I see you're a Go player as well; nice. Funny, I remember Bill Hail's mentioning in his book intro, "...if C is the Chess of languages, then Scheme is more like Go."

TechNeilogy said...

“Go uses the most elemental materials and concepts -- line and circle, wood and stone, black and white -- combining them with simple rules to generate subtle strategies and complex tactics that stagger the imagination.” --Iwamoto Kaoru

Go is to programming as programming is to Go.

Thanks for your comment! Over a year ago, when I first heard about F#, my first response was “Why didn’t Microsoft just create Visual Scheme?” Since then, however, my attitude has changed; F# is a wonderful blend of the pure and the practical. But as your Scheme in F# demonstrates, it is possible to have both! Keep up the good work.