Learning F# — A Simple Parser

- fsharp

A couple of weeks have passed since my last post and I still really enjoy learning F#. So when I found this interesting post called “Emulator basics: a stack and register machine” in which the author builds a virtual machine for running simple C programs, I decided to port the code from EcmaScript.

I won’t go into details regarding the actual VM design and its instruction set, the original post already explains it nicely. Let’s focus on the type of source code we’ll be parsing instead:


  MOV RDI, 1
  MOV RSI, 2
  CALL plus

This program will start at the label main:, add 2 numbers into registers and then call the procedure defined after the label plus: to add them. The result of the addition (3) will be the exit code of this program.

The Parser

I decided to represent the result of the parser as a record type called ParseResult which consists of a list of tuples containing instructions and their optional arguments (e.g. ("add", ["RDI"; "RSI"]) or ("ret", [])) as well as a map which tracks labels and their corresponding line numbers:

type ParseResult =
    { instructions: (string * string list) list
      labels: Map<string, int> }


F#’s active patterns are one of my favorite language features. Like most newcomers to the language I tend to overuse them, though in this specific case they seem like a good fit and a lightweight alternative to parser combinators or similar solutions.

let (|Empty|Directive|LineComment|Label|Instruction|) (line : string) =
    if line = "" then Empty
    else if line.StartsWith(".") then Directive
    else if line.StartsWith(";") || line.StartsWith("#") then LineComment
    else if line.Contains(":") then Label (line.Split(":").[0])
    else Instruction (line.Split([|';'; '#'|]).[0])

Active patterns allow us to define named partitions of data, which can be used in pattern matching expressions just like constructors of discriminated unions. This specific pattern first checks if the line is empty, in which case it returns the Empty identifier. Lines that start with a . character are directives (Directive) and will be ignored, just like full line comments (LineComment) starting with ; or #. The Label pattern matches any line containing a : and will only return the part before the colon (e.g. main: becomes main). All other lines represent instructions and will return their content stripped of comments.


The actual parsing logic is relatively simple:

let parse (source : string) : ParseResult =
    |> Array.fold (fun result (line : string) ->
        match line.Trim() with
        | Directive | Empty | LineComment -> result
        | Label l ->
            let labels = Map.add l result.instructions.Length result.labels
            { result with labels = labels }
        | Instruction i ->
            let instruction = parseInstruction i
            { result with instructions = (instruction :: result.instructions) }
    ) { instructions=[]; labels=Map.empty }
    |> fun result -> { result with instructions = List.rev result.instructions }

The input source is split into lines and then Array.fold is used for pattern matching every line and updating the ParseResult record as needed. Directive, Empty and LineComment have no effect on the program, so the result will be returned unmodified. Label uses a copy and update record expression to add a new value to the map, while Instruction does something similar for the list of tuples, using the following parseInstruction helper function:

let private parseInstruction (instruction : string) =
    let operation = instruction.Split(" ").[0].ToLower()
    let operands =
        |> Array.map (fun (s : string) -> s.Trim())
        |> Array.filter ((<>) "")
        |> Array.toList

    operation, operands

This splits the operation and its operands and then cleans up the latter.

The Result

Calling this with the example program listed above, gives the following output in F# Interactive:

parse source;;
 val it : ParseResult =
   {instructions =
     [("add", ["RDI"; "RSI"]); ("mov", ["RAX"; "RDI"]); ("ret", []);
      ("mov", ["RDI"; "1"]); ("mov", ["RSI"; "2"]); ("call", ["plus"]);
      ("ret", [])];
    labels = map [("main", 3); ("plus", 0)];}


Like the original parser this is brittle. On the other hand it shows off a nice F# feature (active patterns) and the whole code weighs in at only about 30 SLOC. Given that I’m still relatively new to the language this is probably not the nicest piece of code, but it was fast to write and pretty much worked on first try once the types checked out. Stay tuned for the next installment in this series where I’ll tackle the interpreter part of the original post.