parsing any file using parsimonious (PEG parser)

Introduction

Every now and then we are facing the problem of parsing a text file in some weird format. Usually, we default to regex to get the data we need from that file. That works but is not very flexible and hard to extend. One off solutions can be quickly done that way but if we need to parse more data from more complex files than regex becomes too complex pretty quickly.

PEG parsers

Parsing Expression Grammars are designed for just that purpose. Think of them as a regex on steroids.

there seem to be quite a few peg parsers available for python. here is an extensive list:
https://tomassetti.me/parsing-in-python/

I tried a few and settled with parsimonious:
https://github.com/erikrose/parsimonious

the syntax is readable and makes some sense straight away, it doesn’t require any compiled libraries and seems to be small and fast. EDIT: recent update relies on compiled regex library.

how it works

Parsimonious needs a grammar to know how to parse a document. This is the first step in the process of extracting the information. You can write grammar to just parse what you need or everything that is in the file. Depending on what you need.

I’m not going to show a trivial example here (you can find it in the docs) but will explain how I parse a nuke file. Nuke stores the entire node graph in a text file. The most fundamental building block is a single node

nuke node
what I’m interested in are nuke nodes in the file. Below is an example of a (stripped down) nuke Transform node exported from nuke (the actual nodes can get much more complicated than that)

Transform {
 translate {22 33}
 rotate -40
 center {2098 908}
 shutteroffset centred
 name Transform2
}

we can see that the node starts on a new line with the node type and a curly brace followed by attributes and their values in separate lines and ends with closing curly brace. The grammar for it could be something like this:

node = type spaces obrace newline attrib+ cbrace newline

it basically lists all the components expected in the node including white spaces and newlines.”+” is the same as regex and means that there is one or more attribute. So far easy 🙂

next we need to declare what all those components are:

type            = ~"[A-z0-9]+"i
obrace          = "{"
cbrace          = "}"
attrib          = spaces knob spaces value newline
newline         = ~"\n+"

we can use literal strings "{" or regex (starts with ~) or other declared types or a combination of them.

the “type” above is declared as regex and it allows only letters and numbers to be part o the type. “i”- means case insensitive regex.

and we keep going like that breaking the grammar into smaller and smaller chunks.

attrib declares two new grammars:

knob            = ~"[A-z0-9_]+"i
value           = single_value / array

at this point, we getting into the real magic of the system. a value of the attribute can be a single value or an array (nuke uses curly braces for arrays)

Transform {
 translate {22 33}
 rotate -40
 center {2098 908}

both translate and center are arrays with two elements in them.

our declaration of “value” is telling the system that there is a choice. we can declare more choices like A / B / C / D. The order matters here and the first from the choices that match wins.

an array can be parsed with the following grammar:

array           = obrace (value spaces?)+ cbrace

we start with an opening brace then we have some values separated by spaces and a closing brace. “?” means that space may not be there (end of the array).

because we used “value” inside the array and a value can be an array this solves the problem of nested arrays for us. No crazy syntax to get a list of lists of lists 🙂

now let’s look closer at the “single_value”:

single_value    = frame_range / number / curve / str / quoted_str

again a choice of a few possible values.

if we miss here one of the possible values and we try to parse a script that has a value that cannot be parsed by any of those the entire node will not be matched. This is probably the biggest problem with this approach as it might be hard to figure out what tripped the match.

In the end, we end up with a couple of lines of the gramma that can parse any crazy nuke syntax including curves. This is most likely not a complete implementation but works with the nodes I need to parse and can be easily extended.

one             = (node / junk )+
junk            = ~".*" newline
node            = type spaces obrace newline (attrib/empty)+ cbrace newline
type            = ~"[A-z0-9]+"i
obrace          = "{"
cbrace          = "}"
attrib          = spaces knob spaces value newline
empty           = spaces? newline
knob            = ~"[A-z0-9_]+"i
value           = single_value / array
single_value    = frame_range / number / curve / str / quoted_str
frame_range     = &spaces? int "-" int &spaces?
number          = float / int
array           = obrace (value spaces?)+ cbrace
curve           = obrace "curve" spaces (letter spaces)+ keyframe spaces curve_value+ cbrace
keyframe        = "x" int
curve_value     = spaces? number spaces?
int             = &spaces? ~"[-]*[0-9]+" &spaces?
float           = &spaces? ~"[-]*[0-9]*\.[0-9]+" &spaces?
str             = &spaces? ~"[A-Za-z0-9-_></.:+\(\)]+" &spaces?
quoted_str      = &spaces? ~"(\"([^\"]|\n)*\")" &spaces?
letter          = ~"[A-Za-z]"
newline         = ~"\n*"
spaces          = ~"\s+"


the gramar is just a regex on steroids with a lot of groups but the syntax is much more readable.

creating meaningful data structure out of the parsed data


Once the parser is done we end up with some parsed data structure. Each declared element will be in this data structure, meaning we will have entries for all new lines, braces and so on. A bit of a mess to extract the information we want. The good thing is that the system comes with a visitor class that allows easily converting to a desired data structure.

The NodeVisitor requires us to declare visitors for the elements we are interested in and will take care of traversing the entire node tree. In my case I decided to return each node as a dictionary:

{
    type: 'Transform',
    attrs: {
        'translate': [22,33],
        'rotate' : -40,
}

to do that I need to define visitor for “node” defined in my grammar:

def visit_node(self, node, visited_children):
    attrs = dict()
    for a in visited_children[4]:
        a = a[0]
        if a:
            attrs.update(a)
    return {'type': visited_children[0].text, 'attrs': attrs}

the visitor gives me access to all children of the node in the same order they were defined:

node            = type spaces obrace newline (attrib/empty)+ cbrace newline

This is one thing I do not like about the system but I see no way around it. Every time you modify the grammar you need to check if your indices are still correct. In this example to get attributes I want I have to access element 4 in the visited_children.

Because there might be one or more attributes (+) visited_children[4] is a list. Additionally, any grouping with parenthesis creates another list so in this case, we have a list with another list inside. The list created by parenthesis can be ignored hence a = a[0] in the above example code.

element 0 contains the type. The rest are just braces and new lines that can be ignored.

I want my attributes to be dictionaries too (I use update in the above example to collect them all) so we need another visitor:

def visit_attrib(self, node, visited_children):
    return {visited_children[1].text: visited_children[3]}

from the declaration of attrib in the grammar we can see that we only need element 1 (knob) and 3 (value):

attrib          = spaces knob spaces value newline

the entire visitor to get the correct types and drop unnecessary list created by grouping:

def generic_visit(self, node, visited_children):
    return visited_children or node
 
def visit_junk(self, node, visited_children):
    return None
 
def visit_one(self, node, visited_children):
    return [n[0] for n in visited_children if n[0]]  # filter out "junk"
 
def visit_node(self, node, visited_children):
    attrs = dict()
    for a in visited_children[4]:
        a = a[0]
        if a:
            attrs.update(a)
    return {'type': visited_children[0].text, 'attrs': attrs}
 
def visit_empty(self, node, visited_children):
    return None
 
def visit_attrib(self, node, visited_children):
    return {visited_children[1].text: visited_children[3]}
 
def visit_value(self, node, visited_children):
    return visited_children[0]
 
def visit_curve_value(self, node, visited_children):
    return visited_children[1]
 
def visit_number(self, node, visited_children):
    return visited_children[0]
 
def visit_single_value(self, node, visited_children):
    return visited_children[0]
 
def visit_curve(self, node, visited_children):
    return {'key': visited_children[4][1], 'values': visited_children[6]}
 
def visit_int(self, node, visited_children):
    return int(node.text)
 
def visit_float(self, node, visited_children):
    return float(node.text)
 
def visit_array(self, node, visited_children):
    values = [v[0] for v in visited_children[1]]
    return values
 
def visit_str(self, node, visited_children):
    return str(node.text)
 
def visit_quoted_str(self, node, visited_children):
    return str(node.text[1:-1])
 
def visit_frame_range(self, node, visited_children):
    return (visited_children[1], visited_children[3])

In the end, we have a generic nuke parser in under 100 lines of clean code.